Where does Hive store files in HDFS?
I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly.
Where does Hive store its files in HDFS?
Hive tables may not necessarily be stored in a warehouse (since you can create tables located anywhere on the HDFS).
You should use DESCRIBE FORMATTED <table_name>
command.
hive -S -e "describe formatted <table_name> ;" | grep 'Location' | awk '{ print $NF }'
Please note that partitions may be stored in different places and to get the location of the alpha=foo/beta=bar
partition you'd have to add partition(alpha='foo',beta='bar')
after <table_name>
.
The location they are stored on the HDFS is fairly easy to figure out once you know where to look. :)
If you go to http://NAMENODE_MACHINE_NAME:50070/
in your browser it should take you to a page with a Browse the filesystem
link.
In the $HIVE_HOME/conf
directory there is the hive-default.xml
and/or hive-site.xml
which has the hive.metastore.warehouse.dir
property. That value is where you will want to navigate to after clicking the Browse the filesystem
link.
In mine, it's /usr/hive/warehouse
. Once I navigate to that location, I see the names of my tables. Clicking on a table name (which is just a folder) will then expose the partitions of the table. In my case, I currently only have it partitioned on date
. When I click on the folder at this level, I will then see files (more partitioning will have more levels). These files are where the data is actually stored on the HDFS.
I have not attempted to access these files directly, I'm assuming it can be done. I would take GREAT care if you are thinking about editing them. :)
For me - I'd figure out a way to do what I need to without direct access to the Hive data on the disk. If you need access to raw data, you can use a Hive query and output the result to a file. These will have the exact same structure (divider between columns, ect) as the files on the HDFS
. I do queries like this all the time and convert them to CSVs.
The section about how to write data from queries to disk is https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
UPDATE
Since Hadoop 3.0.0 - Alpha 1 there is a change in the default port numbers. NAMENODE_MACHINE_NAME:50070 changes to NAMENODE_MACHINE_NAME:9870. Use the latter if you are running on Hadoop 3.x. The full list of port changes are described in HDFS-9427