## Thursday, March 15, 2012

### External tables in Hive are handy

Usually, when you create a table in Hive over raw data in HDFS, loading the data moves it to a different location, under "/user/hive/warehouse". If you create a simple table, its data will live inside the warehouse. The following Hive command creates a table whose data location is "/user/hive/warehouse/user".
```sql
hive> CREATE TABLE user(id INT, name STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;
```


Consider that the raw data is located at "/home/admin/userdata/data1.txt" in HDFS. If you issue the following Hive command, the file is moved to a new location, "/user/hive/warehouse/user/data1.txt".
```sql
hive> LOAD DATA INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;
```


If all we want is to run Hive queries, that is fine. But when you drop the table, the raw data is lost, because the directory corresponding to the table in the warehouse is deleted.
You may also not want to move the raw data, as someone else might use it in MapReduce programs external to the Hive analysis. It is far more convenient to retain the data at its original location via "EXTERNAL" tables.
To create such a table, point to the location of the data with a LOCATION clause while creating it; note that LOCATION must name a directory, not a single file. This ensures that the data is not moved into a location inside the warehouse directory.
```sql
hive> CREATE TABLE user(id INT, name STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/home/admin/userdata';
```


Now you can happily use both Hive HQL queries and hand-crafted MapReduce programs on the same data. However, when you drop the table, Hive will still attempt to delete the externally located data. This is addressed by explicitly marking the table "EXTERNAL". Try dropping the table, and you will see that the raw data is retained.
```sql
hive> CREATE EXTERNAL TABLE user(id INT, name STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/home/admin/userdata';
```
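To verify this behaviour, you can drop the external table and then list its directory from the Hive shell (a sketch, using the same "/home/admin/userdata" path as in the examples above):

```sql
-- Dropping an EXTERNAL table removes only the table metadata
hive> DROP TABLE user;

-- The raw files are still in place
hive> dfs -ls /home/admin/userdata;
```

Dropping a non-EXTERNAL table with the same LOCATION would delete the directory along with the metadata.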


There are a few more goodies in Hive that surprised me. You can overlay multiple tables, all pointing to the same raw data. The following command creates a second table, with a different schema, over the same raw data as the "user" table. This allows you to experiment and create new tables that improve on a previous schema. It also allows you to use different schemas for different Hive queries.
```sql
hive> CREATE EXTERNAL TABLE userline(line STRING)
      ROW FORMAT DELIMITED
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/home/admin/userdata';
```
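With both tables overlaid on the same directory, the same bytes can be queried through either schema (a sketch, assuming the two-column id/name data loaded earlier):

```sql
-- Structured view: per-column access
hive> SELECT name FROM user WHERE id = 1;

-- Raw view: whole lines, handy for ad-hoc parsing
hive> SELECT line FROM userline LIMIT 5;
```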


1. Thanks a lot for your detailed example.

2. A bit subawesome that the LOCATION has to point to a dedicated folder and can't point to a single file though...

    1. But this really helps, as new files can be added to the folder and become available in the table without any further engineering ...

    2. Very rightly said.

3. Should be LOAD DATA LOCAL INPATH.

    1. If you mention LOCAL, it means the path is on your local filesystem and not in HDFS. If you omit LOCAL, it means the path you give is an HDFS path.
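The two forms from the reply above look like this (a sketch; the paths are illustrative):

```sql
-- File on the local filesystem: copied into the table's location
hive> LOAD DATA LOCAL INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;

-- File already in HDFS: moved into the table's location
hive> LOAD DATA INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;
```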

4. Is it possible to load the data into an RDBMS from Hive? If yes, how do I do it, say Hive to Oracle?

5. I'd say via an ETL tool using Hive ODBC.

6. @Ratheesh use sqoop - http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html

7. Hi,

may you please show me how to run the following command in an Oozie workflow:

my goal is to load data into the external Hive table every hour.

8. Thank you for your detailed explanation!

9. very informative...thanks...

10. Hi,

Really useful post!

Is there any way to omit file fields in external tables? In your example I would like to create an external table with just the second field of the file: name STRING

Thanks!!

11. Can one create an external table as a select query and store it at a desired location?

12. How do we dump data into an existing external table without deleting/moving the source files?

13. Can we create a Hive external table with bucketing and the ORC option?
