Thursday, March 15, 2012

External tables in Hive are handy

Usually when you create tables in hive using raw data in HDFS, it moves them to a different location - "/user/hive/warehouse". If you created a simple table, it will be located inside the data warehouse. The following hive command creates a table with data location at "/user/hive/warehouse/user".
hive>   CREATE TABLE user(id INT, name STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Consider that the raw data is located at "/home/admin/data1.txt" and if you issues the following hive command, the data would be moved to a new location at "/user/hive/warehouse/user/data1.txt".
hive> LOAD DATA INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;

If we want to just do hive queries, it is all fine. When you drop the table, the raw data is lost as the directory corresponding to the table in warehouse is deleted.
You may also not want to delete the raw data as some one else might use it in map-reduce programs external to hive analysis. It is far more convenient to retain the data at original location via "EXTERNAL" tables.
To create external table, simply point to the location of data while creating the tables. This will ensure that the data is not moved into a location inside the warehouse directory.
hive>   CREATE TABLE user(id INT, name STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' 
              STORED AS TEXTFILE
              LOCATION '/home/admin/userdata';

Now you could happily use both Hive HQL queries as well as hand-crafted map-reduce programs on the same data. How ever, when you drop the table, hive would attempt to delete the externally located data. This can be addressed by explicitly marking the table "EXTERNAL". Try dropping the table, you will see that the raw data is retained.
hive>   CREATE EXTERNAL TABLE user(id INT, name STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' 
              STORED AS TEXTFILE
              LOCATION '/home/admin/userdata';

There are few more goodies in Hive that surprised me. You can overlay multiple tables all pointing to the same raw data. The following command below will ensure that there are two table with different schema overlay over the same raw data. This allows you to experiment and create new tables which improves on the previous schema. It also allows you to use different schema for different hive queries.
hive>   CREATE EXTERNAL TABLE userline(line STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' 
              STORED AS TEXTFILE
              LOCATION '/home/admin/userdata';

25 comments:

  1. Thanks a lot for your detailed example.

    ReplyDelete
  2. A bit subawesome that the LOCATION has to point to a dedicated folder and can't point to a single file though...

    ReplyDelete
    Replies
    1. But this really helps as new files can be added into the folder and is available in the table without any further engineering ...

      Delete
  3. LOAD DATA INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;
    shoulb be LOAD DATA LOCAL INPATH

    ReplyDelete
    Replies
    1. If you are mentioning LOCAL it means that the path is in your local system and not in your HDFS. If you ignore LOCAL it means that the path your giving is your HDFS path.

      Delete
  4. Is it possible to load the data into a rdbms from hive ?. If yes how do I do it, say hive to Oracle.

    ReplyDelete
  5. I'd say via ETL tool using Hive ODBC.

    ReplyDelete
  6. @Ratheesh use sqoop - http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html

    ReplyDelete
  7. Hi,

    may you please show me how to run the following command on oozie workflow:
    hive> LOAD DATA INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;

    my goal is to load data to the external hive table every hour.

    ReplyDelete
  8. Thank you for your detailed explanation!

    ReplyDelete
  9. Hi,

    Really useful post!

    Is there any way to omit file fields in external tables? In your example I would like to create an external table with just the second field of the file: name STRING

    Thanks!!

    ReplyDelete
  10. Can one create external table as select query and store it to the desired location?

    ReplyDelete
  11. how do we dump data in an existing external table without deleting/moving the source files?

    ReplyDelete
  12. can we create hive external table with bucket and ORC option?

    ReplyDelete
  13. Thankful to you for this amazing information sharing with us. Get website designing and development services by Ogen Infosystem.
    Website Designing Company in Delhi

    ReplyDelete
  14. Decent, Get Service for Night out page 3 parties and this magnificent service provided by Lifestyle Magazine.
    Lifestyle Magazine India

    ReplyDelete
  15. Great, I think this is one of the best blog in past some time I have seen. Visit Kalakutir for Fleet Painting, Godown Line Marking Painting and Caution & Indication Signages.
    Fleet Painting

    ReplyDelete
  16. Gteh Stocktwits Real-Time Overview Of A Stock, Including Recent And Historical Price Charts, News, Events, Analyst Rating Changes And Other Key Stock Information.

    ReplyDelete
  17. Explore top-notch Docker consulting services to optimize containerization strategies. Unlock efficiency and seamless deployment for your business success.

    ReplyDelete
  18. BuyCloudlinux License CloudLinux is an operating system built for web hosting purposes, offering commercial support and bolstering density, stability, and security in shared hosting server environments.

    ReplyDelete