Thursday, March 15, 2012

External tables in Hive are handy

Usually when you create tables in Hive from raw data in HDFS, Hive moves the data to a different location - its warehouse directory, "/user/hive/warehouse". If you create a simple table, its data will be located inside the warehouse. The following Hive command creates a table with its data location at "/user/hive/warehouse/user".
hive>   CREATE TABLE user(id INT, name STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Consider that the raw data is located at "/home/admin/userdata/data1.txt". If you issue the following Hive command, the file is moved to a new location, "/user/hive/warehouse/user/data1.txt".
hive> LOAD DATA INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;
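You can confirm the move by inspecting the table's metadata and the warehouse directory. A quick sketch, assuming the default warehouse location and a standard Hadoop CLI setup:
hive>   DESCRIBE FORMATTED user;
        -- the "Location:" field should show hdfs://.../user/hive/warehouse/user

    $ hadoop fs -ls /user/hive/warehouse/user
        # data1.txt is now listed here, not at its original path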

If all we want is to run Hive queries, that is fine. But when you drop the table, the raw data is lost, because the directory corresponding to the table in the warehouse is deleted.
You may also not want Hive to delete the raw data, as someone else might use it in map-reduce programs external to Hive analysis. It is far more convenient to retain the data at its original location via "EXTERNAL" tables.
The first step is to point to the location of the data while creating the table. This ensures that the data is not moved into a location inside the warehouse directory.
hive>   CREATE TABLE user(id INT, name STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' 
              STORED AS TEXTFILE
              LOCATION '/home/admin/userdata';

Now you can happily use both Hive HQL queries and hand-crafted map-reduce programs on the same data. However, the table above is still a managed table: when you drop it, Hive will attempt to delete even the externally located data. This is addressed by explicitly marking the table "EXTERNAL". Try dropping this table, and you will see that the raw data is retained.
hive>   CREATE EXTERNAL TABLE user(id INT, name STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' 
              STORED AS TEXTFILE
              LOCATION '/home/admin/userdata';
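To convince yourself of the difference, drop the external table and check that the raw data survives. A sketch, using the same LOCATION path as above:
hive>   DROP TABLE user;
        -- only the table metadata is removed

    $ hadoop fs -ls /home/admin/userdata
        # data1.txt is still here, untouched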

There are a few more goodies in Hive that surprised me. You can overlay multiple tables, all pointing to the same raw data. The command below creates a second table with a different schema over the same raw data. This allows you to experiment and create new tables that improve on the previous schema. It also allows you to use different schemas for different Hive queries.
hive>   CREATE EXTERNAL TABLE userline(line STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' 
              STORED AS TEXTFILE
              LOCATION '/home/admin/userdata';
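Both overlays can now be queried independently over the same file. A sketch; the actual values returned depend on the contents of your data1.txt:
hive>   SELECT name FROM user WHERE id = 1;
hive>   SELECT count(1) FROM userline;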

11 comments:

  1. Thanks a lot for your detailed example.

  2. A bit subawesome that the LOCATION has to point to a dedicated folder and can't point to a single file though...

    1. But this really helps: new files can be added into the folder and are available in the table without any further engineering ...

  3. LOAD DATA INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;
    should be LOAD DATA LOCAL INPATH

  4. Is it possible to load the data into an RDBMS from Hive? If yes, how do I do it, say Hive to Oracle.

  5. I'd say via ETL tool using Hive ODBC.

  6. @Ratheesh use sqoop - http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html

  7. Hi,

    may you please show me how to run the following command in an Oozie workflow:
    hive> LOAD DATA INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;

    my goal is to load data to the external hive table every hour.
