Thursday, June 7, 2012

Ingest data from database into Hadoop with Sqoop (1)

Sqoop is an easy tool to import data from databases to HDFS and export data from Hadoop/Hive tables to Databases. Databases has been de-facto standard for storing structured data. Running complex queries on large data in databases can be detrimental to their performance.
It is some times useful to import the data into Hadoop for ad hoc analysis. Tools like hive, raw map-reduce can provide tremendous flexibility in performing various kinds of analysis.
This becomes particularly useful when database has been used mostly as storage device (Ex: Storing XML or unstructured string data as clob data).

Sqoop is very simple on it's face. Internally, it uses map-reduce in parallel data import from Database and utilizes JDBC connection for the purpose.

I am jumping straight into using sqoop with oracle database and will leave installation for some other post.

Sqoop commands are executed from command lines using following structure:
sqoop COMMAND [ARGS]
All available sqoop commands can be listed with: sqoop help

Article focuses on importing from database specifically Oracle DB.
All commands displayed multiline has to be run as a single command.

List database schema present on Oracle server.

User credentials should have have sufficient permission to list the databases.
sqoop list-databases 
             --connect jdbc:oracle:thin:@//HOST:PORT
             --username DBA_USER 
             --password password

Providing password on command line is insecure and we should be using "-P" instead. Sqoop will prompt for the password when you execute the command.

List tables present in a Oracle DB.

sqoop list-tables 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P

Run arbitrary sql command using Sqoop
sqoop eval 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --query "SQL QUERY"

List columns for table in Oracle DB
sqoop eval 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --query "SELECT  t.*  FROM  TABLENAME  t WHERE 1=0"

Importing tables from Oracle DB
There are many destinations from importing data from Oracle DB. You can import data to
  • HDFS as raw files
  • Hive table existing or newly created
  • HBase tables

I have not explored all of them but I definitely prefer storing them as raw files over hive tables. This makes it easy to define different hive table schema using external tables. You also need to ensure that the target directory in hdfs does not exist prior to running the sqoop command.
Following will fetch all the data from the table and store it as text file in target directory in HDFS.
hadoop fs -rmr /target/directory/in/hdfs

sqoop import 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --table TABLENAME
             --as-textfile
             --target-dir /target/directory/in/hdfs
The above tries to execute a query to extract column names from the table. If this is not possible then it will error out with
ERROR tool.ImportTool: Imported Failed: Attempted to generate class with no columns!
In that case, we have to specify the columns manually in the command.
sqoop import 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --table TABLENAME
             --columns "column1,column2,column3,.."
             --as-textfile
             --target-dir /target/directory/in/hdfs
The above command works well when primary key is defined for the table. If this is not specified, you will have to perform sequential import by forcing a single mapper.
sqoop import 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --table TABLENAME
             --columns "column1,column2,column3,.."
             --as-textfile
             --target-dir /target/directory/in/hdfs
             -m 1
or split by column in the table
sqoop import 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --table TABLENAME
             --columns "column1,column2,column3,.."
             --as-textfile
             --target-dir /target/directory/in/hdfs
             --split-by COLUMNNAME
I will explore few other nuances in importing data into HDFS in subsequent article.

6 comments:

  1. hey can we use oracle wallet keys in oracle import, export i mean how can i remove this username and password param from command ?

    ReplyDelete
    Replies
    1. I have not been able to remove the username and password parameters but you don't have to give it on command-line for it to be discovered in history. The startup command can ask for it for the first time.

      Delete
  2. Hi, My Table has couple of string columns and one clob. How can I move this data from my oracle table to HBase? Also I am planning to unmarshall the XML object and store the data in HBase.

    Thanks
    Raj

    ReplyDelete
  3. Import worked out well as per above commands.. Need your help on exporting csv file from hdfs to a table on oracle DB.

    ReplyDelete
  4. I need to EXPORT a csv file from hdfs directory to ORACLE DB. Please help me out on this..
    sqoop export --options-file oracle_db_connect.txt --table stitem_sls_d --export-dir /user/root/testdata/sqoop_test/ --input-fields-terminated-by ','

    above command throws exception error.

    ReplyDelete
  5. HI can we append imported data from mysql to existing file in hdfs with sqoop.because I want to append daily data into hdfs, but in sqoop-append every time it creates new file in existing folder like part-m-00000,part-m-00001... etc. I want it should append every time in part-m-00000 file. how Can it done?

    ReplyDelete