Thursday, June 7, 2012

Ingest data from database into Hadoop with Sqoop (1)

Sqoop is an easy tool for importing data from databases into HDFS and exporting data from Hadoop/Hive tables back to databases. Databases have been the de facto standard for storing structured data, but running complex queries over large data sets can be detrimental to their performance.
It is sometimes useful to import the data into Hadoop for ad hoc analysis. Tools like Hive and raw MapReduce provide tremendous flexibility in performing various kinds of analysis.
This becomes particularly useful when the database has been used mostly as a storage device (e.g., storing XML or unstructured string data in CLOB columns).

Sqoop is very simple on its face. Internally, it uses MapReduce to import data from the database in parallel, over a JDBC connection.

I am jumping straight into using Sqoop with an Oracle database and will leave installation for another post.

Sqoop commands are executed from the command line using the following structure:
sqoop COMMAND [ARGS]
All available Sqoop commands can be listed with: sqoop help
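The help command also documents the arguments of each individual command, which is handy when composing the longer invocations shown below:

```shell
# List all available Sqoop commands
sqoop help

# Show the arguments accepted by a specific command, e.g. import
sqoop help import
```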

This article focuses on importing from a database, specifically Oracle DB.
All commands displayed across multiple lines must be run as a single command.

List database schemas present on the Oracle server.

The user credentials must have sufficient permissions to list the databases.
sqoop list-databases 
             --connect jdbc:oracle:thin:@//HOST:PORT
             --username DBA_USER 
             --password password

Providing the password on the command line is insecure; we should use "-P" instead, which makes Sqoop prompt for the password when the command is executed.
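As a sketch of the safer alternatives (the file path here is hypothetical, and --password-file requires a newer Sqoop release, 1.4.4 or later):

```shell
# Option 1: prompt interactively, so the password never appears
# on the command line or in shell history
sqoop list-databases \
    --connect jdbc:oracle:thin:@//HOST:PORT \
    --username DBA_USER \
    -P

# Option 2 (Sqoop 1.4.4+): read the password from a file with
# restricted permissions
echo -n "secret" > "$HOME/.sqoop.password"
chmod 400 "$HOME/.sqoop.password"
sqoop list-databases \
    --connect jdbc:oracle:thin:@//HOST:PORT \
    --username DBA_USER \
    --password-file "file://$HOME/.sqoop.password"
```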

List tables present in an Oracle DB.

sqoop list-tables 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P

Run an arbitrary SQL command using Sqoop:
sqoop eval 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --query "SQL QUERY"

List columns for a table in an Oracle DB. The WHERE 1=0 predicate matches no rows, so the query returns only the column metadata:
sqoop eval 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --query "SELECT  t.*  FROM  TABLENAME  t WHERE 1=0"

Importing tables from Oracle DB
There are several destinations for data imported from an Oracle DB. You can import data to
  • HDFS as raw files
  • Hive table existing or newly created
  • HBase tables

I have not explored all of them, but I definitely prefer storing data as raw files over Hive tables: this makes it easy to define different Hive table schemas over the same files using external tables. You also need to ensure that the target directory in HDFS does not exist prior to running the sqoop command.
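As an illustration of the external-table approach, here is a sketch of Hive DDL laid over raw files that Sqoop imported as text (the table name, columns, and directory are hypothetical; the comma delimiter matches Sqoop's default text output):

```shell
# Hypothetical example: expose Sqoop's comma-delimited text files
# to Hive without moving them. Dropping an EXTERNAL table later
# removes only the metadata and leaves the files in HDFS intact.
hive -e "
CREATE EXTERNAL TABLE my_imported_table (
  column1 STRING,
  column2 BIGINT,
  column3 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/target/directory/in/hdfs';
"
```

A second external table with different column types can point at the same LOCATION, which is what makes the raw-files approach flexible.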
The following will fetch all the data from the table and store it as text files in the target directory in HDFS.
hadoop fs -rmr /target/directory/in/hdfs

sqoop import 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --table TABLENAME
             --as-textfile
             --target-dir /target/directory/in/hdfs
The above tries to execute a query to extract column names from the table. If this is not possible, it will error out with:
ERROR tool.ImportTool: Imported Failed: Attempted to generate class with no columns!
In that case, we have to specify the columns manually in the command.
sqoop import 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --table TABLENAME
             --columns "column1,column2,column3,.."
             --as-textfile
             --target-dir /target/directory/in/hdfs
The above command works well when a primary key is defined for the table, because Sqoop uses it to split the work among mappers. If no primary key exists, you will have to either perform a sequential import by forcing a single mapper:
sqoop import 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --table TABLENAME
             --columns "column1,column2,column3,.."
             --as-textfile
             --target-dir /target/directory/in/hdfs
             -m 1
or split the import by a column in the table:
sqoop import 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --table TABLENAME
             --columns "column1,column2,column3,.."
             --as-textfile
             --target-dir /target/directory/in/hdfs
             --split-by COLUMNNAME
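When choosing a split column, it helps to check its range first: Sqoop divides the [min, max] interval evenly among the mappers, so a skewed or sparse column produces unbalanced splits. The eval command shown earlier can be reused for this (a sketch; COLUMNNAME and TABLENAME are placeholders):

```shell
# Inspect the candidate split column's range. Prefer a uniformly
# distributed numeric column, since each mapper is assigned an
# equal slice of [MIN, MAX], not an equal number of rows.
sqoop eval \
    --connect jdbc:oracle:thin:@//HOST:PORT/DB \
    --username DBA_USER \
    -P \
    --query "SELECT MIN(COLUMNNAME), MAX(COLUMNNAME) FROM TABLENAME"
```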
I will explore a few other nuances of importing data into HDFS in a subsequent article.

18 comments:

  1. Hey, can we use Oracle wallet keys in Oracle import/export? I mean, how can I remove the username and password params from the command?

    1. I have not been able to remove the username and password parameters, but you don't have to give the password on the command line where it could be discovered in shell history. The command can prompt for it at startup instead.

  2. Hi, My Table has couple of string columns and one clob. How can I move this data from my oracle table to HBase? Also I am planning to unmarshall the XML object and store the data in HBase.

    Thanks
    Raj

  3. Import worked out well as per above commands.. Need your help on exporting csv file from hdfs to a table on oracle DB.

  4. I need to EXPORT a csv file from hdfs directory to ORACLE DB. Please help me out on this..
    sqoop export --options-file oracle_db_connect.txt --table stitem_sls_d --export-dir /user/root/testdata/sqoop_test/ --input-fields-terminated-by ','

    above command throws exception error.

  5. Hi, can we append data imported from MySQL to an existing file in HDFS with Sqoop? I want to append daily data into HDFS, but with sqoop --append it creates a new file in the existing folder each time (part-m-00000, part-m-00001, etc.). I want it to append to the part-m-00000 file every time. How can this be done?

    1. Use the sqoop merge tool to merge those files.

  6. Hi Ashish,

    I just tried to load data from Oracle to HDFS

    Before that installed Oracle 11g and tried to import data from Oracle..

    default user: system,password:Admin123$,servicename:orcl and also i copied ojdbc6.jar into sqoop/lib.

    Below are the commands i used and executed but still i am facing the issues

    training@BigDataVM:~/
    Installations/sqoop-1.4.2.bin__hadoop-1.0.0$ bin/sqoop import --connect jdbc:oracle:thin:@//192.168.1.7:1521/orcl --username system --password Admin123$ --table hr.employees -m 1;

    training@BigDataVM:~/Installations/sqoop-1.4.2.bin__hadoop-1.0.0$ bin/sqoop import --connect jdbc:oracle:thin:@//localhost:1521/orcl --username system --password Admin123$ --table hr.employees -m 1;

    training@BigDataVM:~/Installations/sqoop-1.4.2.bin__hadoop-1.0.0$ bin/sqoop import --connect jdbc:oracle:thin:@//192.168.1.7:1521/orcl --username system --password Admin123$ --table employees -m 1;

    training@BigDataVM:~/Install ations/sqoop-1.4.2.bin__hadoop-1.0.0$ bin/sqoop import --connect jdbc:oracle:thin:@//locahost:1521/orcl --username system --password Admin123$ --table employees -m 1;


    Error:

    14/04/01 13:48:29 ERROR manager.SqlManager: Error executing statement: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
    java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection


    Could you please suggest here how to proceed further.

    Thanks
    Venu.Ch

  7. Hi all,

    Will you please send me the query for exporting one Hive table to multiple SQL Server tables with sqoop export?

    One Hive table to one SQL Server destination table works fine, but our requirement is to develop one-to-many, many-to-one and many-to-many relationships. Is it possible to do this in Hadoop?

    Please send me the result as soon as possible...
