Thursday, June 7, 2012

Ingest data from database into Hadoop with Sqoop (2)

Here, I explore a few other variations for importing data from a database into HDFS. This is a continuation of the previous article.

The sqoop command listed previously is good for a one-time fetch, when you want to import all the current data for a table in the database.

A more practical workflow is to fetch data regularly and incrementally into HDFS for analysis, without re-importing rows that were already fetched. For this, you have to mark a column for incremental import and also provide an initial value. This column usually happens to be a timestamp.
sqoop import 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --table TABLENAME
             --columns "column1,column2,column3,.."
             --as-textfile
             --target-dir /target/directory/in/hdfs
             -m 1
             --check-column COLUMN3
             --incremental  lastmodified
             --last-value "LAST VALUE"
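With --incremental lastmodified, only rows whose COLUMN3 value is newer than --last-value are fetched, and at the end of the run sqoop prints the value to supply as --last-value next time. As a sketch (the job name daily_import is a placeholder), the same import can be wrapped in a saved sqoop job so that sqoop tracks the last value between runs:
sqoop job --create daily_import -- import
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER
             -P
             --table TABLENAME
             --target-dir /target/directory/in/hdfs
             --check-column COLUMN3
             --incremental lastmodified
             --last-value "LAST VALUE"
sqoop job --exec daily_import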

Ingest data from database into Hadoop with Sqoop (1)

Sqoop is an easy tool to import data from databases into HDFS and export data from Hadoop/Hive tables to databases. Databases have been the de facto standard for storing structured data. Running complex queries on large data in databases can be detrimental to their performance.
It is sometimes useful to import the data into Hadoop for ad hoc analysis. Tools like Hive and raw map-reduce can provide tremendous flexibility in performing various kinds of analysis.
This becomes particularly useful when the database has been used mostly as a storage device (e.g., storing XML or unstructured string data in CLOB columns).

Sqoop is very simple on its face. Internally, it uses map-reduce to import data from the database in parallel, over JDBC connections.

I am jumping straight into using sqoop with an Oracle database and will leave installation for some other post.

Sqoop commands are executed from the command line using the following structure:
sqoop COMMAND [ARGS]
All available sqoop commands can be listed with: sqoop help

This article focuses on importing from a database, specifically Oracle.
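As a starting point, a one-time import of a whole table can look like the following (connection details, table name and target directory are placeholders; -P prompts for the database password):
sqoop import
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER
             -P
             --table TABLENAME
             --as-textfile
             --target-dir /target/directory/in/hdfs
             -m 1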

Wednesday, May 16, 2012

Hadoop Map-Reduce with mrjob

With Hadoop, you have the most flexibility in accessing files and running map-reduce jobs when you use Java. All other languages need to go through Hadoop streaming, which feels like being a second-class citizen in Hadoop programming.

For those who like to write map-reduce programs in python, there are good toolkits available out there, like mrjob and dumbo.
Internally, they still use Hadoop streaming to submit map-reduce jobs. These tools simplify the process of map-reduce job submission. My own experience with mrjob has been good so far. Installing and using mrjob is easy.

Installing mrjob

First, ensure that you have installed a newer version of Python than the default that comes with the Linux distribution (2.4.x, which yum depends on). Make sure you don't replace the existing Python distribution, as that breaks yum.

Install mrjob on one of the machines in your Hadoop cluster. It is nicer to use virtualenv to create an isolated environment.
wget -O virtualenv.py http://bit.ly/virtualenv
/usr/bin/python26 virtualenv.py hadoopenv
hadoopenv/bin/easy_install pip
hadoopenv/bin/pip install mrjob
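Once installed, a map-reduce job is just a small Python class. A minimal word-count sketch (the file name and input are arbitrary):
# wordcount.py - minimal mrjob job
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # called once per input line; emit (word, 1) pairs
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # sum up the counts for each word
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

Run it locally with hadoopenv/bin/python wordcount.py input.txt, or submit it to the cluster with the -r hadoop runner; mrjob takes care of the streaming plumbing for you.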

Tuesday, May 15, 2012

HBase pseudo-cluster installation

I have been preparing a VM with HBase installed in pseudo-cluster mode for experimental purposes. There are quite a few useful blogs on installing HBase. I settled on the following minimal installation procedure.

I am blogging it for future reference. Hopefully it will help others too.

Before proceeding to install HBase in pseudo-cluster mode, you can check out the procedure for installing Hadoop in pseudo-cluster mode.

A few tweaks are required in the OS configuration. Add the following to /etc/security/limits.conf:
  • hdfs  -       nofile  32768
  • hbase  -       nofile  32768

A few changes are required to the Hadoop configuration that I mentioned earlier. Add the following to hdfs-site.xml:

   <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
   </property>
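HBase itself also needs a pseudo-distributed configuration. A minimal sketch of hbase-site.xml, assuming the namenode listens on localhost:8020 and ZooKeeper data lives under /var/lib/zookeeper (adjust the port and path to match your Hadoop setup):

   <property>
      <name>hbase.rootdir</name>
      <value>hdfs://localhost:8020/hbase</value>
   </property>
   <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
   </property>
   <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/var/lib/zookeeper</value>
   </property>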


Monday, May 14, 2012

Hadoop pseudo-cluster installation

Install Java and cloudera yum repo
yum install java-1.6.0-openjdk.x86_64
curl -O http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo
mv cloudera-cdh3.repo /etc/yum.repos.d/

Ensure that you have hostname and localhost entries in /etc/hosts, and comment out the IPv6 entry.

Create the hadoop group and users manually:
create "hdfs" and "mapred" users with the group "hadoop"
groupadd hadoop
useradd -G hadoop hdfs
useradd -G hadoop mapred
passwd hdfs 
passwd mapred 
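With the repository in place, the pseudo-distributed configuration can be pulled in as a package and the services started (package and service names below are the CDH3 ones; adjust if your distribution differs):
yum install hadoop-0.20-conf-pseudo
sudo -u hdfs hadoop namenode -format
service hadoop-0.20-namenode start
service hadoop-0.20-datanode start
service hadoop-0.20-jobtracker start
service hadoop-0.20-tasktracker start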

Tuesday, May 8, 2012

Few things to take care while building vagrant boxes

The following are some tricks that were useful to me while creating an Oracle Enterprise Linux vagrant box.
Create the VM using the VDI format for easy handling.
Make sure you have removed all extraneous packages from the installed VM.
You can check out package descriptions at pkgs.org
yum remove X11
yum list installed | grep gnome

Also ensure that yum installs only the relevant language support.
Edit /etc/rpm/macros.lang and include
%_install_langs en:fr
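Once the VM is trimmed down, it can be turned into a box with vagrant package (the VM and box names below are placeholders):
vagrant package --base oel-vm --output oel.box
vagrant box add oel oel.box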

Thursday, March 15, 2012

External tables in Hive are handy

Usually, when you create tables in hive and load raw data from HDFS, hive moves the data to a different location, "/user/hive/warehouse". If you create a simple table, its data will live inside the warehouse. The following hive command creates a table with its data location at "/user/hive/warehouse/user".
hive>   CREATE TABLE user(id INT, name STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Consider that the raw data is located at "/home/admin/userdata/data1.txt". If you issue the following hive command, the data will be moved to a new location, "/user/hive/warehouse/user/data1.txt".
hive> LOAD DATA INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;

If all we want to do is run hive queries, that is fine. But when you drop the table, the raw data is lost, because the directory corresponding to the table in the warehouse is deleted.
You may not want to delete the raw data, as someone else might use it in map-reduce programs outside of hive. It is far more convenient to retain the data at its original location via an "EXTERNAL" table, as shown below.
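A sketch of the external variant (assuming the raw files sit in an HDFS directory such as "/user/admin/userdata"; the path is a placeholder):
hive>   CREATE EXTERNAL TABLE user(id INT, name STRING) ROW FORMAT
              DELIMITED FIELDS TERMINATED BY ','
              LINES TERMINATED BY '\n' STORED AS TEXTFILE
              LOCATION '/user/admin/userdata';
Dropping this table removes only the metadata; the files under "/user/admin/userdata" stay where they are.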

Tuesday, March 13, 2012

Introducing ØMQ and pyzmq through examples

ØMQ is a messaging library that has the capability to revolutionize distributed software development.

Unlike full-fledged messaging systems, it provides the right set of abstractions to incorporate various messaging patterns. It also provides the concept of devices, which allows the creation of complex network topologies.

To get a quick overview, you can read the introduction to ØMQ by Nicholas Piël.

ØMQ sockets are a light abstraction on top of native sockets.
This allows ØMQ to remove certain constraints and add new ones that make writing messaging infrastructure a breeze.
  • ØMQ sockets adhere to predefined messaging patterns; the pattern has to be chosen at ØMQ socket creation time.
  • A ØMQ socket can connect to many ØMQ sockets, unlike native sockets.
  • There are constraints on which types of ØMQ sockets can connect to each other.

ØMQ has bindings for many languages including python (pyzmq) and that makes it very interesting.

It has been fun learning the basics, and I hope to soon create some real-world examples to sharpen my knowledge of ØMQ. Till then, I hope that the mini tutorial on ØMQ and pyzmq will serve as a good introduction to its capabilities.

Check out : http://readthedocs.org/docs/learning-0mq-with-pyzmq/en/latest/index.html

It is quite easy to get started. Use virtualenv and pip.
pip install pyzmq-static
pip install tornado
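As a quick smoke test, here is a minimal REQ/REP pair with pyzmq (the port and messages are arbitrary). The server echoes back whatever it receives:
# rep_server.py - minimal ØMQ REP socket
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REP)
sock.bind("tcp://127.0.0.1:5555")
while True:
    msg = sock.recv()           # wait for a request
    sock.send(b"echo: " + msg)  # reply to the requester

# req_client.py - minimal ØMQ REQ socket
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect("tcp://127.0.0.1:5555")
sock.send(b"hello")
print(sock.recv())

Run the server in one terminal and the client in another; the REQ socket strictly alternates send and recv, which is exactly the kind of pattern constraint mentioned above.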

Check out the code from https://github.com/ashishrv/pyzmqnotes
Follow some of the annotated examples to see the awesomeness of ØMQ.

Do post your feedback on the mini tutorial here as comments.