Wednesday, May 16, 2012

Hadoop Map-Reduce with mrjob

With Hadoop, Java gives you the most flexibility in accessing files and running map-reduce jobs. Every other language has to go through Hadoop streaming, which makes it feel like a second-class citizen in Hadoop programming.

For those who like to write map-reduce programs in Python, there are good toolkits available, such as mrjob and dumbo.
Internally, they still use Hadoop streaming to submit map-reduce jobs, but they simplify the process of job submission. My own experience with mrjob has been good so far. Installing and using mrjob is easy.
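
To make the streaming model these toolkits build on a bit more concrete, here is a rough sketch of a word count written as plain Python functions. This is not mrjob's API; the names mapper, reducer and run_local are my own, chosen only for illustration, and the whole thing runs in a single process instead of on a cluster.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Sum the counts per word. Expects pairs sorted by key,
    which is what the shuffle/sort phase provides on a real cluster."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate the map, shuffle/sort and reduce phases in one process."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))
```

Calling run_local on a list of text lines returns a dictionary of word counts. With Hadoop streaming, the mapper and reducer would instead be separate scripts reading from stdin and writing to stdout, and that plumbing is what toolkits like mrjob take care of.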

Installing mrjob

First, make sure you have installed a newer version of Python than the default that comes with Linux (2.4.x, which yum depends on). Do not replace the existing Python distribution, as that breaks yum.

Install mrjob on one of the machines in your Hadoop cluster. It is nicer to use virtualenv to create an isolated environment.
wget -O virtualenv.py http://bit.ly/virtualenv
/usr/bin/python26 virtualenv.py hadoopenv
hadoopenv/bin/easy_install pip
hadoopenv/bin/pip install mrjob

Tuesday, May 15, 2012

HBase pseudo-cluster installation

I have been preparing a VM with HBase installed in pseudo-cluster mode for experimental purposes. There are quite a few useful blog posts on installing HBase. I settled on the following minimal installation procedure.

I am blogging it for future reference. Hopefully it will help others too.

Before proceeding to install HBase in pseudo-cluster mode, you can check out the procedure for installing Hadoop in pseudo-cluster mode.

A few tweaks are required in the OS configuration. Add the following to /etc/security/limits.conf:
   hdfs   -   nofile   32768
   hbase  -   nofile   32768
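
To check that the new limit is actually in effect, a small Python sketch like the one below can help. It uses the standard resource module (Unix only). The helper names nofile_limits and nofile_ok are my own, and I am assuming you run it as the hdfs or hbase user in a fresh login session, since limits.conf is normally applied at login.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def nofile_ok(required=32768):
    """True if the soft limit is unlimited or at least `required`."""
    soft, _hard = nofile_limits()
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    print("open-file limits (soft, hard):", nofile_limits())
    print("meets 32768:", nofile_ok())
```

If the second line prints False, the limits.conf entry is probably not being picked up for that user.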

A few changes are also required to the Hadoop configuration that I described earlier. Add the following to hdfs-site.xml:

   <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
   </property>
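
As a quick sanity check that the property is in place, the file can be parsed with Python's standard xml.etree.ElementTree. This is only a sketch: read_property is a name I made up, and SAMPLE stands in for the real hdfs-site.xml, assuming the property sits inside the usual <configuration> root element.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name):
    """Return the value of the named Hadoop property, or None if absent."""
    root = ET.fromstring(xml_text.strip())
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Stand-in for the contents of hdfs-site.xml (assumed layout).
SAMPLE = """
<configuration>
   <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
   </property>
</configuration>
"""

if __name__ == "__main__":
    print(read_property(SAMPLE, "dfs.datanode.max.xcievers"))
```

To check a real installation, read the text of your own hdfs-site.xml and pass it in place of SAMPLE.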