Wednesday, May 16, 2012

Hadoop Map-Reduce with mrjob

With Hadoop, Java gives you the most flexibility in accessing files and running map-reduce jobs. Every other language has to go through Hadoop streaming, which makes it feel like a second-class citizen in Hadoop programming.

For those who like to write map-reduce programs in Python, there are good toolkits available, such as mrjob and dumbo.
Internally, they still use Hadoop streaming to submit map-reduce jobs, but they simplify the process of job submission. My own experience with mrjob has been good so far. Installing and using mrjob is easy.
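
Since these toolkits sit on top of Hadoop streaming, it may help to see roughly what a streaming-style word count looks like in plain Python. This is only a minimal sketch: the function names map_lines and reduce_records are my own, it is not mrjob's API, and it assumes the usual streaming convention of tab-separated key/value records that arrive at the reducer sorted by key.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit one "word<TAB>1" record per word in the input."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reduce_records(records):
    """Reducer step: sum the counts of adjacent records that share a key.

    Assumes the records are already sorted by key, as Hadoop streaming
    does between the map and reduce phases.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))
```

Locally, sorting the mapper output and feeding it to the reducer imitates the shuffle step. A toolkit like mrjob takes care of wiring such steps to stdin/stdout and submitting the job for you.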

Installing mrjob

First ensure that you have installed a newer version of Python than the default that comes with Linux (2.4.x, which yum depends on). Install it alongside the existing Python distribution rather than replacing it, as replacing it breaks "yum".

Install mrjob on one of the machines in your Hadoop cluster. It is nicer to use virtualenv to create an isolated environment.
wget -O virtualenv.py http://bit.ly/virtualenv
/usr/bin/python26 virtualenv.py hadoopenv
hadoopenv/bin/easy_install pip
hadoopenv/bin/pip install mrjob

Tuesday, May 15, 2012

HBase pseudo-cluster installation

I have been preparing a VM with HBase installed in pseudo-cluster mode for experimental purposes. There are quite a few useful blogs on installing HBase. I settled on the following minimal installation procedure.

I am blogging it for future reference. Hopefully it will help others too.

Before proceeding to install HBase in pseudo-cluster mode, you can check out the procedure for installing Hadoop in pseudo-cluster mode.

A few tweaks are required in the OS configuration. Add the following to /etc/security/limits.conf:
  • hdfs  -       nofile  32768
  • hbase  -       nofile  32768
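
To confirm that the new limit is in effect, something like the following can be run as the hdfs or hbase user from a fresh login session. It is a rough sketch using Python's standard resource module; the helper names and the 32768 default are taken from the entries above or made up for illustration.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def meets_minimum(limit, minimum=32768):
    """True if a limit is unlimited or at least the required minimum."""
    return limit == resource.RLIM_INFINITY or limit >= minimum


soft, hard = nofile_limits()
print("soft=%s hard=%s ok=%s" % (soft, hard, meets_minimum(soft)))
```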

A few changes are also required to the Hadoop configuration that I described earlier. Add the following to hdfs-site.xml:

   <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
   </property>
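
A quick way to double-check that the property made it into the file is to parse it and look the value up. The snippet below is only a sketch with a hypothetical helper name, read_property; it assumes the standard Hadoop layout of property elements with name and value children, and leaves the path of hdfs-site.xml to you since it depends on the installation.

```python
import xml.etree.ElementTree as ET


def read_property(xml_text, name):
    """Return the value of the named property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Inline sample standing in for the contents of hdfs-site.xml.
sample = """<configuration>
   <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
   </property>
</configuration>"""

print(read_property(sample, "dfs.datanode.max.xcievers"))
```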


Monday, May 14, 2012

Hadoop pseudo-cluster installation

Install Java and the Cloudera yum repo
yum install java-1.6.0-openjdk.x86_64
curl -O http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo
mv cloudera-cdh3.repo /etc/yum.repos.d/

Ensure that /etc/hosts has entries for both the hostname and localhost.
Comment out the IPv6 entry.
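
The two checks above can be sketched in a few lines of Python. This is an illustration only: check_hosts is a made-up helper, and the sample addresses and the host name node1 are placeholders, not values from my setup.

```python
def check_hosts(hosts_text, hostname):
    """Report whether localhost and hostname resolve via IPv4 entries,
    and list any IPv6 lines that are still active (not commented out)."""
    names = set()
    ipv6_lines = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        address, aliases = fields[0], fields[1:]
        if ":" in address:  # IPv6 addresses contain colons
            ipv6_lines.append(raw)
            continue
        names.update(aliases)
    return {
        "localhost": "localhost" in names,
        "hostname": hostname in names,
        "ipv6_lines": ipv6_lines,
    }


# Placeholder content standing in for /etc/hosts.
sample_hosts = """127.0.0.1   localhost
192.168.1.10   node1.example.com node1
#::1   localhost6
"""

print(check_hosts(sample_hosts, "node1"))
```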

Create the hadoop group and users manually.
Create the "hdfs" and "mapred" users and add them to the "hadoop" group:
groupadd hadoop
useradd -G hadoop hdfs
useradd -G hadoop mapred
passwd hdfs 
passwd mapred 

Tuesday, May 8, 2012

A few things to take care of while building Vagrant boxes

The following tricks were useful to me while creating an Oracle Enterprise Linux Vagrant box.
Create the VM using the VDI disk format for easy handling.
Make sure you have removed all extraneous packages from the installed VM.
You can check package descriptions at pkgs.org.
yum remove X11
yum list installed | grep gnome

Also ensure that yum installs only the relevant language support.
Edit /etc/rpm/macros.lang and include:
%_install_langs en:fr