Wednesday, May 16, 2012

Hadoop Map-Reduce with mrjob

With Hadoop, Java gives you the most flexibility in accessing files and running map-reduce jobs. Every other language has to go through Hadoop streaming, which makes it feel like a second-class citizen in Hadoop programming.

For those who like to write map-reduce programs in Python, there are good toolkits available, such as mrjob and dumbo.
Internally, they still use Hadoop streaming to submit map-reduce jobs, but they simplify the process of job submission. My own experience with mrjob has been good so far, and installing and using it is easy.

Installing mrjob

First, ensure that you have installed a newer version of Python than the default that comes with Linux (2.4.x, which yum depends on). Install it alongside the existing Python distribution rather than replacing it, since replacing it breaks "yum".

Install mrjob on one of the machines in your Hadoop cluster. It is nicer to use virtualenv to create an isolated environment.
wget -O
/usr/bin/python26 pythonenv
pythonenv/bin/easy_install pip
pythonenv/bin/pip install mrjob

The current version available to me is "mrjob==".

There is a small, ugly hack that you need to make in one of the files: pythonenv/lib/python2.6/site-packages/mrjob/ at line number 444.

I am not sure if I am doing something wrong, but mrjob throws an exception because "self._start_step_num" is None.

Replace the code there with the following lines.

# look for a Python trace-back
cause = None
if self._start_step_num and step_num:
    cause = self._find_probable_cause_of_failure(
               [step_num + self._start_step_num])
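
As a rough illustration of why the guard helps, the sketch below reproduces the arithmetic outside mrjob. The function names are hypothetical stand-ins, not mrjob's own code; they only show that adding None to an integer raises a TypeError, while the guarded version skips the lookup.

```python
def unguarded_steps(start_step_num, step_num):
    """Hypothetical stand-in for the original line: no check for None."""
    return [step_num + start_step_num]


def guarded_steps(start_step_num, step_num):
    """Hypothetical stand-in for the patched lines: skip the lookup when
    either value is None (or 0, since the check is a plain truth test)."""
    if start_step_num and step_num:
        return [step_num + start_step_num]
    return None


if __name__ == '__main__':
    try:
        unguarded_steps(None, 1)
    except TypeError as exc:
        print("unguarded version fails: %s" % exc)
    print(guarded_steps(None, 1))
    print(guarded_steps(1, 2))
```

Note that the truth test also skips the lookup when either number is 0, so this patch trades some failure diagnostics for not crashing.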

You also need to set the HADOOP_HOME variable.
export HADOOP_HOME=/usr/lib/hadoop

That's it, and you should be ready to use mrjob!

Writing a map-reduce program

Now we can run through the familiar word-count example.
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()

It is as simple as that. You need to create a class derived from MRJob and provide the essential methods: mapper, combiner and reducer.
Depending on what you want to do, you may need only a mapper, a mapper and a reducer, or all three.
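
To make the data flow concrete, here is a minimal sketch that mimics the mapper and reducer logic in plain Python, without mrjob or Hadoop. The helper names (map_line, reduce_counts, run_local) and the in-memory grouping step are illustrative assumptions, not part of mrjob; on a real cluster, Hadoop does the shuffle and sort between the phases.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def map_line(line):
    """Mimic the mapper: emit (word, 1) for every word in a line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reduce_counts(word, counts):
    """Mimic the combiner/reducer: sum the counts seen for one word."""
    yield word, sum(counts)


def run_local(lines):
    """Hypothetical stand-in for the shuffle: group values by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_line(line):
            grouped[word].append(count)
    result = {}
    for word in sorted(grouped):
        for key, total in reduce_counts(word, grouped[word]):
            result[key] = total
    return result


if __name__ == '__main__':
    print(run_local(["a rose is a rose", "A thorn"]))
```

Because the combiner and reducer both just sum counts, running the combiner on partial groups first does not change the final totals, which is why the word-count job can safely use one.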

Running map-reduce with mrjob

To run this program, you need to issue the following command
pythonenv/bin/python hdfs:///path/to/file/inhdfs -r hadoop --python-bin python26 --step-num=1

hdfs:///path/to/file/inhdfs => input dir or file in hdfs
-r hadoop => tells mrjob to run the job on hadoop cluster
--python-bin python26 => use newer version of python executable
--step-num=1 => tells the step to execute

You should be able to successfully run the map-reduce using mrjob.
Input to a mapper is a line and its output is a (key, value) pair. In this case, its output is a (keyword, 1) pair.

The reducer takes each key together with all the values emitted for it and reduces them. In the above program, it outputs (keyword, occurrences) pairs.
Streaming final output from hdfs:///somepath/tmp/mrjob/test.admin.20120506.133838.502705/output
"a" 2
"about" 1
"adapting" 1
"again" 2
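
If you want to post-process these results in Python, a small sketch like the following may help. It assumes mrjob's default output format, where each line holds a JSON-encoded key and a JSON-encoded value separated by a tab (the listing above shows the separator as plain whitespace). The function name parse_output_line and the sample lines are made up for illustration.

```python
import json


def parse_output_line(line):
    """Split one 'key<TAB>value' line and JSON-decode both halves."""
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


# Sample lines in the assumed tab-separated format.
sample = ['"a"\t2\n', '"about"\t1\n', '"again"\t2\n']
counts = dict(parse_output_line(line) for line in sample)

if __name__ == '__main__':
    print(counts)
```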

You can also provide multiple inputs by specifying them directly when invoking the mrjob command
hdfs:///path/to/file/inhdfs1 hdfs:///path/to/file/inhdfs2 hdfs:///path/to/file/inhdfs3
You can store the output in HDFS or in a local path by passing another option to the mrjob command.
--output-dir hdfs:///pathto/wordcount/output/2345

You have to ensure that the parent directory exists in HDFS and that the output directory does not, or else the job will error out.
hadoop fs -mkdir  hdfs:///pathto/wordcount
hadoop fs -rmdir  hdfs:///pathto/wordcount/output/2345
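
When the job is launched from another script, the pieces above can be assembled programmatically. The sketch below only builds the argument list from the options already shown (-r hadoop, --python-bin, --step-num, --output-dir); the function name build_mrjob_command and the script name wordcount.py are hypothetical placeholders.

```python
def build_mrjob_command(script, inputs, output_dir=None,
                        python_bin="python26", step_num=1):
    """Assemble the argument list for running an mrjob script on Hadoop."""
    cmd = ["pythonenv/bin/python", script]
    cmd.extend(inputs)  # one or more HDFS input paths
    cmd += ["-r", "hadoop", "--python-bin", python_bin,
            "--step-num=%d" % step_num]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    return cmd


if __name__ == '__main__':
    # 'wordcount.py' is a made-up script name for illustration.
    print(" ".join(build_mrjob_command(
        "wordcount.py",
        ["hdfs:///path/to/file/inhdfs1", "hdfs:///path/to/file/inhdfs2"],
        output_dir="hdfs:///pathto/wordcount/output/2345")))
```

The resulting list can be handed to subprocess.call as-is, which avoids shell quoting problems with the paths.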

Here there is always some output from the reducer phase if the file is non-empty.
In certain map-reduce programs, such as grep or regular-expression matching, the job may not always yield any output. Hadoop map-reduce considers this a failure.
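
To see how such a job can end up with no output at all, consider this minimal sketch of a grep-style mapper written as a plain generator function. The pattern and the sample lines are made up for illustration; in a real job, the function body would live in the mapper method of an MRJob subclass.

```python
import re

PATTERN = re.compile(r"ERROR")  # hypothetical pattern to search for


def grep_mapper(_, line):
    """Emit the line only when it matches; yield nothing otherwise."""
    if PATTERN.search(line):
        yield line, 1


# None of these sample lines match, so the mapper emits nothing.
matches = [pair
           for line in ["all good", "still fine"]
           for pair in grep_mapper(None, line)]

if __name__ == '__main__':
    print(matches)
```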

In order to avoid the issue, you will have to pass the following option to your map-reduce program.

There are a few more options that allow you to write elaborate map-reduce programs using mrjob. Check out the documentation for the details.

