Wednesday, May 16, 2012

Hadoop Map-Reduce with mrjob

Hadoop gives you the most flexibility in accessing files and running map-reduce jobs when you use Java. Every other language has to go through Hadoop streaming, which makes it feel like a second-class citizen in Hadoop programming.

For those who like to write map-reduce programs in Python, there are good toolkits available, such as mrjob and dumbo.
Internally, they still use Hadoop streaming to submit map-reduce jobs, but they simplify the process of job submission. My own experience with mrjob has been good so far: installing and using it is easy.
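
To give a feel for what these toolkits hide, here is a minimal sketch of the Hadoop streaming contract: a mapper and a reducer exchange plain text records, one per line, with key and value separated by a tab. The function names and the sample input are illustrative assumptions, not part of mrjob, and the sort step is only a local stand-in for what Hadoop does between the two phases.

```python
from itertools import groupby


def stream_mapper(lines):
    """Turn raw input lines into tab-separated "word<TAB>1" records."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower(), 1)


def stream_reducer(sorted_records):
    """Sum the counts of consecutive records that share a key.

    Assumes the records arrive sorted by key, which is how Hadoop
    streaming hands them to a reducer.
    """
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


if __name__ == "__main__":
    # Local stand-in for the map -> sort -> reduce pipeline.
    mapped = sorted(stream_mapper(["a b a", "b c"]))
    for record in stream_reducer(mapped):
        print(record)
```

With mrjob you write only the per-record logic; the serialization and the job submission are handled for you.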

Installing mrjob

First, make sure you have a newer version of Python installed than the system default (2.4.x on distributions where yum depends on it). Install it alongside the existing Python distribution rather than replacing it, because replacing it breaks yum.

Install mrjob on one of the machines in your Hadoop cluster. It is nicer to use virtualenv to create an isolated environment.
wget -O virtualenv.py http://bit.ly/virtualenv
/usr/bin/python26 virtualenv.py pythonenv
pythonenv/bin/easy_install pip
pythonenv/bin/pip install mrjob

The current version available to me is "mrjob==0.3.3.2".

There is a small, ugly hack that you need to make in one of the files: pythonenv/lib/python2.6/site-packages/mrjob/hadoop.py at line number 444.

I am not sure if I am doing something wrong, but it throws an exception because "self._start_step_num" is None.

Replace the code there with the following lines.

# look for a Python trace-back
cause = None
if self._start_step_num and step_num:
    cause = self._find_probable_cause_of_failure(
               [step_num + self._start_step_num])

You also need to set the HADOOP_HOME variable.
export HADOOP_HOME=/usr/lib/hadoop

That's it, and you should be ready to use mrjob!

Writing a map-reduce program

Now we can run through the familiar word-count example.
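from mrjob.job import MRJob
import re

# Matches runs of word characters and apostrophes, e.g. "don't".
WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word on the input line.
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        # Pre-sum the counts on the mapper side to cut down shuffled data.
        yield (word, sum(counts))

    def reducer(self, word, counts):
        # Sum the partial counts to get the total for each word.
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()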
It is as simple as that. You create a class derived from MRJob and provide the essential methods: mapper, combiner and reducer.
Depending on what you want to do, you may need only a mapper, a mapper and a reducer, or all three.
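
As an illustration of the mapper-only case, here is a small grep-style sketch. It is written as a plain generator function so that it runs without mrjob installed; in an actual job it would be the mapper method of an MRJob subclass (with self as the first argument), and no combiner or reducer would be defined. The pattern and the function name are assumptions made for the example.

```python
import re

# Hypothetical pattern; any compiled regular expression would do.
PATTERN = re.compile(r"hadoop", re.IGNORECASE)


def grep_mapper(_, line):
    """Emit the line unchanged when it matches PATTERN, otherwise nothing."""
    if PATTERN.search(line):
        yield (line, None)
```

Note that such a mapper may emit nothing at all for a given input, unlike the word-count mapper.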

Running map-reduce with mrjob

To run this program, issue the following command:
pythonenv/bin/python wordcount.py hdfs:///path/to/file/inhdfs -r hadoop --python-bin python26 --step-num=1

hdfs:///path/to/file/inhdfs => input directory or file in HDFS
-r hadoop => tells mrjob to run the job on the Hadoop cluster
--python-bin python26 => use the newer Python executable
--step-num=1 => tells mrjob which step to execute

You should now be able to run the map-reduce job successfully using mrjob.
The input to a mapper is a line and its output is a (key, value) pair. In this case, the output is a (keyword, 1) pair.

The reducer receives a key along with all the values emitted for it and reduces them. In the above program, it outputs (keyword, occurrences) pairs.
Streaming final output from hdfs:///somepath/tmp/mrjob/test.admin.20120506.133838.502705/output
"a" 2
"about" 1
"adapting" 1
"again" 2
......
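
The data flow behind that output can be sketched locally without Hadoop or mrjob. The functions below mirror the mapper, combiner and reducer of the word-count job as plain generators; run_locally and group_by_key are hypothetical helpers that imitate, in a very simplified way, how values are grouped by key. Each inner list of lines stands in for the input of one mapper.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def mapper(_, line):
    for word in WORD_RE.findall(line):
        yield (word.lower(), 1)


def combiner(word, counts):
    yield (word, sum(counts))


def reducer(word, counts):
    yield (word, sum(counts))


def group_by_key(pairs):
    """Collect the values of (key, value) pairs into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_locally(splits):
    """Simulate map -> combine -> reduce over a list of input splits."""
    shuffled = []
    for lines in splits:
        mapped = [pair for line in lines for pair in mapper(None, line)]
        # The combiner pre-aggregates the output of a single mapper.
        for key, values in group_by_key(mapped).items():
            shuffled.extend(combiner(key, values))
    # The reducer sees every key once, with all its partial counts.
    grouped = group_by_key(shuffled)
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key]))
    return results
```

For example, run_locally([["a again a"], ["about again"]]) combines the two occurrences of "a" within the first split before the reducer adds up the counts of "again" coming from both splits.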

You can also provide multiple inputs by listing them on the command line when you invoke the job:
hdfs:///path/to/file/inhdfs1 hdfs:///path/to/file/inhdfs2 hdfs:///path/to/file/inhdfs3
You can store the output in HDFS or in a local path with another option to the mrjob command:
--output-dir hdfs:///pathto/wordcount/output/2345

You have to ensure that the parent directory exists in HDFS and that the output directory itself does not, or else the job will fail with an error.
hadoop fs -mkdir hdfs:///pathto/wordcount
hadoop fs -rmdir hdfs:///pathto/wordcount/output/2345

In the word-count example, the reducer phase always produces some output as long as the input file is non-empty.
Other map-reduce programs, such as grep or regular-expression matching, may not yield any output. Hadoop map-reduce can treat this as a failure (grep, for instance, exits with a non-zero status when nothing matches).

To avoid the issue, pass the following option to your map-reduce program:
--jobconf stream.non.zero.exit.is.failure=false
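
When the number of options grows, it can be convenient to assemble the command line in a small script. The helper below is only a sketch: build_command is a hypothetical function, not part of mrjob, and it uses only the paths and flags shown in this post.

```python
def build_command(script, inputs, output_dir=None, jobconf=None,
                  python_bin="python26"):
    """Assemble the argument list for running an mrjob script on Hadoop."""
    command = ["pythonenv/bin/python", script]
    command.extend(inputs)
    # Same runner, interpreter and step options as in the command above.
    command.extend(["-r", "hadoop", "--python-bin", python_bin,
                    "--step-num=1"])
    if output_dir is not None:
        command.extend(["--output-dir", output_dir])
    # One --jobconf flag per setting, in a stable order.
    for name, value in sorted((jobconf or {}).items()):
        command.extend(["--jobconf", "%s=%s" % (name, value)])
    return command
```

The resulting list can be passed to subprocess.call, or joined with spaces to inspect the command before running it.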

There are a few more options that allow you to write elaborate map-reduce programs using mrjob. Check out the documentation for the details.
