Thursday, March 15, 2012

External tables in Hive are handy

Usually, when you create a table in Hive from raw data in HDFS, Hive moves the data to a different location: "/user/hive/warehouse". A simple (managed) table lives inside this data warehouse directory. A table named "user", for example, has its data located at "/user/hive/warehouse/user".

Suppose the raw data is located at "/home/admin/userdata/data1.txt". If you issue the following hive command, the data is moved to a new location at "/user/hive/warehouse/user/data1.txt".
hive> LOAD DATA INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;

If all you want to do is run Hive queries, this is fine. But when you drop the table, the raw data is lost, because the directory corresponding to the table in the warehouse is deleted.
You may not want the raw data deleted, since someone else might use it in map-reduce programs outside of Hive. It is far more convenient to keep the data at its original location by using "EXTERNAL" tables.
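
To make this concrete, here is a minimal sketch in Python that only builds the HiveQL statement for an external table; it does not talk to Hive. The helper name external_table_ddl, the column list and the comma delimiter are my assumptions for illustration. Only the table name and the data directory come from the example above. Note that LOCATION points at the directory holding the data, not at the file itself.

```python
def external_table_ddl(table, columns, location, delimiter=","):
    """Build a HiveQL CREATE EXTERNAL TABLE statement (illustrative helper)."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"LOCATION '{location}';"
    )


ddl = external_table_ddl(
    "user",
    [("id", "INT"), ("name", "STRING")],  # hypothetical columns
    "/home/admin/userdata",               # directory that holds data1.txt
)
print(ddl)
```

The printed statement can be pasted at the hive> prompt. With an external table, dropping the table removes only the metadata, so the files under the given location stay where they are.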

Tuesday, March 13, 2012

Introducing ØMQ and pyzmq through examples

ØMQ is a messaging library that has the capability to revolutionize distributed software development.

Unlike full-fledged messaging systems, it provides the right set of abstractions to incorporate various messaging patterns. It also provides the concept of devices, which allow the creation of complex network topologies.

To get a quick overview, you can read the introduction to ØMQ by Nicholas Piël.

ØMQ sockets are a light abstraction on top of native sockets.
This lets ØMQ remove certain constraints and add new ones that make writing messaging infrastructure a breeze.
  • Each ØMQ socket adheres to a predefined messaging pattern, which has to be chosen when the socket is created.
  • A ØMQ socket can connect to many ØMQ sockets, unlike a native socket.
  • There are constraints on which types of ØMQ sockets can connect to each other.
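
The three points above can be sketched in a few lines of pyzmq. This is only a minimal illustration, assuming pyzmq is installed and everything runs in a single process over the inproc transport; the endpoint names worker-a and worker-b are made up for the example.

```python
import zmq

ctx = zmq.Context()

# The messaging pattern is fixed when the socket is created: two REP (reply)
# sockets act as workers, and one REQ (request) socket acts as the client.
workers = {}
for name in (b"a", b"b"):
    rep = ctx.socket(zmq.REP)
    rep.bind("inproc://worker-" + name.decode())  # hypothetical endpoint names
    workers[rep] = name

# Unlike a native socket, a single ØMQ socket can connect to many peers.
# Socket types must be compatible: a REQ socket pairs with REP sockets.
req = ctx.socket(zmq.REQ)
req.setsockopt(zmq.RCVTIMEO, 1000)  # raise zmq.Again instead of waiting forever
for name in workers.values():
    req.connect("inproc://worker-" + name.decode())

poller = zmq.Poller()
for rep in workers:
    poller.register(rep, zmq.POLLIN)

replies = []
for _ in range(2):
    req.send(b"ping")
    # Find whichever worker received this request and answer it.
    for sock, _event in poller.poll(timeout=1000):
        sock.recv()
        sock.send(b"pong from " + workers[sock])
    replies.append(req.recv())

print(replies)

for sock in [req, *workers]:
    sock.close(linger=0)
ctx.term()
```

A REQ socket spreads its requests across the peers it is connected to, so each worker should get to answer one of the two pings. In a real setup the workers would live in separate processes or machines and use tcp endpoints instead of inproc.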

ØMQ has bindings for many languages, including Python (pyzmq), and that makes it very interesting.

It has been fun learning the basics, and I hope to create some real-world examples soon to sharpen my knowledge of ØMQ. Till then, I hope the mini tutorial on ØMQ and pyzmq will serve as a good introduction to its capabilities.

Check out :

It is quite easy to get started. Use virtualenv and pip.
pip install pyzmq-static
pip install tornado

Check out the code from
Follow some of the annotated examples to see the awesomeness of ØMQ.

Do post your feedback on the mini tutorial here as comments.