tag:blogger.com,1999:blog-90832217087796975972024-03-18T21:07:49.247-07:00Programmer's notebookAshish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.comBlogger34125tag:blogger.com,1999:blog-9083221708779697597.post-53381601429054326842014-09-17T22:08:00.001-07:002014-09-17T22:12:53.150-07:00Building and packaging a python application for distribution<div dir="ltr" style="text-align: left;" trbidi="on">
Building a Python application package that is easy to distribute has always felt messy to me, though Python tooling has come a long way.
<br>
"<b>pip/wheel</b>" helps you install, manage and distribute individual packages. "<b>virtualenv</b>" provides an approachable way to create isolated environments and avoid polluting the main distribution. "<b>zc.buildout</b>" lets you assemble, reproduce and deploy a Python application through configuration. Together, they provide a powerful framework for building and distributing Python applications.
However, it is not as simple as the build-once, distribute-everywhere model of executable or jar distribution.
In all likelihood, you will be creating an isolated virtual environment, then installing the dependencies and the application into it.
</div><a href="http://pyfunc.blogspot.com/2014/09/building-and-packaging-python.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com100tag:blogger.com,1999:blog-9083221708779697597.post-31579347442643700272013-08-14T01:13:00.002-07:002013-08-14T10:13:19.499-07:00Centralized logging for distributed applications with pyzmqSimpler distributed applications can take advantage of centralized logging. PyZMQ, the Python bindings for ØMQ, provides log handlers for the Python logging module and can easily be used for this purpose. The log handler utilizes the ØMQ Pub/Sub pattern and broadcasts log messages through a PUB socket. It is quite easy to construct a message collector that writes the messages to a central location.
<pre>
+-------------+
|Machine1:App1+---------------------+
+-------------+                     |
                                    v
+-------------+             +---------------+
|Machine1:App2+------------>|Machine3:Logger|
+-------------+             +---------------+
                                    ^
+-------------+                     |
|Machine2:App1+---------------------+
+-------------+
</pre>
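The flow in the diagram can be sketched with the standard library alone. In the real setup, PyZMQ's PUBHandler and a SUB-socket collector play these roles; here a plain queue.Queue stands in for the PUB/SUB sockets so the sketch runs anywhere (logger names like "Machine1.App1" are illustrative):

```python
# Stand-in for the ØMQ PUB/SUB log pipeline: each app attaches a handler that
# publishes records into a shared channel, and one collector drains them to a
# central location. A queue.Queue replaces the PUB/SUB sockets in this sketch.
import logging
import logging.handlers
import queue

log_queue = queue.Queue()

# Each "app" gets a logger whose handler publishes into the shared channel.
app1 = logging.getLogger("Machine1.App1")
app1.addHandler(logging.handlers.QueueHandler(log_queue))
app1.setLevel(logging.INFO)

app1.info("job started")

# The central collector: read records and write them to one place.
collected = []
while not log_queue.empty():
    record = log_queue.get()
    collected.append("%s: %s" % (record.name, record.getMessage()))

print(collected)  # ['Machine1.App1: job started']
```

With PyZMQ, swapping QueueHandler for PUBHandler and running the drain loop behind a SUB socket on the logger machine gives the same shape over the network.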
<a href="http://pyfunc.blogspot.com/2013/08/centralized-logging-for-distributed.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com196tag:blogger.com,1999:blog-9083221708779697597.post-65971651841580678562012-06-07T23:12:00.001-07:002012-06-12T11:06:08.907-07:00Ingest data from database into Hadoop with Sqoop (2)<div dir="ltr" style="text-align: left;" trbidi="on">
Here, I explore a few other variations for importing data from a database into HDFS.
This is a continuation of the <a href="http://pyfunc.blogspot.com/2012/06/ingest-data-from-database-into-hdfs-for.html">previous article</a>.<br><br>
The sqoop commands listed previously were good for a one-time fetch, when you want to import all the current data for a table in the database. <br><br>
A more practical workflow is to fetch data regularly and incrementally into HDFS for analysis, without re-importing any previously imported data.
For this, you have to mark a column for incremental import and also provide an initial value. This column usually happens to be a time-stamp.
<pre>
sqoop import \
  --connect jdbc:oracle:thin:@//HOST:PORT/DB \
  --username DBA_USER \
  -P \
  --table TABLENAME \
  --columns "column1,column2,column3,.." \
  --as-textfile \
  --target-dir /target/directory/in/hdfs \
  -m 1 \
  --check-column COLUMN3 \
  --incremental lastmodified \
  --last-value "LAST VALUE"
</pre><br>
</div><a href="http://pyfunc.blogspot.com/2012/06/ingest-data-from-database-into-hdfs-for_07.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com5tag:blogger.com,1999:blog-9083221708779697597.post-64366560292729131342012-06-07T21:15:00.002-07:002012-06-12T11:07:01.967-07:00Ingest data from database into Hadoop with Sqoop (1)<div dir="ltr" style="text-align: left;" trbidi="on">
Sqoop is an easy tool for importing data from databases into HDFS, and for exporting data from Hadoop/Hive tables back to databases.
Databases have been the de-facto standard for storing structured data, but running complex queries on large data sets can be detrimental to their performance.<br>
It is sometimes useful to import the data into Hadoop for ad hoc analysis. Tools like Hive and raw map-reduce can provide tremendous flexibility in performing various kinds of analysis.<br>
This becomes particularly useful when the database has been used mostly as a storage device (e.g., storing XML or unstructured string data as CLOB data).
<br><br>
Sqoop is very simple on its face. Internally, it uses map-reduce to import data from the database in parallel, over a JDBC connection.
<br><br>
I am jumping straight into using sqoop with an Oracle database and will leave installation for another post.
<br><br>
Sqoop commands are executed from the command line using the following structure:
<pre>sqoop COMMAND [ARGS]</pre>
All available sqoop commands can be listed with: sqoop help<br><br>
This article focuses on importing from a database, specifically Oracle DB.
<br></div><a href="http://pyfunc.blogspot.com/2012/06/ingest-data-from-database-into-hdfs-for.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com26tag:blogger.com,1999:blog-9083221708779697597.post-24769602391467610242012-05-16T20:47:00.001-07:002012-05-16T20:59:36.777-07:00Hadoop Map-Reduce with mrjobWith Hadoop, you have the most flexibility in accessing files and running map-reduce jobs when using Java. All other languages need to go through Hadoop streaming, which can feel like second-class citizenship in Hadoop programming.<br><br>
For those who like to write map-reduce programs in python, there are good toolkits available, such as <a href="http://packages.python.org/mrjob/index.html">mrjob</a> and <a href="https://github.com/klbostee/dumbo/">dumbo</a>.<br>
Internally, they still use Hadoop streaming to submit map-reduce jobs, but they simplify the submission process considerably.
My own experience with mrjob has been good so far; installing and using it is easy.
<br><br>
<b>Installing mrjob</b>
<br><br>
First, install a newer version of Python than the default that ships with the Linux distribution (2.4.x, which yum depends on). Make sure you do not replace the existing Python installation, as that breaks "yum".
<br><br>
Install mrjob on one of the machines in your Hadoop cluster. It is nicer to use virtualenv to create an isolated environment.
<pre class="shell">
wget -O virtualenv.py http://bit.ly/virtualenv
/usr/bin/python26 virtualenv.py hadoopenv
hadoopenv/bin/easy_install pip
hadoopenv/bin/pip install mrjob
</pre><br>
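To see the model mrjob wraps, here is a plain-Python sketch of a streaming-style word count: a mapper emits (word, 1) pairs and a reducer sums the counts per key. This is an illustration of the map/shuffle/reduce flow, not mrjob's actual API:

```python
# Word count in the Hadoop-streaming style that mrjob builds on: map each
# input line to (word, 1) pairs, sort/group by key, then reduce each group.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    yield word, sum(counts)

def run(lines):
    # The shuffle/sort step: group mapper output by key, as Hadoop would.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    result = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for key, total in reducer(word, (c for _, c in group)):
            result[key] = total
    return result

print(run(["hello world", "hello hadoop"]))  # {'hadoop': 1, 'hello': 2, 'world': 1}
```

With mrjob, the mapper and reducer become methods on a job class and the framework handles the shuffle, whether running locally or on the cluster.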
<a href="http://pyfunc.blogspot.com/2012/05/hadoop-map-reduce-with-mrjob.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com140tag:blogger.com,1999:blog-9083221708779697597.post-44956811923078137462012-05-15T19:21:00.001-07:002012-05-15T21:59:49.482-07:00HBase pseudo-cluster installation<div dir="ltr" style="text-align: left;" trbidi="on">
I have been preparing a VM with HBase installed in pseudo-cluster mode for experimental purposes.
There are quite a few useful blogs on installing HBase. I settled on the following minimal installation procedure.<br>
<br>
I am blogging it for future reference. Hopefully it will help others too.<br>
<br>
Before proceeding to install Hbase in pseudo cluster mode, you can check out the procedures for installing <a href="http://pyfunc.blogspot.com/2012/05/hadoop-pseudo-cluster-installation.html" target="_blank">Hadoop in pseudo-cluster mode</a>.<br>
<br>
A few tweaks are required in OS configuration. Add the following to <u>/etc/security/limits.conf</u>:
<br>
<ul>
<li>hdfs - nofile 32768</li>
<li>hbase - nofile 32768</li>
</ul>
<br>
A few changes are also required to the hadoop configuration that I described earlier.
Add the following to <u>hdfs-site.xml</u>:
<pre><property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
</pre>
<br>
</div><a href="http://pyfunc.blogspot.com/2012/05/hbase-pseudo-cluster-installation.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com4tag:blogger.com,1999:blog-9083221708779697597.post-48770162237331264272012-05-14T16:55:00.003-07:002012-05-14T17:03:27.249-07:00Hadoop pseudo-cluster installation<div dir="ltr" style="text-align: left;" trbidi="on">
Install Java and the Cloudera yum repo
<br>
<pre>yum install java-1.6.0-openjdk.x86_64
curl -O http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo
mv cloudera-cdh3.repo /etc/yum.repos.d/
</pre>
<br>
Ensure that you have hostname and localhost entries in /etc/hosts
<br>
<pre>comment out the ipv6 entry in /etc/hosts</pre><br>
Create hadoop user and group manually
<br>
<pre>Create "hdfs" and "mapred" user with group "hadoop"
groupadd hadoop
useradd -G hadoop hdfs
useradd -G hadoop mapred
passwd hdfs
passwd mapred
</pre>
<br>
</div><a href="http://pyfunc.blogspot.com/2012/05/hadoop-pseudo-cluster-installation.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com73tag:blogger.com,1999:blog-9083221708779697597.post-76988826513173838042012-05-08T09:25:00.000-07:002012-05-08T15:40:38.658-07:00Few things to take care while building vagrant boxes<div dir="ltr" style="text-align: left;" trbidi="on">
Following are some of the tricks that were useful to me while creating an Oracle Enterprise Linux vagrant box.
<br>
Create the VM using the VDI format for easy handling.
<br>
Make sure you have removed all the extraneous packages from the installed vm.<br>
You can check out package descriptions at <a href="http://pkgs.org/search/?keyword=util-linux" rel="nofollow" target="_blank">pkgs.org</a>. <br>
<pre class="shell">yum remove X11
yum list installed | grep gnome</pre>
<br>
Also ensure that yum installs only the relevant language support.
<br>
<pre>Edit /etc/rpm/macros.lang and include
%_install_langs en:fr
</pre>
<br>
</div><a href="http://pyfunc.blogspot.com/2012/05/few-things-to-take-care-while-building.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com5tag:blogger.com,1999:blog-9083221708779697597.post-52881806346633534352012-03-15T10:12:00.001-07:002012-03-15T15:20:23.279-07:00External tables in Hive are handy<div dir="ltr" style="text-align: left;" trbidi="on">
Usually, when you create tables in Hive from raw data in HDFS, Hive moves the data to a different location: "/user/hive/warehouse".
If you create a simple table, its data will be located inside the warehouse. The following Hive command creates a table whose data location is "/user/hive/warehouse/user".
<br>
<pre>hive> CREATE TABLE user(id INT, name STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' STORED AS TEXTFILE;
</pre>
<br>
Consider that the raw data is located at "/home/admin/userdata/data1.txt". If you issue the following hive command, the data is moved to a new location, "/user/hive/warehouse/user/data1.txt".
<br>
<pre>hive> LOAD DATA INPATH '/home/admin/userdata/data1.txt' INTO TABLE user;
</pre>
<br>
If all we want is to run Hive queries, that is fine. But when you drop the table, the raw data is lost, because the directory corresponding to the table in the warehouse is deleted.
<br>
You may also not want to delete the raw data, since someone else might use it in map-reduce programs outside of Hive. It is far more convenient to retain the data at its original location via "EXTERNAL" tables. <br>
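For illustration, an external table over the same columns might be declared like this (the table name and path are illustrative; LOCATION must point at an HDFS directory holding the raw files):

```sql
-- The data stays at its original location; DROP TABLE removes only the
-- table metadata, not the underlying files.
hive> CREATE EXTERNAL TABLE user_ext(id INT, name STRING) ROW FORMAT
      DELIMITED FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n' STORED AS TEXTFILE
      LOCATION '/home/admin/userdata';
```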
</div><a href="http://pyfunc.blogspot.com/2012/03/external-tables-in-hive-are-handy.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com24tag:blogger.com,1999:blog-9083221708779697597.post-57089457982989394292012-03-13T11:57:00.001-07:002012-03-13T14:59:22.599-07:00Introducing ØMQ and pyzmq through examples<div dir="ltr" style="text-align: left;" trbidi="on">
ØMQ is a messaging library that has the capability to revolutionize distributed software development. <br />
<br />
Unlike full-fledged messaging systems, it provides the right set of abstractions to incorporate various messaging patterns. It also provides the concept of devices, which allows the creation of complex network topologies. <br />
<br />
To get a quick overview, you can read the <a href="http://nichol.as/zeromq-an-introduction" target="_blank">introduction to ØMQ</a> by Nicholas Piël.<br />
<br />
ØMQ sockets are a light abstraction on top of native sockets.<br />
This allows ØMQ to remove certain constraints and add new ones, which makes writing messaging infrastructure a breeze.<br />
<ul style="text-align: left;">
<li>ØMQ sockets adhere to predefined messaging patterns; the pattern has to be chosen at socket creation time.</li>
<li>An ØMQ socket can connect to many other ØMQ sockets, unlike native sockets.</li>
<li>There are constraints on which types of ØMQ sockets can connect to each other.</li>
</ul>
<div style="text-align: left;">
<br />
ØMQ has bindings for many languages, including Python (pyzmq), which makes it very interesting. <br />
<br />
It has been fun learning the basics, and I hope soon to create some real-world examples to deepen my knowledge of ØMQ. Till then, I hope this mini tutorial on ØMQ and pyzmq will serve as a good introduction to its capabilities. <br />
<br />
Check out: <a href="http://readthedocs.org/docs/learning-0mq-with-pyzmq/en/latest/index.html">http://readthedocs.org/docs/learning-0mq-with-pyzmq/en/latest/index.html</a><br />
<br />
It is quite easy to get started. Use virtualenv and pip.<br />
<pre>pip install pyzmq-static
pip install tornado</pre>
<br />
Check out the code from <a href="https://github.com/ashishrv/pyzmqnotes">https://github.com/ashishrv/pyzmqnotes</a><br />
Follow some of the annotated examples to see the awesomeness of ØMQ.<br />
<br />
Do post your feedback on the mini tutorial here as comments. </div>
</div>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com1tag:blogger.com,1999:blog-9083221708779697597.post-22211208051096900372011-12-01T23:52:00.001-08:002011-12-13T10:21:17.875-08:00My experiences with Fabric based deployment automation<div dir="ltr" style="text-align: left;" trbidi="on">
Many good tools are available for configuration management and application deployment.<br>
<a href="http://puppetlabs.com/" target="_blank">Puppet</a> and <a href="http://www.opscode.com/chef/" target="_blank">Chef</a> have attained cult status among dev-ops teams. There are good tools available in Python too; <a href="http://saltstack.org/" target="_blank">Salt</a> may soon become a viable alternative and definitely looks promising to me.
<a href="http://agiletesting.blogspot.com/2010/03/automated-deployment-systems-push-vs.html" target="_blank">Push vs. pull</a> is a distinction commonly used to classify the various tools in this ecosystem.<br>
<br>
<a href="http://docs.fabfile.org/" target="_blank">Fabric</a> is an excellent tool that lets you weave together operations locally and remotely on a cluster of machines: deploying applications, starting and stopping services, and performing other tasks across the cluster.
There are a few good tutorials to help you get familiar with Fabric. If you haven't read them already, you should:<br>
<ol style="text-align: left;">
<li><a href="http://yuji.wordpress.com/2011/04/09/django-python-fabric-deployment-script-and-example/">An example on deploying django using Fabric</a></li>
<li><a href="https://docs.google.com/present/view?id=0AcvwZqy5XUWkZGN6eGp4ZHFfMjZmYmg2cjNjdw&hl=en_US&pli=1">A presentation on using Fabric</a></li>
<li><a href="http://blog.bixly.com/post/908893709/this-week-ryan-guides-use-through-fabric-a-python" target="_blank">A video on Fabric usage</a></li>
</ol>
I have used Fabric to automate deployment of a <a href="http://hadoop.apache.org/" target="_blank">Hadoop</a> / <a href="http://hive.apache.org/" target="_blank">Hive</a> application and of <a href="http://www.nagios.org/" target="_blank">Nagios</a>, on clusters of machines on <a href="http://aws.amazon.com/" target="_blank">EC2</a>, on a private cloud based on <a href="http://cloudstack.com/" target="_blank">Cloudstack</a>, and on commodity machines.
<br>
<br>
The code grew from nifty little commands and functions, such as setting the fully qualified domain hostname (<a href="http://en.wikipedia.org/wiki/Fully_qualified_domain_name" target="_blank">FQDN</a>), creating users and groups on Linux, and installing yum packages, into a complete system of commands that installs and brings up a <a href="http://web.mit.edu/kerberos/" target="_blank">Kerberos</a>-enabled secure hadoop cluster using <a href="http://www.cloudera.com/" target="_blank">Cloudera</a> hadoop packages.<br>
<br>
The code soon became unwieldy.
<br>
<br>
There are a few practices that help contain the complexity that grows when you use Fabric enthusiastically.<br>
<br>
</div><a href="http://pyfunc.blogspot.com/2011/12/my-experiences-with-fabric-based.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com3tag:blogger.com,1999:blog-9083221708779697597.post-87932644707913432552011-12-01T11:30:00.001-08:002011-12-13T10:21:36.000-08:00Installing funkload on Mac<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://funkload.nuxeo.org/" target="_blank">Funkload</a> is a useful tool for understanding the characteristics of an application server under stress and load conditions.<br>
<br>
Installing Funkload is very straightforward using <a href="http://www.virtualenv.org/en/latest/index.html" target="_blank">virtualenv</a> and <a href="http://www.macports.org/" target="_blank">macports</a> on the Mac.<br>
If you aren't using them already, you should think about checking them out.<br>
<br>
Create an isolated environment for installing Funkload.<br>
<br>
<pre>virtualenv --no-site-packages loadtest
source loadtest/bin/activate
pip install yolk</pre>
<br>
</div><a href="http://pyfunc.blogspot.com/2011/12/installing-funkload-on-mac.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com4tag:blogger.com,1999:blog-9083221708779697597.post-46136052263417284042011-11-29T10:32:00.001-08:002012-04-23T21:36:14.666-07:00Creating base box from scratch for Vagrant<div dir="ltr" style="text-align: left;" trbidi="on">
<br>
At <a href="http://www.vagrantbox.es/" target="_blank">vagrantbox.es</a>, you can find boxes for many flavours like CentOS, Ubuntu, Debian etc.<br>
<br>
However, you might require an OS flavour that is not already packaged for you.<br>
In such a case, you might want to package it for use with Vagrant yourself.<br>
I needed an Oracle Enterprise Linux box.<br>
<br>
Following is a step by step approach to create a base box for Oracle Enterprise Linux 5.7 64 bit version.<br>
<b><br></b><br>
<b>Creating a VM on VirtualBox</b>
<br>
<br>
<span class="Apple-style-span" style="color: blue;">Step 1</span>: Get the ISO file from which we will install the Oracle Enterprise Linux.<br>
<br>
<span class="Apple-style-span" style="color: blue;">Step 2</span>: Create your virtual machine on VirtualBox.<br>
<br>
<pre> Create a new Virtual Machine
Type: VMDK
Name : oel57
Base memory size: 512 MB, Memory Space Maximum 40 GB
Enable Host I/O cache
</pre>
<br>
</div><a href="http://pyfunc.blogspot.com/2011/11/creating-base-box-from-scratch-for.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com21tag:blogger.com,1999:blog-9083221708779697597.post-30284959999969042232011-11-28T13:50:00.001-08:002011-11-28T15:22:17.092-08:00Using Vagrant<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://vagrantup.com/" target="_blank">Vagrant</a> is a great tool for creating a VM on a whim and tearing it down so that you can start all over again. It helps to start from a clean state when you are testing deployments and setups. Vagrant requires <a href="https://www.virtualbox.org/" target="_blank">VirtualBox</a> and is written in Ruby.<br>
<br>
Following is a step-by-step walk-through of how to set up and use Vagrant on a Mac.<br>
<br>
</div><a href="http://pyfunc.blogspot.com/2011/11/using-vagrant.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com6tag:blogger.com,1999:blog-9083221708779697597.post-54900276342718092392010-11-02T11:58:00.000-07:002011-12-13T10:22:11.364-08:00Learning Twisted (part 8) - Anatomy of deferreds in TwistedThere are numerous posts and documents giving a conceptual explanation of one of the central concepts in the Twisted framework: the deferred. <br>
<br>
The book on Twisted network programming offers an analogy: deferreds are like the buzzers a restaurant owner hands to waiting visitors. The buzzer notifies the visitor that the table is ready; he can set aside whatever he has been doing and come over to occupy the table meant for him.<br>
<br>
Others describe a deferred as a placeholder for a promise that is yet to be fulfilled. We can attach actions that should follow when the promise is fulfilled or breached; these actions form callback chains that are triggered when the deferred fires.<br>
<br>
Deferreds thus let you register follow-up actions for something that will take time to complete. This frees Twisted to attend to other tasks and come back to execute the follow-up actions once the result is available. <br>
<br>
I will keep myself to code commentary and the current behavior of deferreds.<br>
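To keep the mental model concrete before diving into Twisted's code, here is a toy stand-in (not Twisted's Deferred, which is far richer, with errbacks, chained deferreds and more) showing how a result flows through the callback chain when the deferred fires:

```python
# Toy illustration of a deferred's callback chain.
class MiniDeferred:
    def __init__(self):
        self.callbacks = []
        self.fired = False
        self.result = None

    def addCallback(self, fn):
        if self.fired:
            # Registered after firing: run immediately against the result.
            self.result = fn(self.result)
        else:
            self.callbacks.append(fn)
        return self

    def callback(self, result):
        # Fire the deferred: each callback receives the previous one's result.
        self.fired = True
        self.result = result
        for fn in self.callbacks:
            self.result = fn(self.result)

d = MiniDeferred()
d.addCallback(lambda r: r + 1)
d.addCallback(lambda r: r * 10)
d.callback(4)
print(d.result)  # 50
```

Note how a callback added after the deferred has fired still runs, against the stored result; Twisted's deferreds behave the same way.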
<a href="http://pyfunc.blogspot.com/2010/11/learning-twisted-part-8-anatomy-of.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com11tag:blogger.com,1999:blog-9083221708779697597.post-75351019197828828542010-10-29T15:48:00.000-07:002011-12-13T10:23:43.811-08:00Tracing call flows in PythonPython decorators come in handy when you want to intercept a piece of a call flow and profiling feels too verbose.<br>
I use this technique quite often to analyze a Python program and understand it better. <br>
<br>
Consider the following contrived Python code to illustrate this approach to tracing call flows.<br>
<br>
<pre class="brush: python">def f():
    f1('some value')

def f1(result):
    print result
    f2("f1 result")

def f2(result):
    print result
    f3("f2 result")
    fe("f2 result")
    return "f2 result"

def f3(result):
    print result
    return "f3 result"

def fe(result):
    print result

f()
</pre><br>
Output: <br>
<pre>some value
f1 result
f2 result
f2 result
</pre><br>
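The decorator itself can be as small as this (a sketch in modern Python; the version developed later in the post may differ): it wraps a function so that every call logs the entry, the arguments and the return value.

```python
# A minimal tracing decorator: prints entry, arguments and return value.
import functools

def trace(fn):
    @functools.wraps(fn)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print("-> %s%r" % (fn.__name__, args))
        result = fn(*args, **kwargs)
        print("<- %s returned %r" % (fn.__name__, result))
        return result
    return wrapper

@trace
def f3(result):
    return "f3 result"

f3("f2 result")
```

Decorating each function in the snippet above with @trace prints the whole call flow without touching the function bodies.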
<a href="http://pyfunc.blogspot.com/2010/10/tracing-callflows-in-python.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com4tag:blogger.com,1999:blog-9083221708779697597.post-4476523942473034372010-10-21T15:00:00.000-07:002010-10-22T17:29:18.292-07:00Before taking a dip into haskellI have been itching to start learning another language, and have been perusing rather voluminous opinions on the net about which language to learn.<br>
Too many opinions can freeze you from doing anything. In any case, I have taken the plunge and will start learning haskell, keeping a commentary on it here.<br>
<br>
Before I do that, I really wanted to have <a href="http://github.com/mrueegg/haskell_syntax_highlighter">Haskell syntax highlighting</a> support in blogger.<br>
<br>
I am yet to test it though, so here is a snippet that should come out highlighted. Of course, this code is not mine; it just serves to confirm that the highlighting works.<br>
<br>
<pre class="brush: hs" name="code">module Main where
main = putStrLn "Hello, World!"
</pre><a href="http://pyfunc.blogspot.com/2010/10/before-taking-dip-into-haskell.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com2tag:blogger.com,1999:blog-9083221708779697597.post-69908210231351272112010-10-20T00:26:00.000-07:002011-12-13T10:23:43.806-08:00Buildbot - Issue with svn pollerThe SVN poller may miss a check-in, depending on the poll interval. <br>
<br>
The current behavior of the poller is <br>
<br>
The poller polls the version control system and stores the last change (version number). Subsequent changes are noticed as log entries. These log entries are marked with the timestamp at which the changes were noticed, and are used to create change objects that are then passed to the scheduler to trigger builds. The scheduler sees several change objects with the same timestamp and picks only the latest one to trigger a build.<br>
<br>
<b>The issue with this model is that if there are multiple changes within a single polling interval, this poller will result in triggering build only for the last one.</b><br>
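The collapse can be illustrated with a simplified model (hypothetical sketch, not buildbot code): several commits noticed in one polling pass all share a "seen" timestamp, and a scheduler that keys on timestamps keeps only one of them.

```python
# Three commits are noticed in the same polling pass, so they all get the
# same "seen" timestamp; a scheduler that picks the latest change per
# timestamp triggers only one build.
polled_changes = [
    {"revision": 101, "seen_at": 1000},
    {"revision": 102, "seen_at": 1000},
    {"revision": 103, "seen_at": 1000},
]

def builds_triggered(changes):
    # Keep only the latest change object per timestamp, as described above.
    latest = {}
    for change in changes:
        latest[change["seen_at"]] = change
    return [c["revision"] for c in latest.values()]

print(builds_triggered(polled_changes))  # [103] - revisions 101 and 102 never build
```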
<a href="http://pyfunc.blogspot.com/2010/10/buildbot-issue-with-svn-poller.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com2tag:blogger.com,1999:blog-9083221708779697597.post-37423165509531611012010-10-13T23:38:00.001-07:002010-10-21T15:45:12.012-07:00Python wisdom from stackoverflow #1I started participating on "stack overflow" hoping to improve my knowledge of topics of interest. What could be better than answering questions, working on problems posted by users and looking at the answers provided by various folks from the community? <br>
<br>
In many posts, I found very elegant ways of attacking a problem that I had never thought of. It was clear that there are nuggets of wisdom buried in "stack overflow", and that it would be difficult to go back and find them later. So I started collecting weekly wisdom on my topic of interest, which is usually Python programming. The good thing is that these will be unrelated snippets; the bad thing is that there isn't any central theme to these posts. <br>
<br>
Starting with this post, I will try to pull some neat solutions provided there for reference and later perusal.<br>
<br>
<b>#1 : rounding numbers to two decimal places</b><br>
<br>
<pre class="brush: python">anFloat = 1234.55555
print repr(round(anFloat, 2))
# Output : 1234.5599999999999 (binary floats cannot represent 1234.56 exactly)
rounded = "%.2f" % round(anFloat, 2)
print rounded
# Output: 1234.56
</pre><a href="http://pyfunc.blogspot.com/2010/10/python-wisdom-from-stackoverflow-1.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com1tag:blogger.com,1999:blog-9083221708779697597.post-9138654174394338792010-10-13T10:31:00.000-07:002011-12-13T10:23:43.825-08:00Setting up buildbot - customizing configuration fileThe crux of BuildBot involves a master and multiple build slaves that can be distributed across many computers.<br>
<br>
Each Builder is configured with a list of BuildSlaves that it will use for its builds. Within a single BuildSlave, each Builder creates its own SlaveBuilder instance.<br>
Once a SlaveBuilder is available, the Builder pulls one or more BuildRequests off its incoming queue. These requests are merged into a single Build instance, which includes the SourceStamp describing the exact version of the source code to be used for the build. The Build is then randomly assigned to a free SlaveBuilder and the build begins.<br>
<br>
All this is configured via a single file called master.cfg, a dictionary of various keys that configures the buildbot when it starts up.<br>
Open up the sample "master.cfg" that comes with the buildbot distribution, drop it into the master directory that you created, and start hacking on it.<br>
<br>
I have listed a few important configuration keys that should get you started.<br>
Below is the dictionary instance that is populated in the configuration file:<br>
<pre class="brush: python">c = BuildmasterConfig = {}
</pre><a href="http://pyfunc.blogspot.com/2010/10/setting-up-buildbot-customizing.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com1tag:blogger.com,1999:blog-9083221708779697597.post-17285667231674050842010-10-06T16:15:00.000-07:002011-12-13T10:22:11.344-08:00Learning Twisted (part 7) : Understanding protocol class implementationIn my last post, I focused on the protocol factory class, the various methods it needs to provide and the code flow within which those methods get invoked.<br>
Here we will look into the structure of the protocol class, the various methods it needs to provide and the context in which they are called.<br>
<br>
There are two ways to look this up and learn it:<br>
<br>
<ul><li>Look at the interface definition: IProtocol(Interface) in interfaces.py</li>
<li>As in my previous post, supply a protocol class with no methods and look at the traceback to understand the code flow</li>
</ul><br>
So, the usual imports for writing a custom protocol:<br>
<br>
<pre class="brush: python">from twisted.web import proxy
from twisted.internet import reactor
from twisted.internet import protocol
from twisted.python import log
import sys
log.startLogging(sys.stdout)
</pre><br>
It is much better to derive from protocol.Protocol to build a custom protocol; it does a few things for you.<br>
<blockquote class="left">Any intricate logic should be built using the connect, disconnect and data-received event handlers, plus the methods that write data onto the connection.</blockquote>The <b>makeConnection</b> method sets the transport attribute and also calls the <b>connectionMade</b> method; you can use this to start communicating once the connection has been established.<br>
The <b>dataReceived</b> method is called when there is data to be read off the connection. <b>connectionLost</b> is called when the transport connection is lost for some reason. To write data on the connection, you use the transport method <b>self.transport.write</b>; this adds the data to a buffer which will be sent across the connection. To make twisted send the buffer immediately, you can call <b>self.transport.doWrite</b>.<br>
<br>
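Putting those methods together, here is a minimal echo protocol written against that interface. To keep it runnable without a reactor, a stub transport (FakeTransport, my own stand-in) replaces the one Twisted would supply; the method names follow the interface above, but this is an illustrative sketch, not Twisted code:

```python
# Echo protocol sketch following the Protocol lifecycle described above.
class FakeTransport:
    """Stand-in for the transport Twisted passes to makeConnection."""
    def __init__(self):
        self.written = []

    def write(self, data):
        self.written.append(data)

class Echo:
    def makeConnection(self, transport):
        # Twisted's base class does this for you, then calls connectionMade.
        self.transport = transport
        self.connectionMade()

    def connectionMade(self):
        print("connection established")

    def dataReceived(self, data):
        # Echo whatever arrives back over the connection.
        self.transport.write(data)

    def connectionLost(self, reason=None):
        print("connection lost")

proto = Echo()
transport = FakeTransport()
proto.makeConnection(transport)
proto.dataReceived(b"hello")
print(transport.written)  # [b'hello']
```

With Twisted itself, you would subclass protocol.Protocol, hand the class to a factory, and let the reactor drive these calls.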
<a href="http://pyfunc.blogspot.com/2010/10/learning-twisted-part-7-understanding.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com2tag:blogger.com,1999:blog-9083221708779697597.post-40929800501086460172010-09-30T11:38:00.000-07:002010-10-14T00:39:47.165-07:00Tools that I find useful with macHere is my list of useful tools on the Mac:<br>
<br>
<a href="http://notational.net/">Notational Velocity</a> is a cool way to keep textual notes.<br>
<br>
I always had the chore of manually deleting archives after extraction; <a href="http://wakaba.c3.cx/s/apps/unarchiver">The Unarchiver</a> helps with that.<br>
<br>
Want to turn your favorite websites into Mac desktop applications? Use <a href="http://fluidapp.com/">Fluid</a>.<br>
<br>
<a href="http://pyfunc.blogspot.com/2010/09/tools-that-i-find-useful-with-mac.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com3tag:blogger.com,1999:blog-9083221708779697597.post-37716292143961462242010-09-27T16:25:00.000-07:002011-12-13T10:23:43.794-08:00Using buildbot for continuous integration developmentContinuous integration, in its simplest form, embodies certain agile tenets: frequent integration of code, and automated verification of the integrated code to give the team continuous feedback on development and reduce the heartburn of large integrations. It also prevents broken builds from silently creeping into the code repository. At the heart of this process is a tool, integrated with the code check-in workflow, that triggers automated testing of frequently checked-in development artifacts.<br>
<br>
This gives the developer immediate feedback and assurance that things are moving in a positive direction.<br>
<blockquote class="left">Buildbot is a "continuous integration" tool.</blockquote>BuildBot can automate the compile/test cycle required by most software projects to validate code changes.<br>
<br>
I had a chance to set it up some time back. What follows is a snippet of that experience: getting it up and running quickly.<br>
<br>
<a href="http://pyfunc.blogspot.com/2010/09/using-buildbot-for-continuos.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com0tag:blogger.com,1999:blog-9083221708779697597.post-22851779187400935712010-09-23T13:39:00.000-07:002010-10-14T00:39:16.996-07:00Ubantu on Mac OSX using VirtualBox<span class="Apple-style-span" style="font-family: 'Lucida Grande'; font-size: small;"><span class="Apple-style-span" style="font-size: 11px;"></span></span><br>
<span class="Apple-style-span" style="font-family: 'Lucida Grande'; font-size: small;"><span class="Apple-style-span" style="font-size: 11px;"><div>I installed Ubuntu on Mac OS X using VirtualBox some time back. The installation went fairly easily, except that I had to figure out how to increase the resolution from the default 800X600.</div><div><br>
</div><div>Here is a step-by-step approach to installing and using VirtualBox.</div><div></div></span></span><a href="http://pyfunc.blogspot.com/2010/09/ubantu-on-mac-osx-using-virtualbox.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com2tag:blogger.com,1999:blog-9083221708779697597.post-45817136266170313862010-09-22T14:48:00.000-07:002011-12-13T10:23:43.802-08:00Python and binary data - Part 3The file operations we normally use are line-oriented:<br>
<pre class="brush: python">FILE = open(filename,"w")
FILE.writelines(linelist)
FILE.close()

FILE = open(filename,"r")
for line in FILE.readlines(): print line
FILE.close()
</pre><br>
We can also use byte-oriented I/O operations on these files.<br>
<pre class="brush: python">FILE = open(filename,"r")
FILE.read(numBytes) # This reads up to numBytes bytes from the file.
</pre>But if the file contains non-textual data, the contents read this way may not be meaningful.<br>
<br>
It is much better to open such a file in binary mode:<br>
<pre class="brush: python">FILE = open(filename,"rb")
FILE.read(numBytes)
</pre><br>
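Once a file is open in binary mode, the struct module (a sketch here, not covered above) converts between Python values and the raw bytes you read or write:

```python
# Write two integers and a float as packed binary data, then read them back.
import os
import struct
import tempfile

record = struct.pack("<iif", 7, 42, 2.5)  # little-endian: int, int, float

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:      # binary mode, as recommended above
    f.write(record)

with open(path, "rb") as f:
    raw = f.read(struct.calcsize("<iif"))  # read exactly one record

print(struct.unpack("<iif", raw))  # (7, 42, 2.5)
```

The format string pins down the byte order and field sizes, so the same record can be read back on any platform.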
<a href="http://pyfunc.blogspot.com/2010/09/python-and-binary-data-part-3.html#more">Read more »</a>Ashish R Vidyarthihttp://www.blogger.com/profile/14497850886027578534noreply@blogger.com55