Wednesday, August 14, 2013

Centralized logging for distributed applications with pyzmq

Simpler distributed applications can take advantage of centralized logging. PyZMQ, a Python bindings for ØMQ provides log handlers for the python logging module and can be easily used for this purpose. Log handlers utilizes ØMQ Pub/Sub pattern and broadcasts log messages through a PUB socket. It is quite easy to construct the message collector and write messages to a central location.
         +-------------+                        |
         |Machine1:App2|                 +---------------+
         +-------------+                          |
                   +-------------+                |

Thursday, June 7, 2012

Ingest data from database into Hadoop with Sqoop (2)

Here, I explore few other variations for importing data from database into HDFS. This is a continuation of previous article..

Previous sqoop command listed were good for one time fetch when you want to import all the current data for a table in database.

A more practical workflow is to fetch data regularly and incrementally into HDFS for analysis. You do not want to skip any previously imported data. For this you have to mark a column for incremental import and also provide an initial value. This column mostly happens to be time-stamp.
sqoop import 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             --table TABLENAME
             --columns "column1,column2,column3,.."
             --target-dir /target/directory/in/hdfs
             -m 1
             --check-column COLUMN3
             --incremental  lastmodified
             --last-value "LAST VALUE"