
Hadoop MapReduce with Cassandra CQL through Pig

One of the main disadvantages of using Pig is that Pig always pulls all the data from Cassandra storage and only afterwards filters it by your criteria. It's easy to imagine the workload if you have hundreds of millions of rows in your CF. For example, in our production environment we always have more than 300 million rows, of which only 20-25 million are unprocessed. When we execute a Pig script, we get more than 5000 map tasks covering all 300 million rows. It's exactly the kind of time-consuming, high-load batch processing we have always tried to avoid, but in vain. It would be very nice if we could use a CQL query with a WHERE clause in Pig scripts to select and filter our data. The benefit is clear: less data consumed, fewer map tasks, and a lighter workload.


This feature is not yet available in the latest release of Cassandra (1.2.6); it is planned for the next version, Cassandra 1.2.7. However, a patch is already available, so with a little effort we can try it out.
First we have to download the Cassandra source code from the 1.2 branch. We also need a configured Hadoop cluster with Pig.
1) Download the Cassandra source code from branch 1.2
git clone -b cassandra-1.2 http://git-wip-us.apache.org/repos/asf/cassandra.git
(I assume we are already familiar with git.)
Then apply the patch fix_where_clause.patch (e.g. with git apply).

Now compile the source code and set up the cluster. For testing purposes I am using my single-node Hadoop 1.1.2 + Cassandra 1.2.7 + Pig 0.11.1 cluster.
2) To set up a single-node cluster, please see here: A single node Hadoop + Cassandra + Pig setup
3) Create a CF as follows:
CREATE TABLE test (
  id text PRIMARY KEY,
  title text,
  age int
);
and insert some dummy data:
insert into test (id, title, age) values('1', 'child', 21);
insert into test (id, title, age) values('2', 'support', 21);
insert into test (id, title, age) values('3', 'manager', 31);
insert into test (id, title, age) values('4', 'QA', 41); 
insert into test (id, title, age) values('5', 'QA', 30); 
insert into test (id, title, age) values('6', 'QA', 30); 
4) Execute the following Pig script:
rows = LOAD 'cql://keyspace1/test?page_size=1&columns=title,age&split_size=4&where_clause=age%3D30' USING CqlStorage();
dump rows;
You should get the following result on the Pig console:
((id,5),(age,30),(title,QA))
((id,6),(age,30),(title,QA))
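Note that the where_clause parameter in the load URL above must be URL-encoded: the `=` in `age=30` appears as `%3D`. As a quick sketch, here is how the full `cql://` URL could be assembled in Python (the helper function is hypothetical, not part of Cassandra or Pig; only the URL format itself comes from CqlStorage):

```python
from urllib.parse import quote

def build_cql_load_url(keyspace, cf, columns, where, page_size=1, split_size=4):
    # Hypothetical helper: assembles a cql:// URL for Pig's CqlStorage.
    # The where_clause value must be percent-encoded, hence quote().
    return ("cql://{0}/{1}?page_size={2}&columns={3}"
            "&split_size={4}&where_clause={5}").format(
        keyspace, cf, page_size, ",".join(columns), split_size, quote(where))

url = build_cql_load_url("keyspace1", "test", ["title", "age"], "age=30")
print(url)
# cql://keyspace1/test?page_size=1&columns=title,age&split_size=4&where_clause=age%3D30
```

The same encoding applies to any operator in the clause (e.g. `>` becomes `%3E`), so it is safer to always pass the raw clause through a percent-encoder than to hand-write the escapes.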
Let's check the Hadoop job history page:

Map input records equals 2.
With this new feature we can use a WHERE clause to select only the data we need from Cassandra storage. You can also check the JIRA issue tracker to drill down further.
All the credit goes to Alex Lui, who implemented this feature.
