Skip to main content

Lesson learned : Hadoop + Cassandra integration

After a few weeks break at last we completed our tuning and configuration cassandra hadoop stack in production. It was exciting and i decided to share our experience with all.
1) Cassandra version >> 1.2 has some problems and doesn't integrate with Hadoop very well. The problem with Map Reduce, when we runs any Map reduce job, it always assigns only one mapper regardless of the amount of data. See here for more detail.
2) If you are going to use Pig for you data analysis, think twice, because Pig always picks up all the data from the Cassandra Storage and only after these it can filter. If you have a billions of rows and only a few millions of then you have to aggregate, then Pig always pick up the billions of rows.Here you can find a compression between Hadoop framework for executing Map reduce.
3) If you are using Pig, filter rows as early as possible. Filter fields like null or empty.
4) When using Pig, try to model your CF slightly different. Use Bucket pattern, store your data by weeks or months, it's better than store all the data in one CF. Consider to use TTL.
5) If you have more than 8GB of heap, consider JVM from IBM JVM9 or Azul JVM
6) Always use separate hard disk (High speed, i.e more than 7200 rpm) for Cassandra commit log.
7) Sizing your hardware carefully, choose between RAID 5 and RAID 10 depends on your need. it's better to create separate LUN for every node.
8) Tune bloom filter, if you have analytical node. See here for more information.
9) Tune you Cassandra CF, use ROW cache. Cassandra row cache is off heap cache as like memcache. Its slightly slower than heap cache but much faster than disk IO.

Disclaimer:
Every experience describe above is my own and it could be differ from any others experiences.
UP1 - Bug already fixed in version 1.2.6


Comments

Popular posts from this blog

8 things every developer should know about the Apache Ignite caching

Any technology, no matter how advanced it is, will not be able to solve your problems if you implement it improperly. Caching, precisely when it comes to the use of a distributed caching, can only accelerate your application with the proper use and configurations of it. From this point of view, Apache Ignite is no different, and there are a few steps to consider before using it in the production environment. In this article, we describe various technics that can help you to plan and adequately use of Apache Ignite as cutting-edge caching technology. Do proper capacity planning before using Ignite cluster. Do paperwork for understanding the size of the cache, number of CPUs or how many JVMs will be required. Let’s assume that you are using Hibernate as an ORM in 10 application servers and wish to use Ignite as an L2 cache. Calculate the total memory usages and the number of Ignite nodes you have to need for maintaining your SLA. An incorrect number of the Ignite nodes can become a b...

Analyse with ANT - a sonar way

After the Javaone conference in Moscow, i have found some free hours to play with Sonar . Here is a quick steps to start analyzing with ANT projects. Sonar provides Analyze with ANT document to play around with ANT, i have just modify some parts. Here is it. 1) Download the Sonar Ant Task and put it in your ${ANT_HOME}/lib directory 2) Modify your ANT build.xml as follows: <?xml version = '1.0' encoding = 'windows-1251'?> <project name="abc" default="build" basedir="."> <!-- Define the Sonar task if this hasn't been done in a common script --> <taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml"> <classpath path="E:\java\ant\1.8\apache-ant-1.8.0\lib" /> </taskdef> <!-- Out-of-the-box those parameters are optional --> <property name="sonar.jdbc.url" value="jdbc:oracle:thin:@xyz/sirius.xyz" /> <property na...

Writing weblogic logs to database table

By default, oracle weblogic server logging service uses an implementation, based on the Java Logging APIs by using the LogMBean.isLog4jLoggingEnabled attribute. With a few effort you can use log4j with weblogic logging service. In the Administration Console, you can specify Log4j or keep the default Java Logging implementation. In this blog i will describe how to configure log4j with weblogic logging service and writes all the logs messages to database table. Most of all cases it's sufficient to writes log on files, however it's better to get all the logs on table to query on it. In our case we have 3 different web logic servers in our project and our consumer need to get all the logs in one central place to diagnose if something goes wrong. First of all we will create a simple table on our oracle database schema and next configure all other parts. Here we go: 1) CREATE TABLE LOGS (USER_ID VARCHAR2(20), DOMAIN varchar2(50), DATED DATE NOT NULL, LOGGER VARCHAR2(500) NOT...