My workspace

Posts

Showing posts from 2014

Book Review: Cassandra Design Patterns

This post is my review of the Packt Publishing book Cassandra Design Patterns by Sanjay Sharma. As the title suggests, it is all about patterns and anti-patterns of using Cassandra. The book has almost 74 pages covering 6 chapters. Preface: What this book covers and Who this book is for. The preface starts with the idea behind the book: helping the Cassandra audience understand where and how to use Cassandra correctly and effectively. The preface also provides brief summaries of each of the six chapters and the conventions followed in the book. Under the section "What you need for this book", the author specifies that readers don't need any particular version of Cassandra, although Cassandra 2.0 or above is preferred. The "Who this book is for" section specifies the audience of the book: architects, or developers starting with Cassandra. Chapter 1: An Overview of Architect...

Elasticsearch with Cassandra data

Sooner or later every enterprise application needs full-text search over its content. Solr and Elasticsearch, both based on Lucene, are among the best candidates for developing enterprise search. Elasticsearch became very popular thanks to its simplicity, but out of the box it doesn't support importing data from a Cassandra cluster. However, Elasticsearch provides rivers: a river is a pluggable service running within the Elasticsearch cluster, pulling data (or being pushed with data) that is then indexed into the cluster. After a little searching I found a cassandra-river on GitHub from eBay; unfortunately, the project was legacy and only supported Cassandra version 1.2*. With a little effort I rewrote the project with the DataStax Cassandra driver. Here you can find the project; it now supports the following features: 1) Cron scheduling; 2) Reading Cassandra rows through paging; 3) Based on DataStax Java driver 2.0. For quick installation, download the project from GitHub. Build with Maven: mvn clean install i...

Interview on PlanetCassandra

Interview on PlanetCassandra.

Ad-hoc analysis over Cassandra data with Facebook Presto

A few days ago I attended the Moscow Cassandra meetup with my presentation, and from one of the participants I heard about Presto, a Facebook project for fast data analysis. I was very curious and hurried to get hands-on with it. From the Presto site: "Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions". Historically Cassandra has lacked interactive ad-hoc queries; CQL doesn't even support any aggregate functions. For this reason, whenever we proposed Cassandra as a database to our customers, they were always confused. However, for analyzing data in Cassandra we have the following frameworks: 1) Hadoop MapReduce 2) Spark and Shark. There are also a few commercial projects like Impala. But Hadoop MapReduce is definitely too slow for ad-hoc queries. Spark is very fast with its RDD data model, but it also needs a few exercises to run q...

3rd Moscow Cassandra meetup

3rd Moscow Cassandra meetup presentation: using Spark and Shark to analyze big data over Cassandra data. 3rd Moscow Cassandra meetup (Fast In-memory Analytics Over Cassandra Data) from shamim_ru

Real time data processing with Cassandra, Part 2

For the last few months I was busy with our new telecommunication project to develop MNP (mobile number portability) for the Russian Federation. Now anybody in Russia can change their telephone operator without changing their number; that's another story for another blog. Today I have found a few hours to keep my promise. In this blog, I will try to describe how to configure and manage a Spark pseudo-cluster with Shark for real-time data processing. In the previous blog I showed how to use Hive with Hadoop to process data from Cassandra. For those who aren't familiar, Spark is an execution engine that supports cyclic data flow and in-memory computing; Shark, on the other hand, is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users. I am going to use the following open source projects to configure and run the Spark + Shark cluster: 1) Scala-2.10.3 2) Spark-0.9.0-incubating-bin-hadoop1 3) Shark-0.9.0...