
Book Review: Cassandra Design Patterns

This post is my review of the Packt Publishing book Cassandra Design Patterns by Sanjay Sharma. As the title suggests, it is all about the patterns and anti-patterns of using Cassandra. The book has roughly 74 pages covering 6 chapters.
Preface: What this book covers and Who this book is for
The preface starts with the ideas behind the book. The main idea is to help the Cassandra audience understand where and how to use Cassandra correctly and effectively. The preface also provides brief summaries of each of the six chapters and the conventions followed in the book.
Under the section "What you need for this book", the author states that readers don't need any particular version of Cassandra, although Cassandra 2.0 or above is preferred. The "Who this book is for" section specifies the book's audience: architects or developers starting with Cassandra.
Chapter 1: An Overview of Architecture and Data Modeling in Cassandra
In this chapter the author briefly describes the architecture and history of Cassandra. Most books and articles on Cassandra omit this important information. It is really nice to learn how Cassandra combines the best features of two technologies: Google Bigtable for the data model and Amazon Dynamo for scale-out. The author also covers the core Cassandra architecture, how Cassandra handles writes and reads under the hood, consistency levels and much more. On one point I can't agree with the author: Cassandra's read performance. When the data is not in cache, Cassandra is not fast at reading. Every single read needs 2 disk IOPS, which makes reads slow.
In the last part of the chapter the author outlines the features of Cassandra, which is very useful.
Chapter 2: An Overview of Case and Design Patterns
This chapter introduces a few key use cases and design patterns that are discussed in the following chapters. First, the author describes the 3V model and how Cassandra fits into it. The next section covers Cassandra's high availability architecture and compares it with Oracle RDBMS. In the next few sections, the author introduces Cassandra's schema flexibility, counter columns, streaming analytics capability and much more. Like the first chapter, this second chapter gives additional information about Cassandra's strengths and features.
Chapter 3: 3V Patterns
The third chapter covers the fundamental patterns of Cassandra and describes where and how Cassandra should be used. The first part of the chapter shows how Cassandra can handle huge amounts of data to scale a web application. It also explains why vertically scaling an RDBMS such as Oracle or Teradata does not work well. Finally, it describes the pattern's solution, focusing on Cassandra CQL and third-party frameworks that enrich Cassandra.
The second pattern focuses on Cassandra's fast write ability. Cassandra supports parallel writes, where each node in a cluster is responsible for a specific key range, which sets Cassandra apart from a traditional RDBMS. To support the pattern, the author also cites a benchmark from Netflix, which is very impressive and informative.
The last pattern of this chapter describes Cassandra's schemaless feature. This feature comes from Google Bigtable and allows Cassandra to store data in multiple formats. It is one of Cassandra's main advantages over an RDBMS, where online schema changes are traditionally hard to support.
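The idea that each node owns a key range can be sketched in a few lines of Python. This is only a toy model and not Cassandra's actual partitioner: the `TokenRing` class, the node names and the MD5-based token function are all assumptions made for illustration.

```python
import bisect
import hashlib


class TokenRing:
    """Toy token ring: each node owns the key range ending at its token."""

    def __init__(self, nodes):
        # Place every node on the ring at a token derived from its name
        # (hypothetical placement; real clusters assign tokens differently).
        self._ring = sorted((self._token(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value):
        # Hash a string to a 64-bit integer token.
        digest = hashlib.md5(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def owner(self, partition_key):
        # The owner is the first node whose token is >= the key's token,
        # wrapping around to the first node at the end of the ring.
        index = bisect.bisect_left(self._tokens, self._token(partition_key))
        return self._ring[index % len(self._ring)][1]


ring = TokenRing(["node-a", "node-b", "node-c"])
for key in ("user:1", "user:2", "user:3"):
    print(key, "->", ring.owner(key))
```

Because every key maps to exactly one owner without consulting a central coordinator, writes for different keys can go to different nodes at the same time.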
Chapter 4: Core Cassandra Patterns
The fourth chapter starts with Cassandra's fundamental feature: high availability. Cassandra provides a highly available data store with peer-to-peer communication between nodes. This feature gives fine-grained control over how the data is spread and replicated across different data centers. The example with Oracle GoldenGate was very informative and helpful.
The next section deals with time series data in Cassandra. The author successfully explains the term "time series" with examples and provides a solution in CQL pseudocode. The CQL examples make the concepts clear and show how to store and retrieve time series data in Cassandra. A few words about the KairosDB project round off the section.
The last pattern in this chapter explains when and how to use counter columns to keep track of events or content. The pseudocode example makes the concept completely clear and shows the reader how it works under the hood.
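A common way to model time series is to group readings into one partition per source and day, and keep them sorted by timestamp inside the partition. The Python sketch below imitates that layout in memory. It is not the book's CQL; the `TimeSeriesStore` class, the daily bucket and the sensor name are assumptions for illustration.

```python
import bisect
from collections import defaultdict
from datetime import datetime


class TimeSeriesStore:
    """In-memory imitation of a time series table bucketed by source and day."""

    def __init__(self):
        # (source, day) plays the role of the partition key; each partition
        # holds (timestamp, value) rows kept sorted by timestamp.
        self._partitions = defaultdict(list)

    def write(self, source, timestamp, value):
        bucket = (source, timestamp.date())
        bisect.insort(self._partitions[bucket], (timestamp, value))

    def read(self, source, start, end):
        # Range scan within a single day bucket: start <= timestamp < end.
        # For simplicity, start and end are assumed to fall on the same day.
        rows = self._partitions.get((source, start.date()), [])
        low = bisect.bisect_left(rows, (start,))
        high = bisect.bisect_left(rows, (end,))
        return rows[low:high]


store = TimeSeriesStore()
store.write("sensor-1", datetime(2015, 1, 1, 10, 5), 21.5)
store.write("sensor-1", datetime(2015, 1, 1, 10, 0), 21.0)
store.write("sensor-1", datetime(2015, 1, 2, 9, 0), 19.0)
print(store.read("sensor-1", datetime(2015, 1, 1, 10, 0), datetime(2015, 1, 1, 11, 0)))
```

Bucketing by day keeps any single partition from growing without bound, while the sorted order inside a bucket makes time-range reads a simple slice.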
Chapter 5: Search and Analytics Applied Use Case Patterns
Chapter five focuses on two serious topics: data analysis and search. Every enterprise application needs some search capability over its data, whether simple search or complex contextual search. Data analysis, on the other hand, is another business challenge, and it can be very laborious. The first section of this chapter focuses on streaming, or real-time, analytics. The author provides a reference architecture that combines the Storm framework with Cassandra for real-time analytics. With a Storm bolt you can easily produce precomputed values or aggregate data to raise alerts on certain events.
The second section of this chapter is dedicated to enterprise search. Cassandra, like any other database, doesn't support enterprise search out of the box. For enterprise search, anyone can now use the two most popular search engines, Solr and Elasticsearch, both based on Lucene. The author explains what an inverted index is and when not to use Cassandra's secondary indexes.
In the third section, the author focuses on graph analysis. The Cassandra data model is not a good fit for graph analysis, and there are other databases designed for this type of model; Neo4j, for example, is one of the popular graph databases suited to these tasks. But if somebody has to run graph analysis over Cassandra data, the Titan framework can solve this problem.
The final section of this chapter is dedicated to Hadoop and Cassandra integration. This topic is big enough to fill several more volumes. Cassandra provides the ColumnFamilyInputFormat and ColumnFamilyOutputFormat classes, which help to run MapReduce jobs on Hadoop. You can use Pig (a data flow language) to run batch analysis over Cassandra data with Hadoop MapReduce, and you can even use Hive-like queries. The author does not mention other frameworks such as Spark or Presto for fast data analysis.
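For readers who have not met the term, an inverted index maps each term to the set of documents that contain it. The following minimal Python sketch shows the idea; it is not how Lucene implements it, and the function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Cassandra design patterns",
    2: "Cassandra time series data",
    3: "Search with an inverted index",
}
index = build_inverted_index(docs)
print(search(index, "cassandra patterns"))
```

A query only touches the entries for its own terms, which is why full-text search engines answer such lookups without scanning every document.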
Chapter 6: Patterns and Anti-patterns
The last chapter focuses on some additional patterns and anti-patterns, which were very interesting to read. The content/document store pattern is built around the question: which data store should be used as a content/document store? Under the hood Cassandra stores all data as raw bytes, which allows content or documents to be stored as raw bytes in column values. The author also mentions the Astyanax framework, which supports storing and retrieving large objects in chunks.
I found the materialized view pattern very useful and explained in detail. The author introduces two ways of implementing a materialized view in Cassandra: the application-tier-driven materialized view and the analytics-driven materialized view.
The last part of the chapter covers the messaging queue anti-pattern. Many times I have heard people say that they use Cassandra as a messaging queue, and I asked them why. Most of them couldn't answer the question; a few were trying to build a distributed persistent queue with Cassandra. There are Hazelcast and many other products that can serve as a distributed queue. I have to agree with the author that a messaging queue is an anti-pattern for Cassandra.
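Storing a large object in chunks boils down to splitting the bytes into numbered pieces on write and joining them in order on read. The Python sketch below shows only that idea; it is not the Astyanax API, and the function names and chunk size are assumptions for illustration.

```python
def split_into_chunks(data, chunk_size):
    """Split bytes into numbered chunks, one per column value."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return {
        number: data[offset:offset + chunk_size]
        for number, offset in enumerate(range(0, len(data), chunk_size))
    }


def reassemble(chunks):
    """Join the chunks back together in chunk-number order."""
    return b"".join(chunks[number] for number in sorted(chunks))


document = b"abcdefghij"
chunks = split_into_chunks(document, 4)
print(chunks)
print(reassemble(chunks) == document)
```

Keeping each chunk small means no single column value has to hold the whole object, and a reader can fetch the pieces one at a time.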
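One plausible reading of the application-tier-driven variant is that the application itself writes every change both to the base table and to a second table keyed for the query it wants to serve. The Python sketch below imitates that with two dictionaries. The `UserStore` class and its field names are hypothetical and are not taken from the book.

```python
class UserStore:
    """Application code keeps a base table and a query-specific view in sync."""

    def __init__(self):
        self.users_by_id = {}     # base table, keyed by user id
        self.users_by_email = {}  # "materialized view", keyed by email

    def save(self, user_id, email, name):
        previous = self.users_by_id.get(user_id)
        # If the email changed, remove the stale view entry first.
        if previous is not None and previous["email"] != email:
            self.users_by_email.pop(previous["email"], None)
        row = {"id": user_id, "email": email, "name": name}
        self.users_by_id[user_id] = row
        self.users_by_email[email] = row

    def find_by_email(self, email):
        # Served from the view with a single key lookup, no scan needed.
        return self.users_by_email.get(email)


store = UserStore()
store.save(1, "ann@example.com", "Ann")
store.save(1, "ann@example.org", "Ann")
print(store.find_by_email("ann@example.org"))
print(store.find_by_email("ann@example.com"))
```

The price of this approach is that every write path in the application has to remember to update the view as well.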
Summary:
I really enjoyed Cassandra Design Patterns and recommend it to anyone interested in learning about Cassandra. The author touches on most of the main Cassandra topics and explains them in a very accessible way. Today we depend on DataStax for most Cassandra documentation, and documentation about Cassandra is not widely available. This book could be a major source of information when deciding when and how to use Cassandra for real-life problems. Thanks to the author, Sanjay Sharma, for such a nice book, and to Packt Publishing for giving me the chance to review it.
