In this article by Rajanarayanan Thottuvaikkatumana, author of the book Cassandra Design Patterns, Second Edition, the author has discussed how Apache Cassandra is one of the most popular NoSQL data stores. He states this based on the research paper Dynamo: Amazon’s Highly Available Key-Value Store and the research paper Bigtable: A Distributed Storage System for Structured Data. Cassandra is implemented with best features from both of these research papers. In general, NoSQL data stores can be classified into the following groups:
- Key-value data store
- Column family data store
- Document data store
- Graph data store
Cassandra belongs to the column family data store group. Cassandra’s peer-to-peer architecture avoids single point failures in the cluster of Cassandra nodes and gives the ability to distribute the nodes across racks or data centres. This makes Cassandra a linearly scalable data store. In other words, the more processing you need, the more Cassandra nodes you can add to your cluster. Cassandra’s multi data centre support makes it a perfect choice to replicate the data stores across data centres for disaster recovery, high availability, separating transaction processing, analytical environments, and for building resiliency into the data store infrastructure.
Design patterns in Cassandra
The term “design patterns” is a highly misinterpreted term in the software development community. In an extremely general sense, it is a set of solutions for some known problems in quite a specific context. It is used in this book to describe a pattern of using certain features of Cassandra to solve some real-world problems. This book is a collection of such design patterns with real-world examples.
Cassandra is one of the highly successful NoSQL data stores, which is greatly similar to the traditional RDBMS. Cassandra column families (also known as Cassandra tables), in a logical perspective, have a similarity with RDBMS-based tables in the view of the users, even though the underlying structure of these tables are totally different. Because of this, Cassandra is best fit to be deployed along with the traditional RDBMS to solve some of the problems that RDBMS is not able to handle. The caveat here is that because of the similarity of RDBMS tables and Cassandra column families in the view of the end users, many users and data modelers try to use Cassandra in the exact the same way as the RDBMS schema is being modeled, used, and getting into serious deployment issues. How do you prevent such pitfalls? The key here is to understand the differences in a theoretical perspective as well as in a practical perspective, and follow best practices prescribed by the creators of Cassandra.
Where do you start with Cassandra? The best place to look at is the new application development requirements and take it from there. Look at the cases where there is a need to normalize the RDBMS tables and keep all the data items together, which would have got distributed if you were to design the same solution in RDBMS. Instead of thinking from the pure data model perspective, start thinking in terms of the application's perspective. How the data is generated by the application, what are the read requirements, what are the write requirements, what is the response time expected out of some of the use cases, and so on. Depending on these aspects, design the data model. In the big data world, the application becomes the first class citizen and the data model leaves the driving seat in the application design. Design the data model to serve the needs of the applications.
In any organization, new reporting requirements come all the time. The major challenge in to generate reports is the underlying data store. In the RDBMS world, reporting is always a challenge. You may have to join multiple tables to generate even simple reports. Even though the RDBMS objects such as views, stored procedures, and indexes maybe used to get the desired data for the reports, when the report is being generated, the query plan is going to be very complex most of the time. The consumption of processing power is another need to consider when generating such reports on the fly. Because of these complexities, many times, for reporting requirements, it is common to keep separate tables containing data exported from the transactional tables. This is a great opportunity to start with NoSQL stores like Cassandra as a reporting data store.
Data aggregation and summarization are common requirements in any organization. This helps to control the data growth by storing only the summary statistics and moving the transactional data into archives. Many times, this aggregated and summarized data is used for statistical analysis. Making the summary accurate and easily accessible is a big challenge. Most of the time, data aggregation and reporting goes hand in hand. The aggregated data is heavily used in reports. The aggregation process speeds up the queries to a great extent. This is another place where you can start with NoSQL stores like Cassandra.
The coexistence of RDBMS and NoSQL data stores like Cassandra is very much possible, feasible, and sensible; and this is the only way to get started with the NoSQL movement, unless you embark on a totally new product development from scratch. In summary, this section of the book discusses about some design patterns related to de-normalization, reporting, and aggregation of data using Cassandra as the preferred NoSQL data store.
RDBMS migration patterns
A big bang approach to any kind of technology migration is not advisable. A series of deliberations have to happen before the eventual and complete change over. Migration from RDBMS to Cassandra is not different at all. Any new technology replacing an old one must coexist harmoniously, at least for a short period of time. This gives a lot of confidence on the new technology to the stakeholders.
Many technology pundits give various approaches on the RDBMS to NoSQL migration strategies. Many such guidelines are specific to the particular NoSQL data stores giving attention to specific areas, and most of the time, this will end up on the process rather than the technology. The migration from RDBMS to Cassandra is not an easy task. Mainly because the RDBMS-based systems are really time tested and trust worthy in most of the organizations. So, migrating from such a robust RDBMS-based system to Cassandra is not going to be easy for anyone. One of the best approaches to achieve this goal is to exploit some of the new or unique features in Cassandra, which many of the traditional RDBMS don't have. This also prevents the usage of Cassandra just like any other RDBMS. Cassandra is unique. Cassandra is not an RDBMS. The approach of banking on the unique features is not only applicable to the RDBMS to Cassandra migration, but also to any migration from one paradigm to another. Some of the design patterns that are discussed in this section of the book revolve around very simple and important features of Cassandra, but have profound application potential when designing the next generation NoSQL data stores using Cassandra. A wise usage of these unique features in Cassandra will give a head start on the eventual and complete migration from RDBMS.
The modeling of collection objects in RDBMS is a real pain, because multiple tables are to be defined and a join is required to access data. Many RDBMS offer this by providing capability to define user-defined data types, but there is absolutely no standardization at all in this space. Collection objects are very commonly seen in the real-world applications. A list of actions, tuple of related values, set of objects, dictionaries, and things like that come quite often in applications. Cassandra has elegant ways to model this because they are data types in column families.
Counting is a very commonly required process in many business processes and applications. In RDBMS, this has to be modeled as integers or long numbers, but many times, applications make big mistakes in using them in wrong ways. Cassandra has a counter data type in the column family that alleviates this problem.
Getting rid of unwanted records from an RDBMS table is not an automatic process. When some application events occur, they have to be removed by application programs or through some other means. But in many situations, many data items will have a preallocated time to live. They should go away without the intervention of any external events. Cassandra has a way to assign time-to-live (TTL) attribute to data items. By making use of TTL, data items get removed without any other external event's intervention.
All the design patterns covered in this section of the book revolve around some of the new features of Cassandra that will make the migration from RDBMS to Cassandra an easy task.
Cache migration pattern
Database access whether it is from RDBMS or other highly distributed NoSQL data stores is always an input/output (I/O) intensive operation. It makes perfect sense to cache the frequently used, but reasonably static data for fast access for the applications consuming this data. In such situations, the in-memory cache is preferred to the repeated database access for each request. Using cache is not always a pleasant experience. Getting into really weird problems such as data loss, data getting out of sync with its source and other data integrity problems are very common.
It is very common to see wrong components coming into the enterprise solution stack all the time for various reasons. Overlooking on some of the features and adopting the technology without much background work is a very common pitfall. Many a times, the use of cache comes into the solution stack to reduce the latency of the responses. Once the initial results are favorable, more and more data will get tossed into the cache. Slowly, this will become a practice to see that more and more data is getting into cache. Now is the time when problems start popping up one by one. Pure in-memory cache solutions are favored by everybody, by the virtue of its ability to serve the data quickly until you start loosing data. This is because of the faults in the system, along with application and node crashes.
Cache serves data much faster than being served from other data stores. But if the caching solution in use is giving data integrity problems, it is better to migrate to NoSQL data stores like Cassandra. Is Cassandra faster than the in-memory caching solutions? The obvious answer is no. But it is not as bad as many think. Cassandra can be configured to serve fast reads, and bonus comes in the form of high data integrity with strong replication capabilities.
Cache is good as long as it serves its purpose without any data loss or any other data integrity issues. Emphasizing on the use case of the key/value type cache and various methods of cache to NoSQL migration are discussed in this section of the book. Cassandra cannot be used as a replacement for cache in terms of the speed of data access. But when it comes to data integrity, Cassandra shines all the time with its tuneable consistency feature. With a continual tuning and manipulating data with clean and well-written application code, data access can be improved to a great level, and it will be much better than many other data stores.
The design pattern covered in this section of the book gives some guidance on migrating from caching solutions to Cassandra, if this is a must.
When it comes to large-scale Internet applications or web services, popularly known as the Internet of Things (IoT) applications, the number of components are huge and the way they are distributed is beyond imagination. There will be hundreds of application servers, hundreds of data store nodes, and many other components in the whole ecosystem. In such a scenario, for doing an atomic transaction by getting an agreement from all the components involved is, for all practical purposes, impossible. Consistency, availability, and partition tolerance are three important guarantees, popularly known as CAP guarantees that any distributed computing systems should offer even though all is not possible simultaneously.
In the IoT applications, the distribution of the application nodes is unavoidable. This means that the possibility of network partition is pretty much there. So, it is mandatory to give the P guarantee. Now, the question is whether to forfeit the C guarantee or the A guarantee. At this stage, the situation is not as grave as portrayed in the CAP Theorem conjectured by Eric Brewer. For all the use cases in a given IoT application, there is no need of having 100% of C guarantee and 100% of A guarantee. So, depending on the need of the level of A guarantee, the C guarantee can be tuned. In other words, it is called tunable consistency.
Depending on the way data is ingested into Cassandra, and the way it is consumed from Cassandra, tuning is possible to give best results for the appropriate read and write requirements of the applications.
In some applications, the speed at which the data is written will be very high. In other words, the velocity of the data ingestion into Cassandra is very high. This falls into the write-heavy applications. In some applications, the need to read data quickly will be an important requirement. This is mainly needed in the applications where there is a lot of data processing required. Data analytics applications, batch processing applications, and so on fall under this category. These fall into the read-heavy applications. Now, there is a third category of applications where there is an equal importance for fast writes as well as fast reads. These are the kind of applications where there is a constant inflow of data, and at the same time, there is a need to read the data by clients for various purposes. This falls into the read-write balanced applications.
The consistency level requirements for all the previous three types of applications are totally different. There is no one way to tune so that it is optimal for all the three types of applications. All the three applications' consistency levels are to be tuned differently from use case to use case.
In this section of the book, various design patterns related to applications with the needs of fast writes, fast reads, and moderate write and read are discussed. All these design patterns revolve around using the tuneable consistency parameters of Cassandra. Whether it is for write or read and if the consistency levels are set high, the availability levels will be low and vice versa. So, by making use of the consistency level knob, the Cassandra data store can be used for various types of writing and reading use cases.
In any applications, the usage of data that varies over the period of time is called as temporal data, which is very important. Temporal data is needed wherever there is a need to maintain chronology. There are so many applications in which there is a huge need for storage, retrieval, and processing of data that is tied to time.
The biggest challenge in dealing with temporal data stored in a data store is that they are hugely used for analytical purposes and retrieving the data, based on various sort orders in terms of time. So, the data stores that are used to capture the temporal data should be capable of storing the data strictly adhering to the chronology.
There are so many usage patterns that are seen in the real world that fall into showing temporal behavior. For the classification purpose in this book, they are bucketed into three. The first one is the general time series category. The second one is the log category, such as in an audit log, a transaction log, and so on. The third one is the conversation category, such as in the conversation messages of a chat application. There is relevance in this classification, because these are commonly used across in many of the applications. In many of the applications, these are really cross cutting concerns; and designers underestimate this aspect; and finally, many of the applications will have different data stores capturing this temporal data. There is a need to have a common strategy dealing with temporal data that fall in these three commonly seen categories in an enterprise wide solution architecture. In other words, there should be a uniform way of capturing temporal data; there should be a uniform way of processing temporal data; and there should be a commonly used set of tools and libraries to manage the temporal data.
Out of the three design patterns that are discussed in this section of the book, the first Time Series pattern is a general design pattern that covers the most general behavior of any kind of temporal data. The next two design patterns namely Log pattern and Conversation pattern are two special cases of the first design pattern.
This section of the book covers the general nature of temporal data, some specific instances of such data items in the real-world applications, and why Cassandra is the best fit as a NoSQL data store to persist the temporal data. Temporal data comes quite often in many use cases of lots of applications. Data modeling of temporal data is very important in the Cassandra perspective for optimal storage and quick access of the data. Some common design patterns to model temporal data have been covered in this section of the book. By focusing on some very few aspects, such as the partition key, primary key, clustering column and the number of records that gets stored in a wide row of Cassandra, very effective and high performing temporal data models can be built.
The 3Vs of big data namely Volume, Variety, and Velocity pose another big challenge, which is the analysis of the data stored in NoSQL data stores, such as Cassandra. What are the analytics use cases? How can the distributed data be processed? What are the data transformations that are typically seen in the applications? These are the topics covered in this section of the book.
Unlike other sections of this book, the focus is shifted from Cassandra to other technologies like Apache Hadoop, Hadoop MapReduce, and Apache Spark to introduce the big data analytics tool space. The design patterns such as Map/Reduce Pattern and Transformation Pattern are very commonly seen in the data analytics world. Cassandra with Apache Spark has good compatibility, and is a very ideal tool set in the data analysis use cases.
This section of the book covers some data analysis aspects and mainly discusses about data processing. Data transformation is one of the major activity in data processing. Out of the many data processing patterns, Map/Reduce Pattern deserves a special mention, because it is being used in so many batch processing and analysis use cases, dealing with big data. Spark has been chosen as the tool of choice to explain the data processing activities. This section explains how a Map/Reduce kind of data processing task can be done using Cassandra. Spark has also been discussed, which is very powerful to perform online data analysis. This section of the book also covers some of the commonly seen data transformations that are used in the data processing applications.
Many Cassandra design patterns have been covered in this book. If the design patterns are not being used in any real-world applications, it has only theoretical value. To give a practical approach to the applicability of these design patterns, an end-to-end application is taken as a case point and described as the last chapter of the book, which is used as a vehicle to explain the applicability of the Cassandra design patterns discussed in the earlier sections of the book.
Users love Cassandra because of its SQL-like interface CQL. Also, its features are very closely related to the RDBMS even though the paradigm is totally new. Application developers love Cassandra because of the plethora of drivers available in the market so that they can write applications in their preferred programming language. Architects love Cassandra because they can store structured, semi-structured, and unstructured data in it. Database administers love Cassandra because it comes with almost no maintenance overhead. Service managers love Cassandra because of the wonderful monitoring tools available in the market. CIOs love Cassandra because it gives value for their money. And Cassandra works! An application based on Cassandra will be perfect only if its features are used in the right way, and this book is an attempt to guide the Cassandra community in this direction.
Resources for Article:
- Cassandra Architecture [article]
- Getting Up and Running with Cassandra [article]
- Getting Started with Apache Cassandra [article]