You're reading from Cassandra High Availability

Product type: Book
Published in: Dec 2014
Publisher: Packt
ISBN-13: 9781783989126
Edition: 1st

Author: Robbie Strickland

Robbie Strickland has been involved in the Apache Cassandra project since 2010, and he initially went to production with the 0.5 release. He has made numerous contributions over the years, including work on drivers for C# and Scala and multiple contributions to the core Cassandra codebase. In 2013 he became the very first certified Cassandra developer, and in 2014 DataStax selected him as an Apache Cassandra MVP. Robbie has been an active speaker and writer in the Cassandra community and is the founder of the Atlanta Cassandra Users Group. Other examples of his writing can be found on the DataStax blog, and he has presented numerous webinars and conference talks over the years.

Chapter 7. Modeling for High Availability

A well-designed data model is central to availability in Cassandra, while a poorly chosen model can substantially handicap your application's resiliency. This idea may seem counterintuitive to those with backgrounds in relational database systems, but this chapter may very well be the most critical one in this book.

It's not that data models are unimportant in relational systems, but they are especially critical when attempting to maintain availability in a large distributed database. In fact, this topic is probably the least understood and most difficult aspect of transitioning to Cassandra.

The data modeling problem is somewhat exacerbated by a familiar SQL-style syntax that can lure unsuspecting users into believing they already understand the necessary principles. In reality, the similarity between CQL and SQL ends with syntax. The underlying data structure is vastly different, and therefore a new approach to designing your data model is required...

How Cassandra stores data


Database systems use a variety of structures to represent data on disk. Most traditional relational systems use a tabular approach, which enables the random access queries these systems support. However, to achieve its hallmark write performance, Cassandra must avoid such random access disk seeks, because random disk I/O tends to be a significant bottleneck. Instead, it employs a log-structured storage engine, which allows it to write data sequentially both to a commit log and to its permanent on-disk structure, the SSTable.

Implications of a log-structured storage

When a write is received, it is written simultaneously to the commit log and to a memtable. Note that the commit log is what provides durability of writes in Cassandra. Memtables are then periodically flushed to disk in the form of immutable SSTables.

This storage scheme has several important implications related to data modeling:

  • Writes are immutable. Since writes are always essentially...
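
To illustrate the first implication, consider that an UPDATE in CQL never modifies data in place; it simply writes a new cell that supersedes the old one at read time. The following is a minimal sketch using a hypothetical table (the table and values are for illustration only):

```cql
-- Hypothetical table for illustration
CREATE TABLE user_status (
  username text PRIMARY KEY,
  status text
);

-- Both statements result in a new sequential write; nothing is modified in place
INSERT INTO user_status (username, status) VALUES ('rstrickland', 'online');
UPDATE user_status SET status = 'away' WHERE username = 'rstrickland';

-- At read time, the cell with the latest timestamp wins,
-- even if the two versions live in different SSTables
SELECT status FROM user_status WHERE username = 'rstrickland';
```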

Understanding compaction


Cassandra deals with this build-up of SSTables over time by means of a process called compaction. Compaction aggregates rows from multiple files into a single file, and in the process it removes old data and purges tombstones. However, housekeeping is only one reason to do this; the other objective is to improve read performance by moving data for a given key into a single SSTable, thereby reducing the disk I/O required to read each key.

The exact mechanism that governs the compaction process depends on which compaction strategy you choose. There are three strategies that currently ship with Cassandra (although you can implement your own):

  • Size-tiered compaction: This strategy causes SSTables to be compacted when there are multiple files of a similar size (the default is four). In update-heavy workloads, a row may exist in many SSTables at once, resulting in reduced read performance.

  • Leveled compaction: This strategy assigns SSTables to levels, where each level represents...
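
The compaction strategy is configured per table. As a hedged sketch (class names and option names as they ship with Cassandra; defaults and available options vary by version):

```cql
-- Size-tiered is the default; min_threshold controls how many
-- similarly sized SSTables must accumulate before a compaction runs
ALTER TABLE authors WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': '4'
};

-- Leveled compaction can be chosen instead for read-heavy or
-- update-heavy workloads; sstable_size_in_mb sets the target file size
ALTER TABLE authors WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': '160'
};
```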

CQL under the hood


At this point, most users should be aware that CQL has replaced Thrift as the standard (and therefore recommended) interface to work with Cassandra. However, it remains largely misunderstood, as its resemblance to common SQL has left both Thrift veterans and Cassandra newcomers confused about how it translates to the underlying storage layer. This fog must be lifted if you hope to create data models that scale, perform, and ensure availability.

As we begin this section, it is important to understand that the CQL data representation does not always match the underlying storage structure. This can be challenging for those accustomed to Thrift-based operations, as those were performed directly against the storage layer. However, CQL introduces an abstraction on top of the storage rows and only maps directly in the simplest of schemas.
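
As a small illustration of that abstraction, consider the following hypothetical table; the storage view shown in the comments is deliberately simplified:

```cql
CREATE TABLE ratings (
  title text,
  reviewer text,
  rating int,
  PRIMARY KEY (title, reviewer)
);

-- Two CQL rows, one per (title, reviewer) pair...
INSERT INTO ratings (title, reviewer, rating) VALUES ('Dune', 'alice', 5);
INSERT INTO ratings (title, reviewer, rating) VALUES ('Dune', 'bob', 4);

-- ...but a single storage row keyed by the partition key 'Dune',
-- with cells named by the clustering column values, roughly:
--   'Dune' -> { 'alice:rating': 5, 'bob:rating': 4 }
```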

Tip

If you want to be successful at modeling and querying data in Cassandra, keep in mind that while CQL improves the learning curve, it is not...

Understanding queries


In order to make sense of the various types of queries, we will start with a common data model to be used across the following examples. For this data model, we will return to the authors table, with name as the partition key, followed by year and title as clustering columns. We'll also sort the year in descending order. This table can be created as follows:

CREATE TABLE authors (
  name text,
  year int,
  title text,
  isbn text,
  publisher text,
  PRIMARY KEY (name, year, title)
) WITH CLUSTERING ORDER BY (year DESC);

Also, for the purpose of these examples, we will assume a replication factor of three and consistency level of QUORUM.

Query by key

We'll start with a basic query by key:

SELECT * FROM authors WHERE name = 'Tom Clancy';

For this simple select, the query makes the request to the coordinator node, which in this case owns a replica for our key. The coordinator then retrieves the row from another replica node to satisfy the quorum. Thus, we need a total of two...
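
Given the authors schema above, queries must also respect the clustering structure. A few examples of what this table can and cannot serve:

```cql
-- Restricting the partition key plus a clustering-column prefix works:
SELECT * FROM authors WHERE name = 'Tom Clancy' AND year = 1994;

-- Range queries on the first clustering column are also valid:
SELECT * FROM authors
WHERE name = 'Tom Clancy' AND year >= 1990 AND year <= 1999;

-- But skipping year to filter on title alone is rejected,
-- because title is not a prefix of the clustering key:
-- SELECT * FROM authors WHERE name = 'Tom Clancy' AND title = 'Without Remorse';
```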

How collections are stored


The introduction of collections to CQL addresses some of the concerns that frequently arose regarding Cassandra's primitive data model. They add richer capabilities that give developers more flexibility when modeling certain types of data.

Cassandra supports three collection types: sets, lists, and maps. In this section, we will examine each of these and take a look at how they're stored under the hood. But first, it's important to understand some basic rules regarding collections:

  • The size of each item in a collection must not be more than 64 KB

  • A maximum of 64,000 items may be stored in a single collection

  • Querying a collection always returns the entire collection

  • Collections are best used for relatively small, bounded datasets

With those rules in mind, we can examine each type of collection in detail, starting with sets.

Sets

A set in CQL is very similar to a set in your favorite programming language. It is a unique collection of items. This means that it does not allow...
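
As a brief sketch, here is a hypothetical table using a set, along with the CQL syntax for adding and removing elements:

```cql
-- A hypothetical variation on the authors example using a set
CREATE TABLE author_genres (
  name text PRIMARY KEY,
  genres set<text>
);

-- Adding and removing elements; duplicate additions are silently ignored
UPDATE author_genres SET genres = genres + {'thriller'} WHERE name = 'Tom Clancy';
UPDATE author_genres SET genres = genres + {'thriller', 'espionage'} WHERE name = 'Tom Clancy';
UPDATE author_genres SET genres = genres - {'espionage'} WHERE name = 'Tom Clancy';

-- Reading always returns the entire collection; here, {'thriller'}
SELECT genres FROM author_genres WHERE name = 'Tom Clancy';
```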

Working with time-series data


For most of the last couple of decades, data modeling has centered on the relationships among various entities. A person has one account but one or more phone numbers. That same person has one or more addresses (such as home and work). A person can belong to one or more groups, which can in turn contain many people.

We modeled these relationships using foreign keys and join tables, and we built queries by joining multiple tables together to produce the desired result. However, in recent years, we introduced another dimension to our data: time. Now we're interested in more than just how entities are connected, but how their relationships change over time. For example, while we previously were concerned only about a set of fixed locations associated with a person, we now have mobile phones with GPS radios in pockets and purses all over the world. This makes it possible to produce a timeline of a person's movements by marrying time and location.
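
A timeline like the one just described maps naturally onto a clustered table. The following is a hypothetical sketch (table and column names are assumptions for illustration):

```cql
-- A hypothetical table capturing a timeline of a person's locations
CREATE TABLE location_history (
  person_id uuid,
  recorded_at timestamp,
  latitude double,
  longitude double,
  PRIMARY KEY (person_id, recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);

-- The most recent positions come back first for a given person:
SELECT recorded_at, latitude, longitude
FROM location_history
WHERE person_id = 62c36092-82a1-3a00-93d1-46196ee77204
LIMIT 10;
```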

Introducing time...

Designing for immutability


An interesting and important difference between modeling relationships versus modeling time-series data is that relational data tends to be mutable whereas time-series data is generally immutable. Mutable data is unstable because it may change at any moment. This makes it more complicated to guarantee we have the most up-to-date version. Immutable data, by contrast, is stable, which means we can avoid many of the complexities associated with data that can change over time.

Tip

If you find yourself struggling with modeling a particular problem in Cassandra, consider reimagining the model as immutable time-series data. This strategy often results in an obvious solution to what appeared to be an intractable problem.

Immutability is a desirable property in a Cassandra data model as updates and deletes can add complexity related to consistency and performance (remember that SSTables are immutable). Often the easiest way to guarantee immutability is to simply add a time...
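
As a sketch of this idea (a hypothetical pair of schemas), adding a timestamp to the primary key turns what would be an in-place update into an append:

```cql
-- Mutable: each new reading overwrites the previous one
CREATE TABLE sensor_state (
  sensor_id uuid PRIMARY KEY,
  temperature double
);

-- Immutable: each reading becomes a new row, keyed by time
CREATE TABLE sensor_readings (
  sensor_id uuid,
  read_at timestamp,
  temperature double,
  PRIMARY KEY (sensor_id, read_at)
) WITH CLUSTERING ORDER BY (read_at DESC);
```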

Working with geospatial data


Another very common use of Cassandra is to store and query geospatial data. Typically, the objective with this type of data is to find points near a given location. The challenge is to find a key that can be used to narrow down the potential list of locations, and to avoid querying many keys at once.

While there is more than one possible data structure that can be used for this purpose, geohashing has a number of benefits that make it worth considering. A geohash is a base 32 representation of a geographic area, where each additional digit represents greater precision. The property of geohashes that makes them particularly suited for geospatial searches is that adding a level of precision to a given geohash results in an area contained within the lower-precision value.

We can visualize this using the following diagram, which shows a geohash, dnh03, with a number of more precise geohashes contained within it. All of the smaller geohashes begin with the dnh03 prefix...
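
This prefix property suggests a straightforward partitioning scheme. The following is a hypothetical sketch (table and column names are assumptions, not a prescribed design):

```cql
-- A hypothetical table using a truncated geohash as the partition key
CREATE TABLE points_of_interest (
  geohash_prefix text,   -- e.g. the 5-character prefix 'dnh03'
  geohash text,          -- the full-precision geohash
  name text,
  PRIMARY KEY (geohash_prefix, geohash, name)
);

-- Because finer geohashes share their coarser prefix, all points
-- inside the dnh03 cell can be fetched with a single-partition query:
SELECT name, geohash FROM points_of_interest WHERE geohash_prefix = 'dnh03';
```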

Summary


In this chapter, we laid a general foundation for data modeling that should give you the tools you need to correctly reason about your specific use cases. We covered a lot of ground, including Cassandra's storage engine and how your CQL gets translated to that underlying model, as well as a guide for modeling time-series and geospatial data.

However, there are also a number of mistakes people commonly make when modeling data for Cassandra, and we will discuss these in the next chapter. Be sure to read on so you can avoid these common pitfalls.
