You're reading from Cassandra High Availability

Product type: Book
Published in: Dec 2014
Publisher: Packt
ISBN-13: 9781783989126
Edition: 1st

Author: Robbie Strickland

Robbie Strickland has been involved in the Apache Cassandra project since 2010, and he initially went to production with the 0.5 release. He has made numerous contributions over the years, including work on drivers for C# and Scala and multiple contributions to the core Cassandra codebase. In 2013 he became the very first certified Cassandra developer, and in 2014 DataStax selected him as an Apache Cassandra MVP. Robbie has been an active speaker and writer in the Cassandra community and is the founder of the Atlanta Cassandra Users Group. Other examples of his writing can be found on the DataStax blog, and he has presented numerous webinars and conference talks over the years.

Chapter 7. Modeling for High Availability

A well-designed data model is central to availability in Cassandra, while a poorly chosen model can substantially handicap your application's resiliency. This idea may seem counterintuitive to those with backgrounds in relational database systems, but this chapter may very well be the most critical one in this book.

It's not that data models are unimportant in relational systems, but they are especially critical when attempting to maintain availability in a large distributed database. In fact, this topic is probably the least understood and most difficult aspect of transitioning to Cassandra.

The data modeling problem is somewhat exacerbated by a familiar SQL-style syntax that can lure unsuspecting users into believing they already understand the necessary principles. In reality, the similarity between CQL and SQL ends with syntax. The underlying data structure is vastly different, and therefore a new approach to designing your data model is required...

How Cassandra stores data


Database systems use a variety of structures to represent data on disk. Most traditional relational systems use a tabular approach, which enables the random access queries these systems support. However, to achieve its hallmark write performance, Cassandra must avoid such random access disk seeks, because random disk I/O tends to be a significant bottleneck. Instead, it employs a log-structured storage engine, which allows it to write data sequentially both to a commit log and to its permanent on-disk structure, the SSTable.

Implications of a log-structured storage

When a write is received, it is written simultaneously to the commit log and to a memtable. Note that the commit log is what provides durability of writes in Cassandra. Memtables are then periodically flushed to disk in the form of immutable SSTables.

This storage scheme has several important implications related to data modeling:

  • Writes are immutable. Since writes are always essentially...
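
To illustrate the first implication, consider that an UPDATE in CQL never modifies data in place; it simply writes a new cell that supersedes the old one at read time. The following is a minimal sketch using a hypothetical table (the table and values are for illustration only):

```cql
-- Hypothetical table for illustration
CREATE TABLE user_status (
  username text PRIMARY KEY,
  status text
);

-- Both statements result in a new sequential write; nothing is modified in place
INSERT INTO user_status (username, status) VALUES ('rstrickland', 'online');
UPDATE user_status SET status = 'away' WHERE username = 'rstrickland';

-- At read time, the cell with the latest timestamp wins,
-- even if the two versions live in different SSTables
SELECT status FROM user_status WHERE username = 'rstrickland';
```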

Understanding compaction


Cassandra deals with this build-up of SSTables over time by means of a process called compaction. Compaction aggregates rows from multiple files into a single file, and in the process it removes old data and purges tombstones. However, housekeeping is only one reason to do this; the other objective is to improve read performance by moving data for a given key into a single SSTable, thereby reducing the disk I/O required to read each key.

The exact mechanism that governs the compaction process depends on which compaction strategy you choose. There are three strategies that currently ship with Cassandra (although you can implement your own):

  • Size-tiered compaction: This strategy causes SSTables to be compacted when there are multiple files of a similar size (the default is four). In update-heavy workloads, a row may exist in many SSTables at once, resulting in reduced read performance.

  • Leveled compaction: This strategy assigns SSTables to levels, where each level represents...
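
The compaction strategy is configured per table. As a hedged sketch (class names and option names as they ship with Cassandra; defaults and available options vary by version):

```cql
-- Size-tiered is the default; min_threshold controls how many
-- similarly sized SSTables must accumulate before a compaction runs
ALTER TABLE authors WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': '4'
};

-- Leveled compaction can be chosen instead for read-heavy or
-- update-heavy workloads; sstable_size_in_mb sets the target file size
ALTER TABLE authors WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': '160'
};
```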

CQL under the hood


At this point, most users should be aware that CQL has replaced Thrift as the standard (and therefore recommended) interface to work with Cassandra. However, it remains largely misunderstood, as its resemblance to common SQL has left both Thrift veterans and Cassandra newcomers confused about how it translates to the underlying storage layer. This fog must be lifted if you hope to create data models that scale, perform, and ensure availability.

As we begin this section, it is important to understand that the CQL data representation does not always match the underlying storage structure. This can be challenging for those accustomed to Thrift-based operations, as those were performed directly against the storage layer. However, CQL introduces an abstraction on top of the storage rows and only maps directly in the simplest of schemas.
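
As a small illustration of that abstraction, consider the following hypothetical table; the storage view shown in the comments is deliberately simplified:

```cql
CREATE TABLE ratings (
  title text,
  reviewer text,
  rating int,
  PRIMARY KEY (title, reviewer)
);

-- Two CQL rows, one per (title, reviewer) pair...
INSERT INTO ratings (title, reviewer, rating) VALUES ('Dune', 'alice', 5);
INSERT INTO ratings (title, reviewer, rating) VALUES ('Dune', 'bob', 4);

-- ...but a single storage row keyed by the partition key 'Dune',
-- with cells named by the clustering column values, roughly:
--   'Dune' -> { 'alice:rating': 5, 'bob:rating': 4 }
```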

Tip

If you want to be successful at modeling and querying data in Cassandra, keep in mind that while CQL improves the learning curve, it is not...

Understanding queries


In order to make sense of the various types of queries, we will start with a common data model to be used across the following examples. For this data model, we will return to the authors table, with name as the partition key, followed by year and title as clustering columns. We'll also sort the year in descending order. This table can be created as follows:

CREATE TABLE authors (
  name text,
  year int,
  title text,
  isbn text,
  publisher text,
  PRIMARY KEY (name, year, title)
) WITH CLUSTERING ORDER BY (year DESC);

Also, for the purpose of these examples, we will assume a replication factor of three and consistency level of QUORUM.

Query by key

We'll start with a basic query by key:

SELECT * FROM authors WHERE name = 'Tom Clancy';

For this simple select, the query makes the request to the coordinator node, which in this case owns a replica for our key. The coordinator then retrieves the row from another replica node to satisfy the quorum. Thus, we need a total of two...
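
Given the authors schema above, queries must also respect the clustering structure. A few examples of what this table can and cannot serve:

```cql
-- Restricting the partition key plus a clustering-column prefix works:
SELECT * FROM authors WHERE name = 'Tom Clancy' AND year = 1994;

-- Range queries on the first clustering column are also valid:
SELECT * FROM authors
WHERE name = 'Tom Clancy' AND year >= 1990 AND year <= 1999;

-- But skipping year to filter on title alone is rejected,
-- because title is not a prefix of the clustering key:
-- SELECT * FROM authors WHERE name = 'Tom Clancy' AND title = 'Without Remorse';
```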

How collections are stored


The introduction of collections to CQL addresses some of the concerns that frequently arose regarding Cassandra's primitive data model. They add richer capabilities that give developers more flexibility when modeling certain types of data.

Cassandra supports three collection types: sets, lists, and maps. In this section, we will examine each of these and take a look at how they're stored under the hood. But first, it's important to understand some basic rules regarding collections:

  • The size of each item in a collection must not be more than 64 KB

  • A maximum of 64,000 items may be stored in a single collection

  • Querying a collection always returns the entire collection

  • Collections are best used for relatively small, bounded datasets

With those rules in mind, we can examine each type of collection in detail, starting with sets.

Sets

A set in CQL is very similar to a set in your favorite programming language. It is a unique collection of items. This means that it does not allow...
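
As a brief sketch, here is a hypothetical table using a set, along with the CQL syntax for adding and removing elements:

```cql
-- A hypothetical variation on the authors example using a set
CREATE TABLE author_genres (
  name text PRIMARY KEY,
  genres set<text>
);

-- Adding and removing elements; duplicate additions are silently ignored
UPDATE author_genres SET genres = genres + {'thriller'} WHERE name = 'Tom Clancy';
UPDATE author_genres SET genres = genres + {'thriller', 'espionage'} WHERE name = 'Tom Clancy';
UPDATE author_genres SET genres = genres - {'espionage'} WHERE name = 'Tom Clancy';

-- Reading always returns the entire collection; here, {'thriller'}
SELECT genres FROM author_genres WHERE name = 'Tom Clancy';
```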

Working with time-series data


For most of the last couple of decades, data modeling has centered on the relationships among various entities. A person has one account but one or more phone numbers. That same person has one or more addresses (such as home and work). A person can belong to one or more groups, which can in turn contain many people.

We modeled these relationships using foreign keys and join tables, and we built queries by joining multiple tables together to produce the desired result. However, in recent years, we introduced another dimension to our data: time. Now we're interested in more than just how entities are connected, but how their relationships change over time. For example, while we previously were concerned only about a set of fixed locations associated with a person, we now have mobile phones with GPS radios in pockets and purses all over the world. This makes it possible to produce a timeline of a person's movements by marrying time and location.
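
A timeline like the one just described maps naturally onto a clustered table. The following is a hypothetical sketch (table and column names are assumptions for illustration):

```cql
-- A hypothetical table capturing a timeline of a person's locations
CREATE TABLE location_history (
  person_id uuid,
  recorded_at timestamp,
  latitude double,
  longitude double,
  PRIMARY KEY (person_id, recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);

-- The most recent positions come back first for a given person:
SELECT recorded_at, latitude, longitude
FROM location_history
WHERE person_id = 62c36092-82a1-3a00-93d1-46196ee77204
LIMIT 10;
```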

Introducing time...

Designing for immutability


An interesting and important difference between modeling relationships versus modeling time-series data is that relational data tends to be mutable whereas time-series data is generally immutable. Mutable data is unstable because it may change at any moment. This makes it more complicated to guarantee we have the most up-to-date version. Immutable data, by contrast, is stable, which means we can avoid many of the complexities associated with data that can change over time.

Tip

If you find yourself struggling with modeling a particular problem in Cassandra, consider reimagining the model as immutable time-series data. This strategy often results in an obvious solution to what appeared to be an intractable problem.

Immutability is a desirable property in a Cassandra data model as updates and deletes can add complexity related to consistency and performance (remember that SSTables are immutable). Often the easiest way to guarantee immutability is to simply add a time...
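
As a sketch of this idea (a hypothetical pair of schemas), adding a timestamp to the primary key turns what would be an in-place update into an append:

```cql
-- Mutable: each new reading overwrites the previous one
CREATE TABLE sensor_state (
  sensor_id uuid PRIMARY KEY,
  temperature double
);

-- Immutable: each reading becomes a new row, keyed by time
CREATE TABLE sensor_readings (
  sensor_id uuid,
  read_at timestamp,
  temperature double,
  PRIMARY KEY (sensor_id, read_at)
) WITH CLUSTERING ORDER BY (read_at DESC);
```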

Working with geospatial data


Another very common use of Cassandra is to store and query geospatial data. Typically, the objective with this type of data is to find points near a given location. The challenge is to find a key that can be used to narrow down the potential list of locations, and to avoid querying many keys at once.

While there is more than one possible data structure that can be used for this purpose, geohashing has a number of benefits that make it worth considering. A geohash is a base 32 representation of a geographic area, where each additional digit represents greater precision. The property of geohashes that makes them particularly suited for geospatial searches is that adding a level of precision to a given geohash results in an area contained within the lower-precision value.

We can visualize this using the following diagram, which shows a geohash, dnh03, with a number of more precise geohashes contained within it. All of the smaller geohashes begin with the dnh03 prefix...
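
This prefix property suggests a straightforward partitioning scheme. The following is a hypothetical sketch (table and column names are assumptions, not a prescribed design):

```cql
-- A hypothetical table using a truncated geohash as the partition key
CREATE TABLE points_of_interest (
  geohash_prefix text,   -- e.g. the 5-character prefix 'dnh03'
  geohash text,          -- the full-precision geohash
  name text,
  PRIMARY KEY (geohash_prefix, geohash, name)
);

-- Because finer geohashes share their coarser prefix, all points
-- inside the dnh03 cell can be fetched with a single-partition query:
SELECT name, geohash FROM points_of_interest WHERE geohash_prefix = 'dnh03';
```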

Summary


In this chapter, we laid a general foundation for data modeling that should give you the tools you need to correctly reason about your specific use cases. We covered a lot of ground, including Cassandra's storage engine and how your CQL gets translated to that underlying model, as well as a guide for modeling time-series and geospatial data.

However, there are also a number of mistakes people commonly make when modeling data for Cassandra, and we will discuss these in the next chapter. Be sure to read on so you can avoid these common pitfalls.
