You're reading from Learning Apache Cassandra - Second Edition

Product type: Book
Published: Apr 2017
ISBN-13: 9781787127296
Pages: 360
Edition: 2nd Edition
Languages: Java
Concepts: Databases

Chapter 3. Organizing Related Data

In Chapter 2, The First Table, we created our first table, which stores user accounts. We discussed how to insert data into the table and how to retrieve it. However, we also encountered several significant limitations in the tasks we can perform with the table we created.

In this chapter, we'll introduce the concept of compound primary keys, which are simply primary keys comprising more than one column. Although this might, at first glance, seem like a trivial addition to our understanding of Cassandra tables, a table with a compound primary key is in fact a considerably richer data structure that opens up substantial new data access patterns.

Our introduction to compound primary keys will help us build a table that stores a timeline of users' status updates. In this chapter, we'll focus on defining the table and understanding how it works. Chapter 4, Beyond Key-Value Lookup, will introduce new patterns to query compound primary key tables.

We'll explore...

A table for status updates

In the MyStatus application, we'll begin by creating a timeline of status updates for each user. Users can view their friends' status updates by accessing the timeline of the friend in question.

The user timeline requires a new level of organization, which we didn't see in the users table that we created in the previous chapter. Specifically, we have two requirements:

Rows (individual status updates) should be logically grouped by a certain property (the user who created the update)
Rows should be accessible in a sorted order (in this case, by creation date)

Fortunately, compound primary keys provide exactly these qualities.

Creating a table with a compound primary key

The syntax for creating tables with compound primary keys is a bit different from the single-column primary key syntax we saw in the previous chapter. We create a user_status_updates table with a compound primary key, as follows:

CREATE TABLE "user_status_updates" ( 
  "username" text, 
  "id" timeuuid...

Working with status updates

Now that we've got our status updates table ready, let's create our first status update:

INSERT INTO "user_status_updates" 
("username", "id", "body") 
VALUES ( 
  'alice', 
  76e7a4d0-e796-11e3-90ce-5f98e903bf02, 
  'Learning Cassandra!' 
);

This will look pretty familiar; we specify the table we want to insert data into, the list of columns we're going to provide data for, and the values for these columns in the given order.

Let's give bob a status update too by inserting the following row in the user_status_updates table:

INSERT INTO "user_status_updates" 
("username", "id", "body") 
VALUES ( 
  'bob', 
  97719c50-e797-11e3-90ce-5f98e903bf02, 
  'Eating a tasty sandwich.' 
);

Now we have two rows, each identified by the combination of the username and id columns. Let's take a look at the contents of our table using the following SELECT statement:

SELECT * FROM "user_status_updates";

We'll be able to see the two rows that we inserted, as follows:

Note that, as we saw...

Anatomy of a compound primary key

At this point, it's clear that there's some nuance in the compound primary key that we're missing. Both the username column and the id column affect the order in which rows are returned; however, while the actual ordering of username is opaque, the ordering of id is meaningfully related to the information encoded in the id column.
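
The id values used in the INSERT statements above are version 1 UUIDs, which embed a creation timestamp. As a rough illustration outside of Cassandra, the following Python sketch decodes that timestamp using only the standard library. The helper name timeuuid_to_datetime is our own invention for this example and is not part of any Cassandra driver.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100-nanosecond intervals from this date.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def timeuuid_to_datetime(value):
    """Return the creation time embedded in a version 1 UUID string.

    Hypothetical helper written for this illustration.
    """
    parsed = uuid.UUID(value)
    if parsed.version != 1:
        raise ValueError("not a version 1 (time-based) UUID: " + value)
    # parsed.time is in 100 ns units; integer-divide by 10 for microseconds.
    return GREGORIAN_EPOCH + timedelta(microseconds=parsed.time // 10)


# The two ids inserted earlier in this chapter.
alice_id = "76e7a4d0-e796-11e3-90ce-5f98e903bf02"
bob_id = "97719c50-e797-11e3-90ce-5f98e903bf02"

for name, status_id in (("alice", alice_id), ("bob", bob_id)):
    print(name, timeuuid_to_datetime(status_id).isoformat())
```

Comparing the two decoded values shows that alice's update was created before bob's, which is the kind of meaningful ordering that the id column carries.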

In the lexicon of Cassandra, username is a partition key. A table's partition key groups rows together into logically related bundles. In the case of our MyStatus application, each user's timeline is a self-contained data structure, so partitioning the table by user is a sound strategy.

We call the id column a clustering column. The job of a clustering column is to determine the ordering of rows within a partition. This is why we observed that within each user's status updates, the rows were returned in a strictly ascending order by timestamp of the id. This is a very useful property since our application will want to display status...

Beyond two columns

We've now seen a table with two columns in its primary key: a partition key and a clustering column. As it turns out, neither of these roles is limited to a single column. A table can define one or more partition key columns and zero or more clustering columns.
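
To make the rule concrete, here is a small Python sketch that describes the primary keys of the two tables we have built so far. The PrimaryKey class is a hypothetical construct for illustration only; it does not correspond to anything in Cassandra or its drivers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimaryKey:
    """Column names that make up a table's primary key (illustrative only)."""

    partition_key: tuple
    clustering_columns: tuple = ()

    def __post_init__(self):
        # One or more partition key columns are required;
        # clustering columns are optional.
        if not self.partition_key:
            raise ValueError("a primary key needs at least one partition key column")

    def columns(self):
        """All primary key columns, partition key columns first."""
        return self.partition_key + self.clustering_columns


# users: a single partition key column and no clustering columns.
users_key = PrimaryKey(partition_key=("username",))

# user_status_updates: one partition key column, one clustering column.
status_updates_key = PrimaryKey(
    partition_key=("username",),
    clustering_columns=("id",),
)

print(users_key.columns())
print(status_updates_key.columns())
```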

Multiple clustering columns

As noted earlier, a table is not limited to a single clustering column. Let's take a look at how multiple clustering columns work together to order data. To illustrate this, we will create a new version of our status updates table that is clustered by the date and time at which the user updated their status:

CREATE TABLE "user_status_updates_by_datetime" ( 
  "username" text, 
  "status_date" date, 
  "status_time" time, 
  "body" text, 
  PRIMARY KEY ("username", "status_date", "status_time") 
);

The new user_status_updates_by_datetime table is structured as follows:

Partition key: username, which is a text field.
Clustering columns: status_date and status_time. Rows for a particular username are...
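
With more than one clustering column, rows within a partition are ordered by the first clustering column, and ties are broken by the next one. Python's tuple comparison behaves in a similar way, so the sketch below uses it as an analogy. The sample rows and the function names are made up for illustration and do not come from the table's actual contents.

```python
from datetime import date, time

# Hypothetical rows for the user_status_updates_by_datetime table.
rows = [
    {"username": "alice", "status_date": date(2017, 4, 2),
     "status_time": time(8, 15), "body": "Second day, morning"},
    {"username": "alice", "status_date": date(2017, 4, 1),
     "status_time": time(18, 30), "body": "First day, evening"},
    {"username": "bob", "status_date": date(2017, 4, 1),
     "status_time": time(12, 0), "body": "Lunch"},
    {"username": "alice", "status_date": date(2017, 4, 1),
     "status_time": time(9, 0), "body": "First day, morning"},
]


def clustering_key(row):
    """Order by status_date first, then by status_time."""
    return (row["status_date"], row["status_time"])


def rows_in_partition(all_rows, username):
    """Return one user's rows in ascending clustering order."""
    matching = (r for r in all_rows if r["username"] == username)
    return sorted(matching, key=clustering_key)


for row in rows_in_partition(rows, "alice"):
    print(row["status_date"], row["status_time"], row["body"])
```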

Compound keys represent parent-child relationships

In the What is Cassandra and why Cassandra? section of Chapter 1, Getting Up and Running with Cassandra, you learned that Cassandra is not a relational database despite some surface similarities. Specifically, this means that Cassandra does not have a built-in concept of the relationships between data in different tables, nor any mechanism for describing or traversing them. There are no foreign key constraints, and there's no JOIN clause available in SELECT statements; in fact, there is no way to read from multiple tables in the same query. Relational databases, by contrast, are designed to explicitly account for the relationships between data in different tables, whether they're one-to-one, one-to-many, or many-to-many.

That being said, Cassandra's compound primary key structure lends itself well to one particular kind of relationship: the parent-child relationship. This is a specific type of one...

Coupling parents and children using static columns

The parent-child relationships we've encoded in our schema thus far are implicit in the structure of the primary keys but not explicit from Cassandra's standpoint. While we know that the user_status_updates.username column corresponds to the parent primary key users.username, Cassandra itself has no concept of the relationship between the two.

In a relational database, we might make the relationship explicit in the schema using foreign key constraints, but Cassandra doesn't offer anything like this. In fact, if we want to use two different tables for users and user_status_updates, there isn't anything we can do to explicitly encode their relationship in the database schema. However, there is a way to combine user profiles and status updates into a single table while still maintaining the one-to-many relationship between them. To achieve this merger, we'll use a feature of Cassandra tables that we haven't seen before—static columns.
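
A static column holds a single value per partition, shared by every row in that partition. The following Python sketch is a loose in-memory analogy of that behavior. The Partition class, the email column, the integer ids, and the sample values are all hypothetical choices for this illustration, not the schema defined in this chapter.

```python
class Partition:
    """One user's data: shared static values plus per-row values."""

    def __init__(self, username):
        self.username = username
        self.static = {}  # one value per partition, like static columns
        self.rows = []    # one entry per status update

    def set_static(self, column, value):
        self.static[column] = value

    def add_row(self, status_id, body):
        self.rows.append({"id": status_id, "body": body})

    def read(self):
        """Return rows with the partition key and static values merged in."""
        return [
            {"username": self.username, **self.static, **row}
            for row in self.rows
        ]


alice = Partition("alice")
alice.set_static("email", "alice@example.com")
alice.add_row(1, "Learning Cassandra!")  # placeholder ids, not timeuuids
alice.add_row(2, "Still learning.")

# Changing the static value once is reflected in every row read back.
alice.set_static("email", "alice@example.org")
for row in alice.read():
    print(row)
```

The parent's data (the static values) and the children's data (the rows) live together under one partition key, which is the coupling that static columns make possible.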

Defining...

Refining our mental model

In the previous chapter, we began to develop a mental model of a Cassandra table that looked like a key-value store where each value is a collection of columns with values. Now that we have seen compound primary keys, we can refine that mental model to take into account the more complex structures we now know how to build, as follows:

We can envision the user_status_updates table as a more robust key-value structure. Our keys are still usernames, but the values are now ordered collections of rows, both identified and ordered by the id clustering column. As with our earlier model, each partition key's data stands alone; to get data from multiple partitions, we have to go looking in multiple places.
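
This mental model can be sketched in a few lines of Python. It is a toy model under simplifying assumptions, not a description of how Cassandra stores data. The StatusUpdatesModel class is a hypothetical name, and the second id used for alice is made up for the example.

```python
import bisect
import uuid
from collections import defaultdict


class StatusUpdatesModel:
    """Toy model: partition key -> rows kept sorted by clustering column."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, username, status_id, body):
        # Sort on the timestamp embedded in the version 1 UUID, using the
        # string form as a tie-breaker (a simplification of real ordering).
        # Unlike Cassandra, this toy does not overwrite an existing row
        # that has the same primary key.
        sort_key = (uuid.UUID(status_id).time, status_id)
        bisect.insort(self._partitions[username], (sort_key, body))

    def timeline(self, username):
        """Return (id, body) pairs for one partition, oldest first."""
        return [(key[1], body) for key, body in self._partitions.get(username, [])]


model = StatusUpdatesModel()
# A made-up later id for alice, inserted first to show that order is kept.
model.insert("alice", "16e2f240-e798-11e3-90ce-5f98e903bf02", "A later update")
model.insert("alice", "76e7a4d0-e796-11e3-90ce-5f98e903bf02", "Learning Cassandra!")
model.insert("bob", "97719c50-e797-11e3-90ce-5f98e903bf02", "Eating a tasty sandwich.")

# Each partition stands alone, so reading two users means two lookups.
for user in ("alice", "bob"):
    print(user, model.timeline(user))
```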

Summary

In this chapter, we were introduced to the concept of compound primary keys and learned that a primary key consists of one or more partition key columns and, optionally, one or more clustering columns. We saw how partition keys (the only type of key we had previously encountered) group related rows together and how clustering columns provide an order for these rows within each partition.

Compound primary keys allow us to build a table containing users' status updates because they expose two important structures: grouping of related rows and ordering of rows. In the user_status_updates table, we encoded the relationship between users and their status updates implicitly in the structure of the primary key; the partition key refers to the parent row in the users table. We also explored the use of static columns to make this relationship explicit, storing all the information about users and their status updates in a single table.

In Chapter 4, Beyond Key-Value Lookup, we will dive into new...