
You're reading from Seven NoSQL Databases in a Week

Product type Book

Published in Mar 2018

Publisher Packt

ISBN-13 9781787288867

Pages 308 pages

Edition 1st Edition

Languages

Concepts

Database Programming

Authors (2):

Sudarshan Kadambi

Xun (Brian) Wu

Chapter 5. Cassandra

Apache Cassandra is one of the most widely used NoSQL databases. It is used by many enterprise organizations to store and serve data on a large scale. Cassandra is also designed to continue to function through partial node failure, remaining highly available to serve consistently high amounts of traffic.

In this chapter, we will discuss Cassandra at length, going into detail on several topics, including:

Key features
Appropriate use cases
Anti-patterns
How to leverage Cassandra with different tools and languages, such as:
- Nodetool
- CQLSH
- Python
- Java

By the end of this chapter, you will understand how to build and architect a Cassandra cluster from the ground up. You will find tips and best practices on hardware selection, installation, and configuration. You will also learn how to perform common operational tasks on a Cassandra cluster, such as adding or removing a node, and taking or restoring from a snapshot.

We will start with a quick introduction to Cassandra, and illustrate...

Introduction to Cassandra

Cassandra is an open source, distributed, non-relational, partitioned row store. Cassandra rows are organized into tables and indexed by a key. It uses an append-only, log-based storage engine. Data in Cassandra is distributed across multiple masterless nodes, with no single point of failure. It is a top-level Apache project, and its development is currently overseen by the Apache Software Foundation (ASF).

Each individual machine running Cassandra is known as a node. Nodes configured to work together and support the same dataset are joined into a cluster (also called a ring). Cassandra clusters can be further subdivided based on geographic location, by being assigned to a logical data center (and potentially even further into logical racks). Nodes within the same data center share the same replication factor, a setting that tells Cassandra how many copies of a piece of data to store on the nodes in that data center. Nodes within a cluster are kept informed...
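To make the idea of a ring and a replication factor more concrete, here is a minimal sketch in Python. It is a toy model, not Cassandra's implementation: it assumes a hypothetical three-node ring with one evenly spaced token per node, uses SHA-256 where Cassandra's default partitioner uses Murmur3, and ignores data centers, racks, and virtual nodes. The names token_for, replicas_for, and RING are invented for this illustration.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 128


def token_for(key):
    """Hash a partition key to an integer token (SHA-256 here, for illustration only)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % TOKEN_SPACE


def replicas_for(key, ring, replication_factor):
    """Walk the ring clockwise from the key's token and collect distinct nodes."""
    tokens = sorted(ring)
    start = bisect.bisect_left(tokens, token_for(key))
    replicas = []
    for step in range(len(tokens)):
        node = ring[tokens[(start + step) % len(tokens)]]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == replication_factor:
            break
    return replicas


# Hypothetical three-node ring: token -> node address, tokens evenly spaced.
RING = {i * (TOKEN_SPACE // 3): "192.168.0.10%d" % i for i in range(3)}

# With a replication factor of 2, each key is stored on two distinct nodes.
print(replicas_for("user:42", RING, replication_factor=2))
```

With a replication factor of 3 on this three-node ring, every node holds a copy of every key, which is why a single node can fail without data becoming unavailable.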

What problems does Cassandra solve?

Cassandra is designed to solve problems associated with operating at a large (web) scale. It was designed under principles similar to those discussed in Amazon's Dynamo paper [7, p.205], where in a large, complicated system of interconnected hardware, something is always in a state of failure. Given Cassandra's masterless architecture, it is able to continue to perform operations despite a small (albeit significant) number of hardware failures.

In addition to high availability, Cassandra also provides network partition tolerance. When using a traditional RDBMS, reaching the limits of a particular server's resources can only be solved by vertical scaling or scaling up. Essentially, the database server is augmented with additional memory, CPU cores, or disks in an attempt to meet the growing dataset or operational load. Cassandra, on the other hand, embraces the concept of horizontal scaling or scaling out. That is, instead of adding more hardware resources to a server...

What are the key features of Cassandra?

There are several features that make Cassandra a desirable data store. Some of its more intrinsic features may not be overtly apparent to application developers and end users, but their ability to abstract complexity and deliver performance ultimately improves the experience on the application side. Understanding these features is paramount to knowing when Cassandra can be a good fit on the backend.

No single point of failure

In Cassandra, multiple copies of the data are stored on multiple nodes. This design allows the cluster (and the applications that it serves) to continue to function in the event of a loss of one or more nodes. This feature allows Cassandra to remain available during maintenance windows and even upgrades.

Tunable consistency

Although Cassandra embraces the AP side of the CAP theorem triangle, it does allow the level of consistency to be adjusted. This especially becomes useful in multi-tenant clusters, where some applications...
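The arithmetic behind tunable consistency can be sketched in a few lines of Python. This is only an illustration of the general rule that a read is guaranteed to overlap the latest write when the number of replicas written plus the number of replicas read exceeds the replication factor. The function names are invented for this example and are not part of any Cassandra API.

```python
def quorum(replication_factor):
    """A quorum is a majority of the replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(write_replicas, read_replicas, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    return write_replicas + read_replicas > replication_factor


rf = 3
q = quorum(rf)

# Writing and reading at quorum overlap on at least one replica.
print(q, read_overlaps_write(q, q, rf))

# Writing and reading a single replica each may miss the latest write.
print(read_overlaps_write(1, 1, rf))
```

An application that needs stronger guarantees can pay for them with quorum reads and writes, while another application on the same cluster can favor latency by touching fewer replicas.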

Appropriate use cases for Cassandra

There are several known, good use cases for Cassandra. Understanding how Cassandra's write path works can help you in determining whether or not it will work well for your use case:

Cassandra applies writes both in memory and on disk.

Note

The commit log exists to provide durability. If a Cassandra node experiences a plug-out-of-the-wall event, the commit log is verified against what is stored on disk when the Cassandra process is restarted. If there was any data stored in memory that had not yet been persisted to disk, it is replayed from the commit log at that time.
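The role of the commit log can be illustrated with a toy model in Python. This is a simplified sketch under stated assumptions: the ToyNode class is invented for this example, plain Python lists and dictionaries stand in for on-disk files, and clearing the log on flush is a simplification of how a real node tracks which log segments are safe to discard.

```python
class ToyNode:
    """A toy node: the commit log and flushed files survive a crash, the memtable does not."""

    def __init__(self, commit_log=None, flushed=None):
        self.commit_log = list(commit_log or [])  # stands in for the on-disk log
        self.flushed = list(flushed or [])        # stands in for on-disk data files
        self.memtable = {}                        # in-memory only
        # On startup, replay any writes that never made it to a data file.
        for key, value in self.commit_log:
            self.memtable[key] = value

    def write(self, key, value):
        self.commit_log.append((key, value))  # record the write durably first
        self.memtable[key] = value            # then apply it in memory

    def flush(self):
        self.flushed.append(dict(self.memtable))  # persist the in-memory data
        self.memtable.clear()
        self.commit_log.clear()                   # flushed writes need no replay


node = ToyNode()
node.write("a", 1)
node.flush()
node.write("b", 2)  # still only in memory and in the commit log

# Simulate a plug-out-of-the-wall event: only the "on-disk" state survives.
restarted = ToyNode(commit_log=node.commit_log, flushed=node.flushed)
print(restarted.memtable)
```

After the simulated restart, the unflushed write to "b" is back in memory because it was replayed from the commit log, while "a" is already safe in a flushed file.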

Overview of the internals

The preceding figure showed that a write is stored both in memory and on disk. Periodically, the data is flushed from memory to disk:

Note

The main thing to remember is that Cassandra writes its sorted string data files (SSTable files) as immutable. That is, they are written once, and never modified. When an SSTable file reaches its maximum capacity, another is written. Therefore...
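The write-once nature of these files can be sketched in Python. This is a toy model, not Cassandra's storage engine: the ToyTable class and its memtable_limit parameter are invented for this example, an immutable tuple of sorted pairs stands in for an SSTable file, and a flush is triggered here by a simple count of in-memory keys.

```python
class ToyTable:
    """Buffers writes in memory and flushes them to immutable, sorted files."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}
        self.sstables = []  # each entry is a tuple: written once, never modified

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            # Sort by key and freeze into a tuple to model an immutable file.
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}


table = ToyTable(memtable_limit=2)
table.write("b", 1)
table.write("a", 2)  # reaches the limit, so the first file is written
table.write("a", 3)  # a newer value for "a" cannot modify the first file
table.write("c", 4)  # reaches the limit again, so a second file is written

for sstable in table.sstables:
    print(sstable)
```

The key "a" now appears in both files with different values, and the older file is left untouched.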

Cassandra anti-patterns

Cassandra is a great tool for solving specific problems, but it is not a general-purpose data store. Considering the prior section where we discussed the read and write paths, there are some obvious scenarios in which Cassandra is not the correct choice of the data store. These are important to remember, and we will discuss them in this section:

Cassandra reconciles data returned from both memory and disk at read time.

Frequently updated data

Primary keys in Cassandra are unique. Therefore, there is no difference between an insert and an update in Cassandra; they are both treated as a write operation. Given that its underlying data files are immutable, it is possible that multiple writes for the same key will store different data in multiple files. The overwritten data doesn't automatically go away. It becomes obsolete (due to its timestamp).

When Cassandra processes a read request, it checks for the requested data from both memory and disk. If the requested data was written...
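The cost of frequently updated data shows up at read time, which can be sketched in Python as follows. This is a simplified illustration, assuming each stored cell is a (timestamp, value) pair and that the newest timestamp wins. The read function and the sample data are invented for this example.

```python
def read(key, memtable, sstables):
    """Return the value with the newest timestamp across memory and all files."""
    candidates = []
    for source in [memtable, *sstables]:
        if key in source:
            candidates.append(source[key])  # each cell is (timestamp, value)
    if not candidates:
        return None
    # Every copy found must be compared; the newest timestamp wins.
    return max(candidates, key=lambda cell: cell[0])[1]


# Hypothetical state after the same key was written three times.
sstables = [
    {"user:42": (1, "first@example.com")},
    {"user:42": (2, "second@example.com")},
]
memtable = {"user:42": (3, "third@example.com")}

print(read("user:42", memtable, sstables))
```

The more often a key is rewritten, the more obsolete copies a read may have to find and discard before it can return the current value.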

Cassandra hardware selection, installation, and configuration

Cassandra is designed to be run in the cloud or on commodity hardware, so (relative to relational databases) you usually don't need to worry about breaking the bank on expensive, heavy-duty hardware. Most documentation on hardware recommendations for Cassandra is somewhat cryptic and reluctant to put forth any solid numbers on hardware requirements. The Apache Cassandra project documentation [1] has a section titled Hardware Choices, which states:

While Cassandra can be made to run on small servers for testing or development environments (including Raspberry Pis), a minimal production server should have at least 2 cores and 8 GB of RAM. Typical production servers have 8 or more cores and 32 GB of RAM.

RAM

One aspect to consider is that Cassandra runs on a JVM. This means that you need to have at least enough random access memory (RAM) to hold the JVM heap, plus another 30-50% or so for additional OS processes and off-heap storage...
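As a quick worked example of that rule of thumb, the following Python sketch computes a RAM range from a heap size. The 8 GB heap is an assumed figure chosen only for illustration, and the function name is invented for this example.

```python
def total_ram_range_gb(heap_gb, low=0.30, high=0.50):
    """Return (minimum, maximum) RAM in GB: the heap plus 30-50% headroom."""
    return heap_gb * (1 + low), heap_gb * (1 + high)


# Assumed heap size for this illustration only.
minimum, maximum = total_ram_range_gb(8)
print("Plan for roughly %.1f to %.1f GB of RAM" % (minimum, maximum))
```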

Node configuration

To configure your node properly, you will need your machine's IP address (assume 192.168.0.100, for this exercise). Once you have that, look inside your configuration directory ($CASSANDRA_HOME/conf for Tarball installs, /etc/cassandra for apt-get installs) and you will notice several files: cassandra.yaml, cassandra-env.sh, and cassandra-rackdc.properties among them.

In the cassandra.yaml file, make the following adjustments:

I'll name the cluster PermanentWaves:

cluster_name: "PermanentWaves"

Next, I'll designate this node as a seed node. Basically, this means other nodes will look for this node when joining the cluster. Do not make all of your nodes seed nodes:

seeds: "192.168.0.100"

Usually, listen_address and rpc_address will be set to the same IP address. In some cloud implementations, it may be necessary to also set broadcast_address and/or broadcast_rpc_address to your instance's external IP address instead. But for a basic, on-the-metal setup, this will work fine:

listen_address...

Running Cassandra

With the configuration complete, start up your node.

If you used the Tarball install:

bin/cassandra -p cassandra.pid

Note

It is not recommended to run Cassandra as the root user or as a user with access to sudo. You should create a Cassandra user, and ensure that Cassandra runs as that user instead.

If you used apt-get for the install:

sudo service cassandra start

Cassandra comes with the nodetool utility, which is very useful for performing operational duties, as well as for assessing the health of your node and/or cluster. To verify that your node has started, run nodetool status, which should return information on your cluster, based on the gossip information held by the node that you are logged into:

nodetool status

Datacenter: LakesidePark
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address        Load      Tokens  Owns (effective)  Host ID                               Rack
UN 192.168.0.100  1.33 MiB  256     100.0%            954d394b-f96f-473f-ad23-cbe4fd0672c8  R40
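If you want to check this output from a script, a small Python sketch such as the following can pick out the node rows. It relies only on the two-letter status and state code at the start of each row, as described in the legend of the output itself. The function names are invented for this example, and the sample text is the output shown above.

```python
SAMPLE = """\
Datacenter: LakesidePark
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.0.100 1.33 MiB 256 100.0% 954d394b-f96f-473f-ad23-cbe4fd0672c8 R40
"""


def parse_status(output):
    """Return a list of (code, address) pairs, one per node row."""
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        code = fields[0]
        # Node rows start with U or D (up/down) followed by N, L, J, or M.
        if len(code) == 2 and code[0] in "UD" and code[1] in "NLJM":
            nodes.append((code, fields[1]))
    return nodes


def unhealthy(output):
    """Return the addresses of nodes that are not both Up and Normal."""
    return [address for code, address in parse_status(output) if code != "UN"]


print(parse_status(SAMPLE))
print(unhealthy(SAMPLE))
```

In practice, the text could come from running nodetool status through Python's subprocess module rather than from a hard-coded sample.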

See the Using Cassandra section for...

Using Cassandra

Now that we have a running cluster, we will cover some simple examples to explore some of Cassandra's basic functionality. This section will introduce command-line tools, such as nodetool and CQLSH, as well as examples for interacting with Cassandra via Python and Java.

Nodetool

Nodetool is Cassandra's collection of delivered tools that help with a variety of different operational and diagnostic functions. As previously mentioned, probably the most common nodetool command that you will run is nodetool status, which should produce output similar to this:

$ nodetool status
Datacenter: LakesidePark
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address        Load       Tokens  Owns    Host ID                          Rack
UN 192.168.0.100  84.15 MiB  16     100.0%  71700e62-2e28-4974-93e1-a2ad3f... r40
UN 192.168.0.102  83.27 MiB  16     100.0%  c3e61934-5fc1-4795-a05a-28443e... r40
UN 192.168.0.101  83.99 MiB  16     100.0%  fd352577-6be5-4d93...

Tips for success

The following are some tips you may need while using Cassandra:

Run Cassandra on Linux

Cassandra may work on Windows, but remember that this is a fairly new development in the Cassandra world. If you want the best chance of building a successful cluster, build it on Linux.

Open ports 7199, 7000, 7001, and 9042

Cassandra needs 7199 for JMX (nodetool), 7000 for gossip, 7001 for gossip over SSL, and 9042 for native binary (client connections). You shouldn't need Thrift (port 9160), so don't open the port or enable the protocol unless you have a reason to.
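A quick way to confirm that these ports are reachable from another machine is a small connection test. The following Python sketch uses only the standard library. The function names are invented for this example, and the host address is the one assumed earlier in this chapter. A successful TCP connection shows only that something is listening on the port, not that it is Cassandra.

```python
import socket

# Ports and purposes as listed above.
CASSANDRA_PORTS = {
    7199: "JMX (nodetool)",
    7000: "gossip",
    7001: "gossip over SSL",
    9042: "native binary (client connections)",
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def closed_ports(host, probe=port_is_open):
    """Return the Cassandra ports on host that the probe could not reach."""
    return [port for port in sorted(CASSANDRA_PORTS) if not probe(host, port)]


if __name__ == "__main__":
    for port in closed_ports("192.168.0.100"):
        print("Cannot reach port %d: %s" % (port, CASSANDRA_PORTS[port]))
```

The probe argument exists so that the check can be exercised without a network, for example by passing a function that simulates which ports respond.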

Enable security

At the very least, you should enable authorization and authentication.

Use solid state drives (SSDs) if possible

The primary bottleneck on Cassandra is disk I/O, and SSDs will help you to mitigate that. The cassandra.yaml file also contains some specific settings for optimizing an instance backed by SSDs, so be sure to look those up and activate them where appropriate. Never use a NAS or SAN for Cassandra.

Configure only...

Summary

In this chapter, we introduced the Cassandra database, discussed its features and appropriate use cases, and provided some example code for working with it. Cassandra has some intrinsic features that certainly make it a desirable backend data store. It can serve data in an environment with no single point of failure, as well as provide tunable consistency, linear scalability, and best-in-class data center awareness.

Common use cases for Cassandra include time series or event-driven data. It has also shown its ability to support large datasets, which continue to grow over time. Applications backed by Cassandra are successful when they use tables designed to fit well-defined, static query patterns.

It is important to remember that Cassandra also has its share of anti-patterns. Cassandra typically does not perform well with use cases architected around frequently updating or deleting data, queue-like functionality, or access patterns requiring query flexibility. Improper setup...

References

Apache Software Foundation (2016). Cassandra Documentation - Operations - Hardware Choices. Apache Software Foundation - Cassandra project site. Retrieved on 20170603 from: http://cassandra.apache.org/doc/latest/operating/hardware.html
Apache Software Foundation (2016). Cassandra Documentation - Downloads. Apache Software Foundation - Cassandra project site. Retrieved on 20170603 from: http://cassandra.apache.org/download/
Brenner B. (2017). Thousands of MongoDB databases compromised and held to ransom. Naked Security by Sophos. Retrieved on 20170604 from: https://nakedsecurity.sophos.com/2017/01/11/thousands-of-mongodb-databases-compromised-and-held-to-ransom/
Brewer E., Fox A. (1999). Harvest, Yield, and Scalable Tolerant Systems. University of California at Berkeley. Berkeley, CA. DOI: 10.1.1.24.3690. Retrieved on 20170530 from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.3690&rep=rep1&type=pdf