
Learning Apache Cassandra - Second Edition

About this book
Cassandra is a distributed database that stands out thanks to its robust feature set and intuitive interface, while providing high availability and scalability of a distributed data store. This book will introduce you to the rich feature set offered by Cassandra, and empower you to create and manage a highly scalable, performant and fault-tolerant database layer. The book starts by explaining the new features implemented in Cassandra 3.x and gets you set up with Cassandra. Then you’ll walk through data modeling in Cassandra and the rich feature set available to design a flexible schema. Next you’ll learn to create tables with composite partition keys, collections and user-defined types and get to know different methods to avoid denormalization of data. You will then proceed to create user-defined functions and aggregates in Cassandra. Then, you will set up a multi-node cluster and see how the dynamics of Cassandra change with it. Finally, you will implement some application-level optimizations using a Java client. By the end of this book, you'll be fully equipped to build powerful, scalable Cassandra database layers for your applications.
Publication date:
April 2017
Publisher
Packt
Pages
360
ISBN
9781787127296

 

Chapter 1. Getting Up and Running with Cassandra

As an application developer, you have almost certainly worked with databases extensively. You have probably built products using relational databases such as MySQL and PostgreSQL, and perhaps experimented with NoSQL databases including a document store such as MongoDB or a key-value store such as Redis. While each of these tools has its strengths, you will now consider whether a distributed database such as Cassandra might be the best choice for the task at hand.

In this chapter, we'll begin with the need for NoSQL databases to satisfy the conundrum of ever-growing data. We will see why NoSQL databases are becoming the de facto choice for big data and real-time web applications. We will also talk about the major reasons to choose Cassandra from among the many database options available to you. Having established that Cassandra is a great choice, we'll go through the nuts and bolts of getting a local Cassandra installation up and running. By the end of this chapter, you'll know the following:

  • What big data is and why relational databases are not a good choice
  • When and why Cassandra is a good choice for your application
  • How to install Cassandra on your development machine
  • How to interact with Cassandra using cqlsh
  • How to create a keyspace, table, and write a simple query
 

What is big data?


Big data is a relatively new term which has been gathering steam over the past few years. It refers to datasets that are too large to be stored in a traditional database system or processed by traditional data-processing pipelines. This data could be structured, semi-structured, or unstructured. The datasets that belong to this category usually scale to terabytes or petabytes of data. Big data usually involves one or more of the following:

  • Velocity: Data moves at an unprecedented speed and must be dealt with in a timely manner. Examples include online systems, sensors, social media, web clickstreams, and so on.

  • Volume: Organizations collect data from a variety of sources, including business transactions, social media, and information from sensor or machine-to-machine data. This could involve terabytes to petabytes of data. In the past, storing it would've been a problem, but new technologies have eased the burden.
  • Variety: Data comes in all sorts of formats ranging from structured data to be stored in traditional databases to unstructured data (blobs) such as images, audio files, and text files.

These are known as the 3Vs of big data.

In addition to these, we tend to associate another term with big data:

  • Complexity: Today's data comes from multiple sources, which makes it difficult to link, match, cleanse, and transform data across systems. However, it's necessary to connect and correlate relationships, hierarchies, and multiple data linkages, or your data can quickly spiral out of control. It must be able to traverse multiple data centers, cloud, and geographical zones.

 

 

Challenges of modern applications


Before we delve into the shortcomings of relational systems to handle big data, let's take a look at some of the challenges faced by modern web-facing and big data applications.

Later, this will give an insight into how NoSQL data stores or Cassandra, in particular, help solve these issues:

  • One of the most important challenges faced by a web-facing application is the ability to handle a large number of concurrent users. Think of a search engine such as Google, which handles millions of concurrent users at any given point of time, or a large online retailer. The response from these applications should be swift even as the number of users keeps on growing.
  • Modern applications need to be able to handle large amounts of data, which can scale to several petabytes of data and beyond. Consider a large social network with a few hundred million users:
    • Think of the amount of data generated in tracking and managing those users
    • Think of how this data can be used for analytics
  • Business-critical applications should continue running without much impact even when there is a system failure or multiple system failures (server failure, network failure, and so on). The applications should be able to handle failures gracefully without any data loss or interruptions.
  • These applications should be able to scale across multiple data centers and geographical regions to support customers from different regions around the world with minimum delay. Modern applications should be implementing fully distributed architectures and should be capable of scaling out horizontally to support any data size or any number of concurrent users.
 

Why not relational databases?


Relational database systems (RDBMS) have been the primary data store for enterprise applications for 20 years. Lately, NoSQL databases have been picking up a lot of steam, and businesses are slowly seeing a shift towards non-relational databases. There are a few reasons why relational databases don't seem like a good fit for modern big data web applications:

  • Relational databases are not designed for clustered solutions. There are some solutions that shard data across servers, but these are fragile, complex, and generally don't work well.

Note

Sharding solutions implemented by RDBMS are as follows:

  • MySQL's product MySQL Cluster provides clustering support, which adds many capabilities of non-relational systems. It is actually a NoSQL solution that integrates with the MySQL relational database. It partitions the data onto multiple nodes, and the data can be accessed via different APIs.
  • Oracle provides a clustering solution, Oracle RAC, which involves multiple nodes running an Oracle process accessing the same database files. This creates a single point of failure as well as resource limitations in accessing the database itself.

  • Relational databases are not a good fit for current hardware and architectures. They are usually scaled up using larger machines with more powerful hardware and maybe clustering and replication among a small number of nodes. Their core architecture is not a good fit for commodity hardware and thus doesn't work with scale-out architectures.

Note

Scale-out versus scale-up architecture:

  • Scaling out means adding more nodes to a system, such as adding more servers to a distributed database or filesystem. This is also known as horizontal scaling.
  • Scaling up means adding more resources to a single node within the system, such as adding more CPU, memory, or disks to a server. This is also known as vertical scaling.
 

How to handle big data


Now that we are convinced the relational model is not a good fit for big data, let's try to figure out ways to handle big data. These are the solutions that paved the way for various NoSQL databases:

  • Clustering: The data should be spread across different nodes in a cluster. The data should be replicated across multiple nodes in order to sustain node failures. This helps spread the data across the cluster, and different nodes contain different subsets of data. This improves performance and provides fault tolerance.

Note

A node is an instance of database software running on a server. Multiple instances of the same database could be running on the same server.

  • Flexible schema: Schemas should be flexible unlike the relational model and should evolve with the data.
  • Relax consistency: We should embrace the concept of eventual consistency, which means data will eventually be propagated to all the nodes in the cluster (in case of replication). Eventual consistency allows data replication across nodes with minimum overhead. This allows for fast writes without the need for distributed locking.
  • Denormalization of data: Denormalize data to optimize queries. This has to be done at the cost of writing and maintaining multiple copies of the same data.

 

 

What is Cassandra and why Cassandra?


Cassandra is a fully distributed, masterless database, offering scalability and fault tolerance superior to those of traditional single-master databases. Compared with other popular distributed databases such as Riak, HBase, and Voldemort, Cassandra offers a uniquely robust and expressive interface for modeling and querying data. What follows is an overview of several desirable database capabilities, with accompanying discussions of what Cassandra has to offer in each category. Apart from the following features, Apache Cassandra has a large user base including some top technology firms such as Apple, Netflix, and Instagram, who also open-source new features. It is being actively developed and has one of the largest open source communities, with close to 200 contributors and over 20,000 commits.

Horizontal scalability

Horizontal scalability refers to the ability to expand the storage and processing capacity of a database by adding more servers to a database cluster. A traditional single-master database's storage capacity is limited by the capacity of the server that hosts the master instance. If the data set outgrows this capacity and a more powerful server isn't available, the data set must be shared among multiple independent database instances that know nothing of each other. Your application bears responsibility for knowing to which instance a given piece of data belongs.

Cassandra, on the other hand, is deployed as a cluster of instances that are all aware of each other. From the client application's standpoint, the cluster is a single entity; the application need not know, nor care, which machine a piece of data belongs to. Instead, data can be read from or written to any instance in the cluster, referred to as a node; this node will forward the request to the instance where the data actually belongs. The result is that Cassandra deployments have an almost limitless capacity to store and process data. When additional capacity is required, more machines can simply be added to the cluster. When new machines join the cluster, Cassandra takes care of rebalancing the existing data so that each node in the expanded cluster has a roughly equal share. Also, the performance of a Cassandra cluster is directly proportional to the number of nodes within the cluster. As you keep on adding instances, the read and write throughput will keep increasing linearly.
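To make the idea of "any node can locate the owner of a piece of data" more concrete, here is a minimal Python sketch of a hash ring. It is a toy model under stated assumptions, not Cassandra's actual partitioner: the node names, the use of MD5, and the names `token_for`, `Ring`, and `owner` are all invented for this illustration.

```python
import bisect
import hashlib


def token_for(value: str) -> int:
    """Hash a string to a position on the ring (toy stand-in for a partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical ring in which each node owns the range ending at its own token."""

    def __init__(self, nodes):
        self._tokens = sorted((token_for(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        """Return the node whose token is the first one at or after the key's token."""
        positions = [t for t, _ in self._tokens]
        i = bisect.bisect_left(positions, token_for(key))
        return self._tokens[i % len(self._tokens)][1]  # wrap around the ring


keys = [f"user-{i}" for i in range(1000)]
small = Ring(["node-a", "node-b", "node-c"])
large = Ring(["node-a", "node-b", "node-c", "node-d"])

# Keys whose owner changes when node-d joins the cluster.
moved = [k for k in keys if small.owner(k) != large.owner(k)]
```

In this model, every key in `moved` ends up on `node-d`, and all other keys stay where they were. That is the property that lets a cluster grow without reshuffling all of its data.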

Note

Cassandra is one of several popular distributed databases inspired by the Dynamo architecture, originally published in a paper by Amazon. Other widely used implementations of Dynamo include Riak and Voldemort. You can read the original paper at http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf.

High availability

The simplest database deployments are run as a single instance on a single server. This sort of configuration is highly vulnerable to interruption: if the server is affected by a hardware failure or network connection outage, the application's ability to read and write data is completely lost until the server is restored. If the failure is catastrophic, the data on that server might be lost completely.

Master-slave architectures improve this picture a bit. The master instance receives all write operations, and then these operations are replicated to follower instances. The application can read data from the master or any of the follower instances, so a single host becoming unavailable will not prevent the application from continuing to read the data. A failure of the master, however, will still prevent the application from performing any write operations, so while this configuration provides high read availability, it doesn't completely provide high availability.

Cassandra, on the other hand, has no single point of failure for reading or writing data. Each piece of data is replicated to multiple nodes, but none of these nodes holds the authoritative master copy. All the nodes in a Cassandra cluster are peers without a master node. If a machine becomes unavailable, Cassandra will continue writing data to the other nodes that share data with that machine and will queue the operations and update the failed node when it rejoins the cluster. This means that in a typical configuration, multiple nodes must fail simultaneously for there to be any application-visible interruption in Cassandra's availability.

Note

How many copies? When you create a keyspace, Cassandra's version of a database, you specify how many copies of each piece of data should be stored; this is called the replication factor. A replication factor of 3 is a common choice for most use cases.

Figure: High availability in Cassandra. Clients reconnect to a different node if a node is down.

Write optimization

Traditional relational and document databases are optimized for read performance. Writing data to a relational database will typically involve making in-place updates to complicated data structures on disk, in order to maintain a data structure that can be read efficiently and flexibly. Updating these data structures is a very expensive operation from a standpoint of disk I/O, which is often the limiting factor for database performance. Since writes are more expensive than reads, you'll typically avoid any unnecessary updates to a relational database, even at the expense of extra read operations.

Cassandra, on the other hand, is highly optimized for write throughput and, in fact, never modifies data on disk; it only appends to existing files or creates new ones. This is much easier on disk I/O and means that Cassandra can provide astonishingly high write throughput. Since both writing data to Cassandra and storing data in Cassandra are inexpensive, denormalization carries little cost and is a good way to ensure that data can be efficiently read in various access scenarios.
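The append-only idea can be sketched in a few lines of Python. This is only a toy in-memory model, not Cassandra's storage engine: the class `AppendOnlyTable` and its counter-based timestamps are assumptions made for the example.

```python
import itertools


class AppendOnlyTable:
    """Toy store: writes only append; reads resolve the newest value per key."""

    def __init__(self):
        self._log = []                     # list of (timestamp, key, value)
        self._clock = itertools.count(1)   # stand-in for a write timestamp

    def write(self, key, value):
        # No lookup and no in-place update: just append a new record.
        self._log.append((next(self._clock), key, value))

    def read(self, key):
        # The record with the highest timestamp wins.
        matches = [(ts, v) for ts, k, v in self._log if k == key]
        return max(matches)[1] if matches else None

    def record_count(self):
        return len(self._log)
```

Writing the same key twice leaves two records in the log, and `read` returns the later one. The write path never has to find or modify the earlier record, which is why redundant updates are cheap in this style of design.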

Note

Because Cassandra is optimized for write volume, you shouldn't shy away from writing data to the database. In fact, it's most efficient to write without reading whenever possible, even if doing so might result in redundant updates.

Just because Cassandra is optimized for writes doesn't make it bad at reads; in fact, a well-designed Cassandra database can handle very heavy read loads with no problem. We'll cover the topic of efficient data modeling in great depth in the next few chapters.

Structured records

The first three database features we looked at are commonly found in distributed data stores. However, databases such as Riak and Voldemort are purely key-value stores; these databases have no knowledge of the internal structure of a record that's stored in a particular key. This means useful functions such as updating only part of a record, reading only certain fields from a record, or retrieving records that contain a particular value in a given field are not possible.

Relational databases such as PostgreSQL, document stores such as MongoDB, and, to a limited extent, newer key-value stores such as Redis, do have a concept of the internal structure of their records, and most application developers are accustomed to taking advantage of the possibilities this allows. None of these databases, however, offer the advantages of a masterless distributed architecture.

In Cassandra, records are structured much in the same way as they are in a relational database—using tables, rows, and columns. Thus, applications using Cassandra can enjoy all the benefits of masterless distributed storage while also getting all the advanced data modeling and access features associated with structured records.

Secondary indexes

A secondary index, commonly referred to as an index in the context of a relational database, is a structure allowing efficient lookup of records by some attribute other than their primary key. This is a widely useful capability; for instance, when developing a blog application, you would want to be able to easily retrieve all of the posts written by a particular author. Cassandra supports secondary indexes; while Cassandra's version is not as versatile as indexes in a typical relational database, it's a powerful feature in the right circumstances. A query on a secondary index can perform poorly in certain cases; hence, it should be used sparingly and only in certain cases, which we will touch upon later in Chapter 5, Establishing Relationships.

Materialized views

Data modeling principles in Cassandra compel us to denormalize data as much as possible. Prior to Cassandra 3.0, the only way to query on a non-primary key column was to create a secondary index and query on it. However, secondary indexes have a performance trade-off if they contain high cardinality data. Often, high cardinality secondary indexes have to scan data on all the nodes and aggregate them to return the query results. This defeats the purpose of having a distributed system.

To avoid secondary indexes and client-side denormalization, Cassandra introduced the feature of materialized views, which performs server-side denormalization. You can create views for a base table and Cassandra ensures eventual consistency between the base and view. This lets us do very fast lookups on each view following the normal Cassandra read path. Materialized views maintain a correspondence of one CQL row each in the base and the view, so we need to ensure that each CQL row that is required for the views will be reflected in the base table's primary keys. Although a materialized view allows for fast lookups on non-primary key columns, this comes at a performance hit to writes. Also, using secondary indexes and materialized views increases the disk usage by a considerable margin. Thus, it is important to take this into consideration when sizing your cluster.
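The trade-off between fast lookups and extra write work can be illustrated with a small Python sketch. It is a toy model, not how Cassandra implements views: the class `UsersTable`, its fields, and the assumption that each email belongs to at most one user are all invented for the example.

```python
class UsersTable:
    """Toy base table keyed by username, with a store-maintained view keyed by email."""

    def __init__(self):
        self.base = {}            # username -> row
        self.view_by_email = {}   # email -> row (the "materialized view")
        self.writes = 0           # counts physical writes to show the extra cost

    def upsert(self, username, email):
        old = self.base.get(username)
        if old is not None and old["email"] != email:
            # The stale view entry must be removed when the viewed column changes.
            del self.view_by_email[old["email"]]
            self.writes += 1
        row = {"username": username, "email": email}
        self.base[username] = row
        self.view_by_email[email] = row
        self.writes += 2          # one write for the base, one for the view

    def find_by_email(self, email):
        # A direct key lookup on the view; no scan over the base table.
        return self.view_by_email.get(email)
```

A single logical update to a user's email results in three physical writes in this model (remove the old view entry, write the base row, write the new view entry), while a lookup by email stays a single key access.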

Efficient result ordering

It's quite common to want to retrieve a record set ordered by a particular field; for instance, a photo-sharing service will want to retrieve the most recent photographs in descending order of creation. Since sorting data on the fly is a fundamentally expensive operation, databases must keep information about record ordering persisted on disk in order to efficiently return results in order. In a relational database, this is one of the jobs of a secondary index.

In Cassandra, secondary indexes can't be used for result ordering, but tables can be structured such that rows are always kept sorted by a given column or columns, called clustering columns. Sorting by arbitrary columns at read time is not possible, but the capacity to efficiently order records in any way and to retrieve ranges of records based on this ordering is an unusually powerful capability for a distributed database.
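As a rough illustration of what keeping rows sorted by a clustering column buys us, consider the following Python sketch of the photo-sharing example. It is a simplified in-memory model; the class `PhotosByUser`, its method names, and the integer timestamps are assumptions made for the example.

```python
import bisect
from collections import defaultdict


class PhotosByUser:
    """Toy table: partition key = user, clustering column = created_at."""

    def __init__(self):
        # user -> list of (created_at, photo_id), always kept sorted
        self._partitions = defaultdict(list)

    def insert(self, user, created_at, photo_id):
        # Place the row in sorted position at write time so reads need no sorting.
        bisect.insort(self._partitions[user], (created_at, photo_id))

    def latest(self, user, limit):
        # Newest first: take the tail of the sorted partition and reverse it.
        rows = self._partitions[user]
        tail = rows[max(0, len(rows) - limit):]
        return [photo_id for _, photo_id in reversed(tail)]

    def between(self, user, start, end):
        # Range scan on the clustering column: start inclusive, end exclusive.
        rows = self._partitions[user]
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end,))
        return [photo_id for _, photo_id in rows[lo:hi]]
```

Both `latest` and `between` read a contiguous slice of an already-sorted list. Asking for the same rows ordered by some other field would require sorting at read time, which is the operation the table design avoids.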

Immediate consistency

When we write a piece of data to a database, it is our hope that that data is immediately available to any other process that may wish to read it. From another point of view, when we read some data from a database, we would like to be guaranteed that the data we retrieve is the most recently updated version. This guarantee is called immediate consistency, and it's a property of most common single-master databases such as MySQL and PostgreSQL.

Distributed systems such as Cassandra typically do not provide an immediate consistency guarantee. Instead, developers must be willing to accept eventual consistency, which means when data is updated, the system will reflect that update at some point in the future. Developers are willing to give up immediate consistency precisely because it is a direct trade-off with high availability.

In the case of Cassandra, that trade-off is made explicit through tunable consistency. Each time you design a write or read path for data, you have the option of immediate consistency with less resilient availability, or eventual consistency with extremely resilient availability. We'll cover consistency tuning in great detail in Chapter 10, How Cassandra Distributes Data.
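The arithmetic behind this trade-off is simple enough to sketch. A read is guaranteed to see the latest write when the set of replicas read and the set of replicas written must overlap, that is, when their sizes add up to more than the replication factor. The helper names below are hypothetical and exist only for this illustration.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_replicas: int, read_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return write_replicas + read_replicas > replication_factor


def tolerated_failures(required_replicas: int, replication_factor: int) -> int:
    """How many replicas can be down while the operation can still succeed."""
    return replication_factor - required_replicas
```

With a replication factor of 3, quorum reads and quorum writes touch 2 replicas each, so they always overlap and each operation still succeeds with one replica down. Reading and writing a single replica tolerates two failures but gives up the overlap guarantee.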

Discretely writable collections

While it's useful for records to be internally structured into discrete fields, a given property of a record isn't always a single value such as a string or an integer. One simple way to handle fields that contain collections of values is to serialize them using a format such as JSON and then save the serialized collection into a text field. However, in order to update collections stored in this way, the serialized data must be read from the database, decoded, modified, and then written back to the database in its entirety. If two clients try to perform this kind of modification to the same record concurrently, one of the updates will be overwritten by the other. For this reason, many databases offer built-in collection structures that can be discretely updated: values can be added to and removed from collections without reading and rewriting the entire collection. Cassandra is no exception, offering list, set, and map collections, and supporting operations such as "append the number 3 to the end of this list". Neither the client nor Cassandra itself needs to read the current state of the collection in order to update it, meaning collection updates are also blazingly efficient.
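The lost-update problem described above is easy to reproduce in a few lines of Python. The sketch below simulates two clients in a single process; the `ListColumn` class is a hypothetical stand-in for a collection column that accepts discrete append operations.

```python
import json

# Approach 1: the whole list is stored as serialized JSON in a text field.
stored = json.dumps([1, 2])

# Two clients read the same serialized value before either writes back.
client_a = json.loads(stored)
client_b = json.loads(stored)
client_a.append(3)
client_b.append(4)

stored = json.dumps(client_a)   # client A writes the whole list back
stored = json.dumps(client_b)   # client B overwrites it; A's update is lost
serialized_result = json.loads(stored)


# Approach 2: the store accepts discrete "append" operations.
class ListColumn:
    def __init__(self, items):
        self._items = list(items)

    def append(self, value):
        # The client never reads the current contents before updating.
        self._items.append(value)

    def items(self):
        return list(self._items)


column = ListColumn([1, 2])
column.append(3)   # from client A
column.append(4)   # from client B
discrete_result = column.items()
```

Here `serialized_result` ends up as `[1, 2, 4]`, silently dropping client A's value, while `discrete_result` is `[1, 2, 3, 4]`.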

Relational joins

In real-world applications, different pieces of data relate to each other in a variety of ways. Relational databases allow us to perform queries that make these relationships explicit; for instance, to retrieve a set of events whose location is in the state of New York (this is assuming events and locations are different record types). Cassandra, however, is not a relational database and does not support anything like joins. Instead, applications using Cassandra typically denormalize data and make clever use of clustering in order to perform the sorts of data access that would use a join in a relational database.

For data sets that aren't already denormalized, applications can also perform client-side joins, which mimic the behavior of a relational database by performing multiple queries and joining the results at the application level. Client-side joins are less efficient than reading data that has been denormalized in advance, but they offer more flexibility. We'll cover both of these approaches in Chapter 6, Denormalizing Data for Maximum Performance.
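A client-side join for the events and locations example might look like the following Python sketch. The two lists stand in for the results of two separate queries; the field names and sample rows are invented for the illustration.

```python
# Hypothetical results of two separate queries, one per table.
events = [
    {"event_id": 1, "name": "Jazz Night", "location_id": 10},
    {"event_id": 2, "name": "Tech Meetup", "location_id": 20},
    {"event_id": 3, "name": "Book Fair", "location_id": 10},
]
locations = [
    {"location_id": 10, "city": "Albany", "state": "NY"},
    {"location_id": 20, "city": "Austin", "state": "TX"},
]


def events_in_state(events, locations, state):
    """Join the two result sets in application code and filter by state."""
    by_id = {loc["location_id"]: loc for loc in locations}
    joined = []
    for event in events:
        loc = by_id.get(event["location_id"])
        if loc is not None and loc["state"] == state:
            joined.append({**event, "city": loc["city"], "state": loc["state"]})
    return joined
```

Calling `events_in_state(events, locations, "NY")` returns the two Albany events with their city and state attached. The application pays for two round trips and does the matching itself, which is the efficiency cost mentioned above.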

 

MapReduce and Spark


MapReduce is a technique for performing aggregate processing on large amounts of data in parallel; it's a particularly common technique in data analytics applications. Cassandra does not offer built-in MapReduce capabilities, but it can be integrated with Hadoop in order to perform MapReduce operations across Cassandra data sets, or Spark for real-time data analysis. The DataStax Enterprise product provides integration with both of these tools out of the box.

Spark is a fast, distributed, and expressive computational engine used for large-scale data processing similar to MapReduce. It is much more efficient than MapReduce and runs with resource managers such as Mesos and YARN. It can read data from various sources such as Hadoop or Cassandra or even streams such as Kafka. DataStax provides a Spark-Cassandra connector to load data from Cassandra into Spark and run batch computations on the data.
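For readers who have not met MapReduce before, the following Python sketch shows the shape of the technique on a single machine. It does not use Hadoop or Spark; the function names and the sample page-view rows are assumptions made for the example, and a real framework would run the map and reduce phases in parallel across many nodes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply the mapper to every record, collecting the emitted (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]


def shuffle(pairs):
    """Group values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Fold each key's values down to a single result."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Hypothetical page-view rows; count the views per page.
rows = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
counts = reduce_phase(
    shuffle(map_phase(rows, lambda row: [(row["page"], 1)])),
    lambda a, b: a + b,
)
```

After running this, `counts` maps `/home` to 2 and `/about` to 1.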

Rich and flexible data model

Cassandra provides an SQL-like syntax to interact with the database. Cassandra Query Language (CQL) presents a familiar row column representation of data. CQL provides a familiar SQL-like table definition with columns and defined data types. Schema is flexible, and new columns can be added while using the existing data. The data model doesn't support features problematic in distributed systems such as joins. On top of this, Cassandra provides other features such as collections to store multiple items in a single column. It also lets you easily define secondary indexes and materialized views for fast lookups on non-primary key columns.

Note

The previous Thrift-based interface was closely tied to the internal storage view. This was fairly complex and had a relatively steep learning curve. The CQL interface is much easier to understand because of its similarity to SQL.

 

Lightweight transactions

As discussed before, Cassandra provides eventual consistency rather than immediate consistency, which means that written data will eventually become consistent across all replicas. This has implications for the data returned by read queries: reads may return stale data, depending on the consistency levels at which the writes and reads are performed. Strong consistency, which means reading the most recently written value, can be achieved using quorum reads and writes.

But what if strong consistency is not enough? What if we have operations that must be performed in sequence without being interrupted by others; that is, we must either perform them one at a time or make sure that any we run concurrently get the same results as if they really were processed independently? For these cases, Cassandra provides lightweight transactions with linearizable consistency, which ensure a transaction isolation level similar to the serializable level offered by RDBMSs. They are also known as compare-and-set transactions. You can use lightweight transactions instead of durable transactions with eventual/tunable consistency for situations that require the nodes in the distributed system to agree on changes to the data.
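The statement that quorum reads and writes give strong consistency comes down to simple arithmetic: if the number of replicas acknowledging a write plus the number consulted on a read exceeds the replication factor, the two sets must share at least one replica. A minimal sketch of that check follows; the function names are illustrative and are not part of any Cassandra API:

```python
def quorum(replication_factor):
    """Number of replicas in a quorum: a strict majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set."""
    return write_replicas + read_replicas > replication_factor


rf = 3
q = quorum(rf)

# Quorum writes plus quorum reads always overlap in at least one replica.
strong = reads_see_latest_write(rf, q, q)

# Writing to one replica and reading from one replica may miss the write.
weak = reads_see_latest_write(rf, 1, 1)
```

With a replication factor of 3, a quorum is 2 replicas, so a quorum write and a quorum read touch 4 replica slots between them and cannot avoid each other.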

Multi-data center replication

Another interesting feature provided by Cassandra is the ability to replicate data across multiple data centers or geographical zones in near real time. This is natively supported by Cassandra and doesn't need to be managed at the application level. Cassandra also provides local consistency levels to ensure that cross-region latency doesn't impact client queries. A multi-region cluster can survive disasters or entire data centers going down. Ideally, there is no need for backups or disaster recovery when running multi-data center clusters, except in cases of data corruption.

Note

On April 21, 2011, Amazon experienced a large outage in AWS US-East. Some websites were impacted, while others were not. Netflix's systems are designed explicitly for these sorts of failures: the SimpleDB, S3, and Cassandra services that Netflix depends upon were not affected by the outage because of the cross-region replication that these services provide.

 

Comparing Cassandra to the alternatives

Now that you've got an in-depth understanding of the feature set that Cassandra offers, it's time to figure out which features are most important to you and which database is the best fit. The following table lists a handful of commonly used databases and key features that they do or don't have:

| Feature | Cassandra | PostgreSQL | MongoDB | Redis | Riak |
|---|---|---|---|---|---|
| Structured records | Yes | Yes | Yes | Limited | No |
| Secondary indexes | Yes | Yes | Yes | No | Yes |
| Discretely writable collections | Yes | Yes | Yes | Yes | No |
| Relational joins | No | Yes | No | No | No |
| Built-in MapReduce | No | No | Yes | No | Yes |
| Fast result ordering | Yes | Yes | Yes | Yes | No |
| Immediate consistency | Configurable at query level | Yes | Yes | Yes | Configurable at cluster level |
| Transparent sharding | Yes | No | Yes | No | Yes |
| No single point of failure | Yes | No | No | No | Yes |
| High throughput writes | Yes | No | No | Yes | Yes |

As you can see, Cassandra offers a unique combination of scalability, availability, and a rich set of features for modeling and accessing data.

 

Installing Cassandra


Now that you're acquainted with Cassandra's substantial powers, you're no doubt chomping at the bit to try it out. Happily, Cassandra is free, open source, and very easy to get running.

Since some of the features covered in this book are specific to Cassandra 3.0, we will install the latest Cassandra 3.0.x release available for each OS. Also, since it is easier to add apt and yum repositories from DataStax, we will use the DataStax Community version of Cassandra wherever possible. DataStax Community is open source Apache Cassandra with a few added features.

 

Installing the JDK


First, we need to make sure that we have an up-to-date installation of Java. The recommended Java version for Cassandra 3.0 is Oracle Java 1.8, so make sure you have the correct Java Development Kit (JDK) version installed. Go to http://www.oracle.com/technetwork/java/javase/downloads/index.html and download and install the appropriate JDK. Then open the Terminal application, and type the following into the command prompt:

$ java -version

You will see an output that looks similar to the following:

    java version "1.8.0_65"
    Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
    Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)    

Once you've got the right version of Java, you're ready to install Cassandra.
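If you would rather check the version in a script than by eye, the following Python sketch extracts the major Java version from the first line of the output shown earlier. The function name is hypothetical, and the sketch only parses a string; it assumes you have already captured the output of `java -version`, which the JVM writes to standard error:

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first, second = match.group(1), match.group(2)
    # Legacy numbering: "1.8.0_65" means Java 8.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


sample = 'java version "1.8.0_65"'
major = java_major_version(sample)
is_java_8 = major == 8
```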

 

Installing on Debian-based systems (Ubuntu)


To install Cassandra on Debian-based systems (Ubuntu), perform the following steps:

  1. Add the repository to /etc/apt/sources.list.d/cassandra.sources.list:
        $ echo "deb http://debian.datastax.com/community stable main" | \
          sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
  2. Add the DataStax repository key to your apt trusted keys:
        $ curl -L https://debian.datastax.com/debian/repo_key | \
          sudo apt-key add -
  3. Install the latest package:
        $ sudo apt-get update
        $ sudo apt-get install dsc30
  4. Start Cassandra:
        $ sudo service cassandra start
  5. Verify the Cassandra installation:
        $ nodetool status

     The output should list a single node in the UN (Up/Normal) state, with a Host ID unique to your installation.

  6. Stop Cassandra:
        $ sudo service cassandra stop
 

Installing on RHEL-based systems


To install Cassandra on RHEL-based systems, perform the following steps:

  1. Add the Apache Cassandra 3.0 repository to /etc/yum.repos.d/datastax.repo:
        [datastax]
        name = DataStax Repo for Apache Cassandra
        baseurl = http://rpm.datastax.com/community
        enabled = 1
        gpgcheck = 0
  2. Install the latest packages:
        $ sudo yum install dsc30
  3. Start Cassandra:
        $ sudo service cassandra start

     On some Linux distributions, you may need to use the following:

        $ sudo /etc/init.d/cassandra start
 

Installing on Windows


The easiest way to install Cassandra on Windows is to use the DataStax Community Edition. DataStax is a company that provides enterprise-level support for Cassandra; they also release Cassandra packages at both free and paid tiers. The DataStax Community Edition is free and does not differ from the Apache package in any meaningful way.

DataStax offers a graphical installer for Cassandra on Windows, which is available for download at http://planetcassandra.org/cassandra.

On this page, locate MSI Installer (32 bit or 64 bit depending on your operating system) under Download DataStax Community Edition v3.0.9, and then install the downloaded binary as follows.

Note

From Windows PowerShell, set the execution policy to unrestricted from an elevated command prompt:

        C:> powershell Set-ExecutionPolicy Unrestricted

  1. Follow the setup wizard to install.
  2. The installation wizard lets you set up the DataStax services to start automatically when the installation completes and whenever the computer reboots. If you select this option, the services start automatically and you can skip the next step.
  3. If you did not elect to have the DataStax services start automatically, start Cassandra by entering the following in Command Prompt:
        C:> net start DataStax_Cassandra_Community_Server
        C:> net start DataStax_DDC_Server
  4. To verify that Apache Cassandra 3.0 for Windows is running, use the nodetool status command. For example:
        C:> cd %CASSANDRA_HOME%
        C:> bin\nodetool status
  5. Stop Cassandra:
        C:> net stop DataStax_Cassandra_Community_Server
 

Installing on Mac OS X


You'll need to set up your environment so that Cassandra knows where to find the latest version of Java. To do this, set up your JAVA_HOME environment variable to the install location, and your PATH to include the executable in your new Java installation, as follows:

$ export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_65.jdk/Contents/Home"
$ export PATH="$JAVA_HOME/bin:$PATH"

You should put these two lines at the bottom of your .bashrc or .bash_profile file to ensure that things still work when you open a new terminal.

Note

The installation instructions given earlier assume that you're using the latest version of Mac OS X (at the time of writing this, 10.11.6 El Capitan). If you're running a different version of OS X, installing Java might require different steps. Check out https://www.java.com/en/download/faq/java_mac.xml for detailed installation information.

Once you've got the right version of Java, you're ready to install Cassandra. It's very easy to install Cassandra using Homebrew; simply type the following:

$ brew install cassandra
$ pip install cassandra-driver cql
$ cassandra

Here's what we just did:

  • Installed Cassandra using the Homebrew package manager
  • Installed the CQL shell and its dependency, the Python Cassandra driver
  • Started the Cassandra server
 

Installing the binary tarball


You can use the binary tarball to install Cassandra on any Unix-like platform, including Linux and Mac OS X. This is useful on platforms without package support, or if you do not want a root installation:

  1. Download the Apache Cassandra 3.0.9 binary tarball from the following location:

     http://www.apache.org/dyn/closer.lua/cassandra/3.0.9/apache-cassandra-3.0.9-bin.tar.gz

  2. Use the following command to untar:
        $ tar -xvzf apache-cassandra-3.0.9-bin.tar.gz
  3. To configure Cassandra, go to the $INSTALL_LOCATION/conf directory and make the relevant changes. You can do this once you get an idea of Cassandra internals later in the book. INSTALL_LOCATION in this case will be $CURRENT_DIRECTORY/apache-cassandra-3.0.9/.
  4. Start Cassandra:
        $ cd apache-cassandra-3.0.9/
        $ bin/cassandra # Use -f to start Cassandra in the foreground
  5. Verify that Cassandra is running:
        $ bin/nodetool status

 

Note

You might want to add CASSANDRA_HOME=$INSTALL_LOCATION and PATH=$PATH:$INSTALL_LOCATION/bin to your .bashrc or .bash_profile file. That way, every time you open a new terminal, you can launch cassandra, nodetool, or cqlsh by name without changing directory first.

 

Bootstrapping the project


Throughout the remainder of this book, we will build an application called MyStatus, which allows users to post status updates for their friends to read. In each chapter, we'll add new functionality to the MyStatus application; each new feature will also introduce a new aspect of Cassandra.

 

CQL—the Cassandra Query Language


Since this is a book about Cassandra and not targeted to users of any particular programming language or application framework, we will focus entirely on the database interactions that MyStatus will require. Code examples will be in Cassandra Query Language (CQL). Specifically, we'll use version 3.4.0 of CQL, which is available in Cassandra 3.0 and later versions.

As the name implies, CQL is heavily inspired by SQL; in fact, many CQL statements are equally valid SQL statements. However, CQL and SQL are not interchangeable. CQL lacks grammar for relational features such as JOIN statements, which are not possible in Cassandra. Conversely, CQL is not a subset of SQL; constructs for retrieving the update time of a given column, or performing an update in a lightweight transaction, which are available in CQL, do not have an SQL equivalent.

Note

Throughout this book, you'll learn the important constructs of CQL. Once you've finished reading this book, I recommend that you turn to the DataStax CQL documentation for additional reference. This documentation is available at http://www.datastax.com/documentation/cql/3.3.

 

 

Interacting with Cassandra


Most common programming languages have drivers for interacting with Cassandra. When selecting a driver, you should look for libraries that support the CQL binary protocol, which is the latest and most efficient way to communicate with Cassandra.

Note

The CQL binary protocol is a relatively new introduction; older versions of Cassandra used the Thrift protocol as a transport layer. Although Cassandra continues to support Thrift, avoid Thrift-based drivers as they are less performant than the binary protocol.

Here are the CQL binary drivers available for some popular programming languages:

| Language | Driver | Available at |
|---|---|---|
| Java | DataStax Java Driver | https://github.com/datastax/java-driver |
| Python | DataStax Python Driver | https://github.com/datastax/python-driver |
| Ruby | DataStax Ruby Driver | https://github.com/datastax/ruby-driver |
| C++ | DataStax C++ Driver | https://github.com/datastax/cpp-driver |
| C# | DataStax C# Driver | https://github.com/datastax/csharp-driver |
| JavaScript (Node.js) | node-cassandra-cql | https://github.com/jorgebay/node-cassandra-cql |
| PHP | phpbinarycql | https://github.com/rmcfrazier/phpbinarycql |

While you are likely to use one of these drivers in your applications, to try out the code examples in this book, you can simply use the cqlsh tool, which is a command-line interface for executing CQL queries and viewing the results. To start cqlsh on OS X or Linux, simply type cqlsh into your command line; you should see something like this:

$ cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.9 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cqlsh>

On Windows, you can start cqlsh just the way you ran nodetool:

C:> cd %CASSANDRA_HOME%
C:> bin\cqlsh

Once you open it, you should see the same output we just saw.

 

Getting started with CQL


To get started with CQL, we will create a simple keyspace and table. We will insert a record into the table and read it back. Let's create a simple table which stores some personal information of a social network user.

Creating a keyspace

A keyspace is a collection of related tables, equivalent to a database in a relational system. To create a keyspace, issue the following statement in the CQL shell:

cqlsh> CREATE KEYSPACE "users"
WITH REPLICATION = {
 'class': 'SimpleStrategy', 'replication_factor': 1
};

Here, we created a keyspace called users. When we create a keyspace, we have to specify replication options. Cassandra provides several strategies for managing replication of data; SimpleStrategy is the best strategy as long as your Cassandra deployment does not span across multiple data centers. The replication_factor value tells Cassandra how many copies of each piece of data are to be kept in the cluster; since we are only running a single instance of Cassandra, there is no point in keeping more than one copy of the data. In a production deployment, you would certainly want a higher replication factor; three is a good place to start.
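If your application needs to create keyspaces for different environments, the replication options can be kept as data and rendered into the statement. The following Python sketch shows one way to do that. The helper names are hypothetical, the data center names `dc1` and `dc2` are placeholders, and the sketch assumes trusted input because it performs no escaping:

```python
def cql_literal(value):
    """Render a Python value as a CQL literal: strings are single-quoted."""
    return f"'{value}'" if isinstance(value, str) else str(value)


def create_keyspace_cql(name, replication):
    """Render a CREATE KEYSPACE statement from a replication options dict."""
    options = ", ".join(
        f"'{key}': {cql_literal(value)}" for key, value in replication.items()
    )
    return f'CREATE KEYSPACE "{name}" WITH REPLICATION = {{ {options} }};'


# Single-node development setup, matching the statement in the text.
single_node = create_keyspace_cql(
    "users", {"class": "SimpleStrategy", "replication_factor": 1}
)

# A deployment spanning data centers would use NetworkTopologyStrategy,
# with one replica count per data center (names here are placeholders).
multi_dc = create_keyspace_cql(
    "users", {"class": "NetworkTopologyStrategy", "dc1": 3, "dc2": 3}
)
```

Printing `single_node` yields the same statement issued in the CQL shell earlier, formatted on one line.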

Note

A few things at this point are worth noting about CQL's syntax:

  • It's syntactically very similar to SQL; as we further explore CQL, the impression of similarity will not diminish.
  • Double quotes are used for identifiers such as keyspace, table, and column names. As in SQL, quoting identifier names is usually optional, unless the identifier is a keyword or contains a space or another character that will trip up the parser.
  • Single quotes are used for string literals; the key value structure we use for replication is a map literal, which is syntactically similar to an object literal in JSON.
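The quoting rule for identifiers can be sketched as a small Python helper. This is an illustration under stated assumptions rather than a complete implementation: the function name is hypothetical, the reserved-word set is a small sample of CQL's much longer list, and names containing uppercase letters are quoted because unquoted identifiers are treated case-insensitively:

```python
import re

# A small illustrative sample of CQL reserved words; the real list is longer.
RESERVED_SAMPLE = {"select", "from", "table", "keyspace", "use", "insert"}

# Lowercase letter first, then lowercase letters, digits, or underscores.
_SAFE_UNQUOTED = re.compile(r"[a-z][a-z0-9_]*")


def quote_identifier(name):
    """Return name unchanged when it is safe unquoted, else double-quote it."""
    if _SAFE_UNQUOTED.fullmatch(name) and name not in RESERVED_SAMPLE:
        return name
    # Inside a quoted identifier, a double quote is escaped by doubling it.
    return '"' + name.replace('"', '""') + '"'


plain = quote_identifier("personal_info")
keyword = quote_identifier("table")
spaced = quote_identifier("personal info")
```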

Selecting a keyspace

Once you've created a keyspace, you'll want to use it. To do this, issue the USE command:

cqlsh> USE "users";

This tells Cassandra that all future commands will implicitly refer to tables inside the users keyspace. If you close the CQL shell and reopen it, you'll need to reissue this command.

Creating a table

Let's create a table within our users keyspace to store personal information. You can create a table by issuing the following command:

CREATE TABLE "personal_info" (id int PRIMARY KEY, name text, dob text);

Note

I will be omitting cqlsh> in the text from now on. You should always run the commands after entering cqlsh.

We have now created the personal_info table with three columns: id, a unique integer identifier for a user that also serves as the primary key for this table, and name and dob (date of birth), which are text values (strings).

Inserting and reading data

To insert data into the table, run the following command:

INSERT INTO personal_info (id, name, dob) VALUES (1, 'Alice', '02-25-1954');

This will insert a record for a user named Alice, whose date of birth is 02-25-1954 and who has been assigned the id 1. To read the data from the table, run the following query:

SELECT * FROM personal_info WHERE id = 1;

The output should contain a single row with the id 1, the name Alice, and the dob 02-25-1954.

Voila! You have created your first keyspace and table, inserted a record, and queried the record back.

 

New features in Cassandra 2.2, 3.0, and 3.X


Since the first edition of this book mostly covered Cassandra versions 2.1 and below, here is a list of features and improvements that have been made to Cassandra in version 2.2 and later. This will give you a gist of how Cassandra has matured over the last year:

  • JSON in CQL3: Cassandra 2.2 has support for inserting and selecting JSON data
  • User-defined functions: These can be defined to apply a function to data stored in Cassandra
  • User-defined aggregates: Using user-defined functions, custom aggregation functions can be stored in Cassandra
  • Role-based access control: In addition to per-user access control, now roles can be defined for role-based access control
  • Support for Windows 7, Windows 8, Windows 10, Windows Server 2008, and Windows Server 2012
  • The storage engine has been refactored
  • Materialized views: Materialized views have been added to handle server side denormalization, with consistency between base and view
  • G1 garbage collector: The default garbage collector has been changed to G1 from CMS (concurrent mark and sweep), which has markedly increased performance for higher values of JVM heap sizes
  • Hints are stored in files and replay has been improved

Note

Apart from these, there have been a plethora of operational improvements. You can take a look at them at https://docs.datastax.com/en/cassandra/3.x/cassandra/features.html.

 

Summary


In this chapter, you explored the reasons to choose Cassandra from among the many databases available, and having determined that Cassandra is a great choice, you installed it on your development machine.

You had your first taste of the Cassandra Query Language when you issued your first few commands via the CQL shell to create a keyspace and a table, and to insert and read data. You're now ready to begin working with Cassandra in earnest.

In the next chapter, we'll begin building the MyStatus application, starting out with a simple table to model users. We'll cover a lot more CQL commands, and before you know it, you'll be reading and writing data like a pro.
