Apache Cassandra: Libraries and Applications

You can mine deep into the full capabilities of Apache Cassandra using the 150+ recipes in this indispensable Cookbook. From configuring and tuning to using third party applications, this is the ultimate guide.


Cassandra High Performance Cookbook


Over 150 recipes to design and optimize large scale Apache Cassandra deployments



Cassandra's popularity has led to several pieces of software that have developed around it. Some of these are libraries and utilities that make working with Cassandra easier. Other software applications have been built completely around Cassandra to take advantage of its scalability. This article describes some of these utilities.

Building Cassandra from source

The Cassandra code base is active and typically has multiple branches. It is good practice to run official releases, but at times it may be necessary to use a feature or a bug fix that has not yet been released. Building and running Cassandra from source allows for a greater level of control over the environment. Having the source code also makes it possible to trace and understand the context of warning or error messages you may encounter. This recipe shows how to check out the Cassandra code from Subversion (SVN) and build it.

How to do it...

  1. Visit http://svn.apache.org/repos/asf/cassandra/branches with a web browser. Multiple sub folders will be listed:


    Each folder represents a branch. To check out the 0.6 branch:

    $ svn co http://svn.apache.org/repos/asf/cassandra/branches/

  2. Trunk is where most new development happens. To check out trunk:

    $ svn co http://svn.apache.org/repos/asf/cassandra/trunk/

  3. To build the release tar, move into the folder created and run:

    $ ant release

    This creates a release tar in build/apache-cassandra-0.6.5-bin.tar.gz, a release jar, and an unzipped version in build/dist.

How it works...

Subversion (SVN) is a revision control system commonly used to manage software projects. Subversion repositories are commonly accessed over HTTP, which allows for simple browsing. This recipe uses the command-line client to check out code from the repository.

Building the contrib stress tool for benchmarking

Stress is an easy-to-use command-line tool for stress testing and benchmarking Cassandra. It can be used to generate a large quantity of requests in short periods of time, and it can also be used to generate a large amount of data to test performance with. This recipe shows how to build it from the Cassandra source.

Getting ready

Before running this recipe, complete the Building Cassandra from source recipe discussed above.

How to do it...

From the source directory, run ant. Then, change to the contrib/stress directory and run ant again.

$ cd <cassandra_src>
$ ant jar
$ cd contrib/stress
$ ant jar
Total time: 0 seconds

How it works...

The build process compiles code into the stress.jar file.

Inserting and reading data with the stress tool

The stress tool is a multithreaded load tester specifically for Cassandra. It is a command-line program with a variety of knobs that control its operation. This recipe shows how to run the stress tool.

Before you begin...

See the previous recipe, Building the contrib stress tool for benchmarking, before doing this recipe.

How to do it...

Run the <cassandra_src>/bin/stress command to execute 10,000 insert operations.

$ bin/stress -d,, -n 10000 --operation INSERT
Keyspace already exists.

How it works...

The stress tool is an easy way to do load testing against a cluster. It can insert or read data and report on the performance of those operations. This is also useful in staging environments where significant volumes of disk data are needed to test at scale. Generating data is also useful to practice administration techniques such as joining new nodes to a cluster.

There's more...

It is best to run the load testing tool on a different node from the system being tested, and to remove anything else that causes unnecessary contention.

Running the Yahoo! Cloud Serving Benchmark

The Yahoo! Cloud Serving Benchmark (YCSB) provides a basis for comparing NoSQL systems. It works by generating random workloads with varying proportions of insert, get, delete, and other operations, and then uses multiple threads to execute them. This recipe shows how to build and run YCSB.

Information on the YCSB can be found here:


How to do it...

  1. Use the git tool to obtain the source code.

    $ git clone git://github.com/brianfrankcooper/YCSB.git

  2. Build the code using ant.

    $ cd YCSB/
    $ ant

  3. Copy the JAR files from your <cassandra_home>/lib directory to the YCSB classpath.

    $ cp $HOME/apache-cassandra-0.7.0-rc3-1/lib/*.jar db/
    $ ant dbcompile-cassandra-0.7

  4. Use the Cassandra CLI to create the required keyspace and column family.

    [default@unknown] create keyspace usertable with replication_
    [default@unknown] use usertable;
    [default@unknown] create column family data;

  5. Create a small shell script run.sh to launch the test with different parameters.

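    # Append each Cassandra JAR to the classpath (CP must also include the YCSB build output)
    for i in db/cassandra-0.7/lib/*.jar ; do
      CP=$CP:$i
    done

    java -cp $CP com.yahoo.ycsb.Client -t -db com.yahoo.ycsb.db.CassandraClient7 -P workloads/workloadb \
    -p recordcount=10 \
    -p hosts=, \
    -p operationcount=10 \
    -s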

  6. Run the script and pipe the output to the more command to control pagination:

    $ sh run.sh | more
    YCSB Client 0.1
    Command line: -t -db com.yahoo.ycsb.db.CassandraClient7 -P
    workloads/workloadb -p recordcount=10 -p hosts=,
    -p operationcount=10 -s
    Loading workload...
    Starting test.
    0 sec: 0 operations;
    0 sec: 10 operations; 64.52 current ops/sec; [UPDATE
    AverageLatency(ms)=30] [READ AverageLatency(ms)=3]
    [OVERALL], RunTime(ms), 152.0
    [OVERALL], Throughput(ops/sec), 65.78947368421052
    [UPDATE], Operations, 1
    [UPDATE], AverageLatency(ms), 30.0
    [UPDATE], MinLatency(ms), 30
    [UPDATE], MaxLatency(ms), 30
    [UPDATE], 95thPercentileLatency(ms), 30
    [UPDATE], 99thPercentileLatency(ms), 30
    [UPDATE], Return=0, 1

How it works...

YCSB has many configuration knobs. An important option is -P, which chooses the workload file. The workload describes the proportions of read, write, and update operations. The -p option overrides individual options from the workload file. YCSB is designed to test how performance changes as the number of nodes grows or shrinks, that is, as the system scales out.
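The precedence between the workload file and the -p option can be pictured with plain java.util.Properties. This is a minimal sketch and not YCSB's own code: the class WorkloadOptions and its merge method are hypothetical names, and the property values are only illustrative.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper (not part of YCSB): shows how command-line -p values
// can take precedence over values read from a workload file.
class WorkloadOptions {

    // Loads the workload text, then applies each "key=value" override on top.
    static Properties merge(String workloadText, String... overrides) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(workloadText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String override : overrides) {
            int eq = override.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("expected key=value: " + override);
            }
            props.setProperty(override.substring(0, eq), override.substring(eq + 1));
        }
        return props;
    }

    public static void main(String[] args) {
        // Illustrative workload contents; a real workload file has more keys.
        String workload = "recordcount=1000\noperationcount=1000\nreadproportion=0.95\n";
        Properties p = merge(workload, "recordcount=10", "operationcount=10");
        System.out.println("recordcount=" + p.getProperty("recordcount"));
        System.out.println("readproportion=" + p.getProperty("readproportion"));
    }
}
```

Keys that are not overridden keep the value from the workload text, which mirrors how the run.sh script above only changes recordcount, hosts, and operationcount.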

There's more...

Cassandra has historically been one of the strongest performers in the YCSB.


Hector, a high-level client for Cassandra

It is suggested that when available, clients should use a higher level API. Hector is one of the most actively developed higher level clients. It works as a facade over the Thrift API, and in many cases condenses what is a large section of Thrift code into a shorter version using Hector's helper methods and design patterns. This recipe shows how to use Hector to communicate with Cassandra.

How to do it...

Download the Hector JAR and place it in your application's classpath.

$ wget https://github.com/downloads/rantav/hector/hector-core-0.7.0-26.tgz
$ cp hector-core* <hpc_build>/lib

Open <hpc_build>/src/hpcas/c10/HectorExample.java in a text editor.

public class HectorExample {

Hector uses serializers. The role of a Serializer is to take the encoding burden away from the user. Internally, the StringSerializer does something similar to "string".getBytes("UTF-8").

private static StringSerializer stringSerializer = StringSerializer.get();
public static void main(String[] args) throws Exception {

Hector has its own client-side load balancing. The host list passed to HFactory.getOrCreateCluster can be one or more host:port pairs separated by commas.

Cluster cluster = HFactory.getOrCreateCluster("TestCluster", Util.envOrProp("targetsHost"));
Keyspace keyspaceOperator = HFactory.createKeyspace(Util.envOrProp("ks33"), cluster);

The HFactory object has several factory methods. HFactory.createStringColumn is a one-liner for creating columns. This is an alternative to working with the Column in a JavaBean-like style.

Mutator<String> mutator = HFactory.createMutator(keyspaceOperator, StringSerializer.get());
mutator.insert("bbrown", "cf33", HFactory.createStringColumn("first", "Bob"));

One way to read data is by using a ColumnQuery object. ColumnQuery uses a builder pattern in which set operations return the ColumnQuery instance instead of void.

ColumnQuery<String, String, String> columnQuery =
QueryResult<HColumn<String, String>> result = columnQuery.execute();
System.out.println("Resulting column from cassandra: " + result.

How it works...

Hector provides a few key things. Firstly, remember that the bindings generated by Thrift are cross-platform and designed for compatibility. Higher level clients such as Hector bring more abstraction and take more advantage of language features such as Java's generics. For example, the HFactory class provides methods that reduce four lines of Thrift code to a single line factory method call. Hector also provides client-side load balancing because detecting and automatically failing-over between servers is important to achieve good uptime.
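The serializer idea mentioned in this recipe can be illustrated without Hector on the classpath. The following is a rough sketch of a UTF-8 round trip between a String and a ByteBuffer; Utf8StringCodec is a hypothetical class written for this example and is not Hector's StringSerializer source.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative stand-in for a string serializer; this is not Hector's code.
class Utf8StringCodec {

    // Encodes a string as UTF-8 bytes wrapped in a ByteBuffer.
    static ByteBuffer toByteBuffer(String value) {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes the remaining bytes of the buffer without moving its position.
    static String fromByteBuffer(ByteBuffer buffer) {
        ByteBuffer copy = buffer.duplicate();
        byte[] bytes = new byte[copy.remaining()];
        copy.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer encoded = toByteBuffer("Bob");
        System.out.println("bytes: " + encoded.remaining());
        System.out.println("decoded: " + fromByteBuffer(encoded));
    }
}
```

Keeping the encoding in one place means application code passes String values around and never has to repeat the character set name.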

Doing batch mutations with Hector

Batch mutations are much more efficient than individual inserts. However, the long, complex signature of the batch_mutate method is difficult to read, and assembling that structure may clutter code. This recipe shows how to use the Hector API for the batch mutate operation.

How to do it...

  1. Create a text file <hpc_build>/src/hpcas/c10/HectorBatchMutate.java.

    public class HectorBatchMutate {
    // static so that it can be used from main()
    static final StringSerializer serializer = StringSerializer.get();
    public static void main(String[] args) throws Exception {
    Cluster cluster = HFactory.getOrCreateCluster("test", Util.
    Keyspace keyspace = HFactory.createKeyspace(Util.
    envOrProp("ks"), cluster);

  2. Create a mutator as you would for a single insert and make multiple calls to the addInsertion method.

    Mutator<String> m = HFactory.createMutator(keyspace, serializer);
    m.addInsertion("keyforbatchload", Util.envOrProp("ks"),
    HFactory.createStringColumn("age", "30"));
    m.addInsertion("keyforbatchload", Util.envOrProp("ks"),
    HFactory.createStringColumn("weight", "190"));

    The writes are not sent to Cassandra until the execute method is called.

    m.execute();


How it works...

Hector's mutator concept is more straightforward than the elaborate nested object needed to execute a batch mutation through Thrift. Writing fewer lines of code to carry out a task is better in numerous ways, as there is less code to review and less chance to make a mistake.
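The "queue insertions, then execute once" pattern can be sketched in a few lines of plain Java. PendingBatch below is a hypothetical class for illustration only; it does not talk to Cassandra and is not how Hector's Mutator is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "queue, then execute" pattern; not Hector's Mutator.
class PendingBatch {

    private final List<String> pending = new ArrayList<>();
    private int roundTrips = 0;

    // Queues one insertion locally; nothing is sent yet.
    PendingBatch addInsertion(String key, String columnFamily, String name, String value) {
        pending.add(key + "/" + columnFamily + "/" + name + "=" + value);
        return this;
    }

    // Flushes everything queued so far as one batch and returns how many
    // insertions it contained. A real client would contact the server here.
    int execute() {
        int sent = pending.size();
        if (sent > 0) {
            roundTrips++;
        }
        pending.clear();
        return sent;
    }

    int roundTrips() {
        return roundTrips;
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        PendingBatch batch = new PendingBatch();
        batch.addInsertion("keyforbatchload", "cf", "age", "30")
             .addInsertion("keyforbatchload", "cf", "weight", "190");
        System.out.println("queued: " + batch.pendingCount());
        System.out.println("sent: " + batch.execute());
        System.out.println("round trips: " + batch.roundTrips());
    }
}
```

Two insertions cost a single simulated round trip, which is the saving the recipe is after.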

Cassandra with Java Persistence API (JPA)

Data in memory being used by an application is typically in a different format than its on-disk representation. Serialization and deserialization convert data between its in-memory form and the form persisted to a back-end data store. This work can be done by hand. The Java Persistence API (JPA) allows you to annotate a Java object and have the serialization and deserialization handled automatically. This recipe shows how to use JPA annotations to persist data to Cassandra.

Before you begin...

This recipe requires the mvn command provided by the maven2 package.

How to do it...

  1. Use Subversion to download the Kundera source code and Maven to build it.

    $ svn checkout http://kundera.googlecode.com/svn/trunk/ kundera-read-only
    $ cd kundera-read-only
    $ mvn install

  2. Create a text file <hpc_build>/src/hpcas/c10/Athlete.java.

    package hpcas.c10;

  3. Apply the Entity annotation. Then, use the columnFamily annotation and supply the column family name.

    @Index (index=false)
    public class Athlete {

  4. Use the @Id annotation to signify the row key.

    @Id
    String username;

  5. Any field with the @Column annotation will be persisted. Optionally, you can supply a string for the column name.

    @Column(name = "email")
    String emailAddress;
    String country;
    public Athlete() {
    ... //bean patterns

  6. Kundera can configure itself from a Java properties file or from a Map defined in your code. Create a file <hpc_build>/src/hpcas/c10/AthleteDemo.java.

    public class AthleteDemo {
    public static void main (String [] args) throws Exception {
    Map map = new HashMap();
    map.put("kundera.nodes", "localhost");
    map.put("kundera.port", "9160");
    map.put("kundera.keyspace", "athlete");
    map.put("sessionless", "false");
    map.put("kundera.client", "com.impetus.kundera.client.

  7. EntityManager instances are created from a factory pattern.

    EntityManagerFactory factory = new EntityManagerFactoryImpl("test", map);
    EntityManager manager = factory.createEntityManager();

  8. Use the find() method to look up by key. All annotated fields of the object are automatically populated with data from Cassandra.

    try {
    Athlete athlete = manager.find(Athlete.class, "bsmith");
    } catch (PersistenceException pe) {

How it works...

JPA provides methods such as find and remove. JPA takes the burden off the developer of writing mostly repetitive serialization code. While this does remove some of the burden, it also takes away some level of control. Since JPA can provide access to many types of data stores such as relational databases, it also makes it easy to switch between backend storage without having to make large code changes.
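To make the idea of annotation-driven persistence more concrete, here is a small sketch that uses reflection to turn annotated fields into column name and value pairs. Everything in it is hypothetical: StoredColumn is a stand-in annotation defined in the example itself so that no JPA or Kundera JARs are needed, and AnnotationMapper is not how Kundera works internally.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical annotation standing in for a JPA-style @Column.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface StoredColumn {
    String name() default "";
}

// Example entity; only annotated fields are mapped.
class SketchAthlete {
    String username; // not annotated, so the mapper below skips it
    @StoredColumn(name = "email") String emailAddress;
    @StoredColumn String country;

    SketchAthlete(String username, String emailAddress, String country) {
        this.username = username;
        this.emailAddress = emailAddress;
        this.country = country;
    }
}

class AnnotationMapper {

    // Collects every annotated field into a column-name -> value map.
    static Map<String, String> toColumns(Object entity) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (Field field : entity.getClass().getDeclaredFields()) {
            StoredColumn column = field.getAnnotation(StoredColumn.class);
            if (column == null) {
                continue;
            }
            // Fall back to the field name when no column name is supplied.
            String name = column.name().isEmpty() ? field.getName() : column.name();
            try {
                field.setAccessible(true);
                Object value = field.get(entity);
                columns.put(name, value == null ? null : value.toString());
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        SketchAthlete athlete = new SketchAthlete("bsmith", "bsmith@example.com", "US");
        System.out.println(toColumns(athlete));
    }
}
```

The developer writes only the annotated class; the repetitive field-by-field conversion lives in one generic routine, at the cost of less direct control over how each value is stored.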

There's more...

Hector also offers a JPA solution in a subproject called hector-object-mapper.

Setting up Solandra for full text indexing with a Cassandra backend

Solandra is a combination of Lucene, Cassandra, and Solr. Lucene is an inverted index system designed for full text search. Solr is a popular frontend that provides a web service for Lucene as well as cache warming and other advanced capabilities. Solandra integrates with both tools by storing Lucene's data inside Cassandra, allowing for a high level of scalability.

How to do it...

  1. Use git to obtain a copy of the Solandra source code and use ant to build it.

    $ git clone https://github.com/tjake/Solandra.git
    $ cd Solandra
    $ ant

  2. Prepare a temporary directory that Solandra will use to store data. Then, run these steps to start Solandra, download, and load sample data.

    $ mkdir /tmp/cassandra-data
    $ cd solandra-app; ./start-solandra.sh -b
    $ cd ../reuters-demo/
    $ ./1-download-data.sh
    $ ./2-import-data.sh

  3. Open ./website/index.html in your web browser. Place text in the search box to search for occurrences of it inside the documents loaded into Solandra.

How it works...

Solandra takes the data that Solr would normally store on local disk and instead stores it inside Cassandra. It does this by providing custom implementations of Lucene's IndexReader and IndexWriter, and it runs Solr and Cassandra inside the same JVM. Solandra stores this data using OrderPreservingPartitioner because Lucene supports searching for ranges of terms (for example, albert to apple). Solandra provides a natural way to scale Solr. Applications can read data as soon as it is written.
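Why key ordering matters for term ranges can be shown with a sorted map. This sketch is only an analogy under stated assumptions: a TreeMap stands in for rows kept in key order, the class and method names are hypothetical, and no Cassandra or Lucene code is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Analogy only: a TreeMap keeps keys sorted, as an order-preserving
// partitioner keeps row keys sorted, so a term range is one contiguous slice.
class TermRangeSketch {

    // Returns the terms in [from, to], in sorted order.
    static List<String> termsBetween(TreeMap<String, Integer> index, String from, String to) {
        return new ArrayList<>(index.subMap(from, true, to, true).keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> index = new TreeMap<>();
        // term -> number of documents containing it (made-up values)
        index.put("zebra", 1);
        index.put("apple", 7);
        index.put("albert", 2);
        index.put("banana", 4);
        index.put("alpha", 3);
        System.out.println(termsBetween(index, "albert", "apple"));
    }
}
```

With a hash-based layout the same query would have to examine every key, because neighbouring terms would be scattered.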


Setting up Zookeeper to support Cages for transactional locking

The Cages API provides distributed read and write locks. It is built around Apache Zookeeper. This recipe shows how to set up a single instance of Zookeeper to support Cages.

Apache ZooKeeper is an effort to develop and maintain an open source server that enables highly reliable distributed coordination.

How to do it...

  1. Download a binary Apache Zookeeper release and extract it.

    $ wget http://apache.cyberuse.com//hadoop/zookeeper/zookeeper-3.3.2/zookeeper-3.3.2.tar.gz
    $ tar -xf zookeeper-3.3.2.tar.gz
    $ cd zookeeper-3.3.2

  2. Create a configuration file from the sample. Make sure to set the dataDir.

    $ cp conf/zoo_sample.cfg conf/zoo.cfg

  3. Create the dataDir directory you referenced in the preceding configuration.

    $ mkdir /tmp/zk

  4. Start the zookeeper instance.

    $ bin/zkServer.sh start
    JMX enabled by default
    Using config: /home/edward/cassandra-dev/zookeeper-3.3.2/bin/../
    Starting zookeeper ...

  5. Confirm Zookeeper is running by checking for a process listening on the defined client port 2181.

    $ netstat -an | grep 2181
    tcp 0 0 :::2181 :::*

How it works...

Apache Zookeeper provides applications with distributed synchronization. It is typically installed on one to seven nodes so it is highly available and capable of managing a large number of locks and watches. Cassandra and Zookeeper are an interesting pairing: Cassandra providing high availability and high performance with Zookeeper providing synchronization.

Using Cages to implement an atomic read and set

In the previous recipe, we set up Apache Zookeeper, a system for distributed synchronization. The Cages library provides a simple API to synchronize access to rows. This recipe shows how to use Cages.

Getting ready

To do this recipe, you must complete the previous recipe, Setting up Zookeeper to support Cages for transactional locking.

How to do it...

  1. Use Subversion to check out a copy of the Cages source code and binary JAR.

    $ svn checkout http://cages.googlecode.com/svn/trunk/ cages-read-only

  2. Copy the cages and zookeeper JARs to the library directory of the build root.

    $ cp cages-read-only/Cages/build/cages.jar <hpc_build>/lib/
    $ cp zookeeper-3.3.2/zookeeper-3.3.2.jar <hpc_build>/lib

  3. Add imports for the Cages and Zookeeper packages and classes to <hpc_build>/src/java/hpcas/c05/ShowConcurrency.java.

    import org.apache.cassandra.thrift.*;
    import org.wyki.zookeeper.cages.ZkSessionManager;
    import org.wyki.zookeeper.cages.ZkWriteLock;

  4. Next, add a reference to the ZkSessionManager, the object used to connect to Zookeeper.

    public class ShowConcurrency implements Runnable {
    ZkSessionManager session;
    String host;

  5. Initialize the session instance in the constructor.

    public ShowConcurrency(String host, int port, int inserts) {
    this.host = host;
    this.port = port;
    this.inserts = inserts;
    try {
    session = new ZkSessionManager("localhost");
    } catch (Exception ex) {
    System.out.println("could not connect to zookeeper "+ex);

  6. Zookeeper has a hierarchical data model. The keyspace represents the top directory and the column family represents the second level. The third level is the row key to be locked. After instantiating the lock object, use the acquire() method, perform the operations inside the critical section, and when done working with the lock, call release().

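    for (int i = 0; i < inserts; i++) {
    ZkWriteLock lock = new ZkWriteLock("/ks33/cf33/count_col");
    try {
    lock.acquire(); // wait until the write lock is held
    int x = getValue(client);
    setValue(client, x);
    } finally {
    lock.release(); // always release, even if the update fails
    }
    }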

  7. Run hpcas.c05.ShowConcurrency using four threads doing 30 inserts each.

    $ host= port=9160 inserts=30 threads=4 ant -DclassToRun=hpcas.c05.ShowConcurrency run
    [java] wrote 119
    [java] read 119
    [java] wrote 120
    [java] read 120
    [java] The final value is 120

How it works...

Cages and Zookeeper provide a way for external processes to synchronize. When each thread is initialized, it opens a Zookeeper session. The critical section of the code reads, increments, and finally updates a column. Surround the critical section of the code with a Zookeeper Write Lock that prevents all other threads from updating this value while the current thread operates on it.
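The read, increment, and update cycle can be tried locally without Zookeeper. In the sketch below a java.util.concurrent ReentrantLock is swapped in for the ZkWriteLock and an int field stands in for the Cassandra column, so it only synchronizes threads inside one JVM; the class and method names are hypothetical.

```java
import java.util.concurrent.locks.ReentrantLock;

// Local stand-in for the recipe: a ReentrantLock plays the role of the
// Zookeeper write lock, and an int field plays the role of the column.
class LockedCounterSketch {

    private final ReentrantLock lock = new ReentrantLock();
    private int stored = 0; // stands in for the column value in Cassandra

    // Reads, increments, and writes back the value inside the critical section.
    void readAndSet() {
        lock.lock();
        try {
            int x = stored;  // read
            stored = x + 1;  // increment and write back
        } finally {
            lock.unlock();
        }
    }

    int value() {
        lock.lock();
        try {
            return stored;
        } finally {
            lock.unlock();
        }
    }

    // Runs the given number of threads, each doing the given number of inserts.
    static int run(int threads, int inserts) {
        LockedCounterSketch counter = new LockedCounterSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < inserts; i++) {
                    counter.readAndSet();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.value();
    }

    public static void main(String[] args) {
        System.out.println("The final value is " + run(4, 30));
    }
}
```

Because every update happens while the lock is held, four threads doing 30 updates each always end at 120, matching the final value in the recipe's output. Without the lock, two threads could read the same value and one increment would be lost.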

There's more...

Synchronization incurs extra overhead; it should only be used when necessary. Zookeeper does scale out to several nodes, but it does not scale out indefinitely. This is because writes to Zookeeper have to be synchronized across all nodes.

Using Groovandra as a CLI alternative

Groovy is an agile and dynamic language for the Java Virtual Machine. Groovandra is a library designed to work with Groovy for rapid exploration of data in Cassandra. It can be used for tasks that the Cassandra CLI cannot do and for which coding and deploying a Java application may not make much sense. Code can be written line by line or in Groovy scripts that do not need to be compiled and packaged before running.

How to do it...

  1. Download a release of Groovy and extract it.

    $ wget http://dist.groovy.codehaus.org/distributions/groovy-binary-1.8.0.zip
    $ unzip groovy-binary-1.8.0.zip

  2. Create a startup script that adds the JAR files in the cassandra/lib and the groovandra.jar to the classpath and then starts Groovy.

    $ vi groovy-1.8.0/bin/groovycassandra
    CASSANDRA_HOME=/home/edward/
    for i in ${CASSANDRA_HOME}/*.jar ; do
      CLASSPATH=${CLASSPATH}:$i
    done
    export CLASSPATH
    $ chmod a+x groovy-1.8.0/bin/groovycassandra

  3. Start the Groovy shell.

    $ sh <groovy_home>/bin/groovycassandra
    groovy:000> bean=new com.jointhegrid.groovandra.GroovandraBean()
    ===> com.jointhegrid.groovandra.GroovandraBean@6a69ed4a
    groovy:000> bean.doConnect("localhost",9160);
    ===> com.jointhegrid.groovandra.GroovandraBean@6a69ed4a
    groovy:000> bean.withKeyspace("mail").showKeyspace()
    ===> KsDef(name:mail, strategy_class:org.apache.cassandra.locator.SimpleStrategy, replication_factor:1, ...

How it works...

Groovandra is a simple way to interact with Cassandra without having to go through the steps of compiling, deploying, and running Java applications. Groovy allows users to approach the application line by line. This allows ad hoc programming and debugging and is helpful for accessing the features of Cassandra that are not accessible from the CLI such as setting up a call to the multiget_slice method, which requires numerous parameters to be set.

Searchable log storage with Logsandra

Logsandra is a project that provides a set of tools to parse logs, store them in Cassandra in a searchable fashion, and search for or graph the occurrence of keywords in logs. It includes two processes. The first parses logs and stores them in Cassandra. The second runs a web server that allows you to search for occurrences of keywords in logs or graph their frequency.

Getting ready

Logsandra needs a running instance of Cassandra to connect to and store data. This recipe also requires Python and the Python installer pip.

$ yum install python python-pip

How to do it...

  1. Obtain a copy of the Logsandra source code using git and install Logsandra's dependencies using pip.

    $ git clone git://github.com/thobbs/logsandra.git
    $ cd logsandra

  2. Elevate to root to install the requirements and then drop back to a standard user.

    $ su
    # cat requirements.txt | xargs pip-python install
    # python setup.py install
    # exit

  3. Next, set up Logsandra's keyspace and load sample data.

    $ python scripts/create_keyspace.py
    $ python scripts/load_sample_data.py

    Loading sample data for the following keywords: foo, bar, baz

  4. Start the web server.

    $ ./logsandra-httpd.py start

  5. Open http://localhost:5000/ and search for 'foo', which was added by load_sample_data.py.

    Logsandra presents a graph with occurrences of this keyword over time.

How it works...

Logsandra creates and uses a keyspace named logsandra with a column family inside it named keyword. It primarily retrieves events by looking up all logs containing a keyword within a range of time. To make this efficient, the event timeline is denormalized to produce one timeline per keyword. For each keyword that appears in a log, a separate copy of the log event is appended to the corresponding timeline. Each timeline gets its own row, and within the row, each column holds one log event. The columns are sorted chronologically, using time-based UUIDs for column names to avoid clashes. Although this denormalization strategy uses more space on disk, a lookup query by Logsandra only reads a single contiguous portion of one row in Cassandra, which is very efficient.

[default@logsandra] show keyspaces;
Keyspace: logsandra:

Column Families:
ColumnFamily: keyword
Columns sorted by: org.apache.cassandra.db.marshal.TimeUUIDType

Logsandra shows a versatile way to store and access log data in Cassandra. It is also important to note that Logsandra is written in Python, which demonstrates adoption of Cassandra outside the Java world.
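The one-timeline-per-keyword layout described above can be modelled in memory. This is a rough sketch and not Logsandra's code (which is Python): the class name is hypothetical, plain millisecond timestamps replace TimeUUID column names, and events that share a timestamp are kept in a list rather than separated by unique IDs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of the denormalized layout: one sorted timeline per keyword.
class KeywordTimelineSketch {

    // keyword (row) -> event time in ms (column) -> log lines at that time
    private final Map<String, TreeMap<Long, List<String>>> rows = new HashMap<>();

    // Appends one copy of the log line to the timeline of every keyword.
    void store(long timeMillis, String logLine, String... keywords) {
        for (String keyword : keywords) {
            rows.computeIfAbsent(keyword, k -> new TreeMap<>())
                .computeIfAbsent(timeMillis, t -> new ArrayList<>())
                .add(logLine);
        }
    }

    // Reads one contiguous time slice [from, to] from a single keyword row.
    List<String> search(String keyword, long from, long to) {
        List<String> hits = new ArrayList<>();
        TreeMap<Long, List<String>> timeline = rows.get(keyword);
        if (timeline != null) {
            for (List<String> lines : timeline.subMap(from, true, to, true).values()) {
                hits.addAll(lines);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        KeywordTimelineSketch store = new KeywordTimelineSketch();
        store.store(100L, "foo failed", "foo");
        store.store(200L, "foo and bar", "foo", "bar");
        store.store(300L, "bar only", "bar");
        System.out.println(store.search("foo", 0L, 250L));
    }
}
```

The event containing both keywords is stored twice, which is the extra disk space the text mentions, and each search touches only one keyword's timeline.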

There's more...

Inside the logsandra/conf directory, the logsandra.yaml file can be used to control which host and port the Logsandra web interface binds to, host and port information to connect to the Cassandra cluster, and directives that instruct it as to which folders to watch for log events.


This article introduced tools that make coding easier, such as the high-level client Hector and the object mapping tool Kundera. Its recipes also showed how to set up and use applications built on top of Cassandra, such as the full text search engine Solandra.
