Cassandra High Performance Cookbook

By Edward Capriolo

About this book

Apache Cassandra is a fault-tolerant, distributed data store which offers linear scalability allowing it to be a storage platform for large high volume websites.

This book provides detailed recipes that describe how to use the features of Cassandra and improve its performance. Recipes cover topics ranging from setting up Cassandra for the first time to complex multiple data center installations. The recipe format presents the information in a concise actionable form.

The book describes in detail how features of Cassandra can be tuned and what the possible effects of tuning can be. Recipes include how to access data stored in Cassandra and use third party tools to help you out. The book also describes how to monitor and do capacity planning to ensure it is performing at a high level. Towards the end, it takes you through the use of libraries and third party applications with Cassandra and Cassandra integration with Hadoop.

Publication date: July 2011
Publisher: Packt
Pages: 324
ISBN: 9781849515122

Chapter 1. Getting Started

In this chapter, you will learn the following recipes:

A simple single node Cassandra installation
Reading and writing test data using the command-line interface
Running multiple instances on a single machine
Scripting a multiple instance installation
Setting up a build and test environment for tasks in this book
Running the server in the foreground with full debugging
Calculating ideal Initial Tokens for use with Random Partitioner
Choosing Initial Tokens for use with Order Preserving Partitioners
Connecting to Cassandra with JConsole
Using JConsole to connect over a SOCKS proxy
Connecting to Cassandra with Java and Thrift

Introduction

The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together a fully distributed design and a ColumnFamily-based data model. The chapter contains recipes that allow users to hit the ground running with Cassandra. We show several recipes to set up Cassandra. These include cursory explanations of the key configuration files. It also contains recipes for connecting to Cassandra and executing commands both from the application programmer interface and the command-line interface. Also described are the Java profiling tools such as JConsole. The recipes in this chapter should help the user understand the basics of running and working with Cassandra.

A simple single node Cassandra installation

Cassandra is a highly scalable distributed database. While it is designed to run on multiple production class servers, it can be installed on desktop computers for functional testing and experimentation. This recipe shows how to set up a single instance of Cassandra.

Getting ready

Visit http://cassandra.apache.org in your web browser and find a link to the latest binary release. New releases happen often. For reference, this recipe will assume apache-cassandra-0.7.2-bin.tar.gz was the name of the downloaded file.

How to do it...

Download a binary version of Cassandra:

$ mkdir $HOME/downloads
$ cd $HOME/downloads
$ wget <url_from_getting_ready>/apache-cassandra-0.7.2-bin.tar.gz

Choose a base directory that the user running Cassandra has read and write access to:
Note
Default Cassandra storage locations
By default, Cassandra saves data in /var/lib/cassandra and logs in /var/log/cassandra. These locations will likely not exist and will require root-level privileges to create. To avoid permission issues, carry out the installation in user-writable directories.

Create a cassandra directory in your home directory. Inside the cassandra directory, create commitlog, log, saved_caches, and data subdirectories:

$ mkdir $HOME/cassandra/
$ mkdir $HOME/cassandra/{commitlog,log,data,saved_caches}
$ cd $HOME/cassandra/
$ cp $HOME/downloads/apache-cassandra-0.7.2-bin.tar.gz .
$ tar -xf apache-cassandra-0.7.2-bin.tar.gz

Use the echo command to display the path to your home directory. You will need this when editing the configuration file:
```
$ echo $HOME
/home/edward
```
This tar file extracts to the apache-cassandra-0.7.2 directory. Open the conf/cassandra.yaml file inside that directory in your text editor and make changes to the following sections:
```
  data_file_directories:
  - /home/edward/cassandra/data
  commitlog_directory: /home/edward/cassandra/commitlog
  saved_caches_directory: /home/edward/cassandra/saved_caches
```
Edit the $HOME/cassandra/apache-cassandra-0.7.2/conf/log4j-server.properties file to change the directory where logs are written:
```
log4j.appender.R.File=/home/edward/cassandra/log/system.log
```

Start the Cassandra instance and confirm it is running by connecting with nodetool:

$ $HOME/cassandra/apache-cassandra-0.7.2/bin/cassandra

 INFO 17:59:26,699 Binding thrift service to /127.0.0.1:9160 
 INFO 17:59:26,702 Using TFramedTransport with a max frame size of 15728640 bytes.
$ $HOME/cassandra/apache-cassandra-0.7.2/bin/nodetool --host 127.0.0.1 ring
Address     Status   State   Load            Token                                       
127.0.0.1      Up     Normal  385 bytes       398856952452...

How it works...

Cassandra comes as a compiled Java application in a tar file. By default, it is configured to store data inside /var. After the options in the cassandra.yaml configuration file are changed, Cassandra uses the directories created in this recipe instead.

Note

YAML: YAML Ain't Markup Language

YAML™ (rhymes with "camel") is a human-friendly, cross-language, Unicode-based data serialization language designed around the common native data types of agile programming languages. It is broadly useful for programming needs ranging from configuration files and Internet messaging to object persistence and data auditing.

See http://www.yaml.org for more information.

After startup, Cassandra detaches from the console and runs as a daemon. It opens several ports, including the Thrift port on 9160 and the JMX port on 8080. For Cassandra 0.8.X and higher, the default JMX port is 7199. The nodetool program communicates with the JMX port to confirm that the server is alive.
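
As a rough illustration of checking those ports from code, the following sketch tries to open a TCP connection and reports whether something is listening. The class name PortCheck and the one-second timeout are assumptions made for this example; the port numbers are the ones named in this recipe. A successful connection only shows that some process accepts connections on the port, not that it is Cassandra.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean isListening(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }

    public static void main(String[] args) {
        // Ports named in this recipe: Thrift on 9160 and JMX on 8080.
        System.out.println("Thrift 9160 listening: " + isListening("127.0.0.1", 9160, 1000));
        System.out.println("JMX 8080 listening: " + isListening("127.0.0.1", 8080, 1000));
    }
}
```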

There's more...

Due to the distributed design, many features can only be exercised with multiple instances of Cassandra running. For example, with a single node you cannot experiment with a Replication Factor, the setting that controls how many nodes data is stored on, larger than one. Replication Factor dictates which Consistency Level settings can be used. With one node, the highest Consistency Level is ONE.
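
The relationship between Replication Factor and Consistency Level can be sketched with a little arithmetic. The class and method names below are made up for this example. It assumes the usual definition of QUORUM as a majority of replicas (Replication Factor divided by two, rounded down, plus one), and that a Replication Factor is only meaningful when it does not exceed the number of nodes.

```java
class ConsistencyMath {
    /** Replicas that must respond for QUORUM: a majority of the replication factor. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** Each replica needs its own node, so the factor cannot exceed the node count. */
    static boolean fitsCluster(int replicationFactor, int nodes) {
        return replicationFactor >= 1 && replicationFactor <= nodes;
    }

    public static void main(String[] args) {
        for (int rf = 1; rf <= 5; rf++) {
            System.out.println("RF=" + rf + " ONE=1 QUORUM=" + quorum(rf) + " ALL=" + rf
                + " fits a single node: " + fitsCluster(rf, 1));
        }
    }
}
```

With a Replication Factor of one, every level collapses to a single replica, which is why a one-node installation cannot show the differences between levels.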

Reading and writing test data using the command-line interface

The command-line interface (CLI) gives users an interactive tool to communicate with the Cassandra server and execute the same operations that can be done from client code. This recipe takes you through all the steps required to insert and read data.

How to do it...

Start the Cassandra CLI and connect to an instance:

$ <cassandra_home>/bin/cassandra-cli
[default@unknown] connect 127.0.0.1/9160;
Connected to: "Test Cluster" on 127.0.0.1/9160

New clusters do not have any preexisting keyspaces or column families. These need to be created so data can be stored in them:

[default@unknown] create keyspace testkeyspace;
[default@unknown] use testkeyspace;
Authenticated to keyspace: testkeyspace
[default@testkeyspace] create column family testcolumnfamily;

Insert and read back data using the set and get commands:

[default@testk..] set testcolumnfamily['thekey']['thecolumn']='avalue';       
Value inserted.
[default@testkeyspace] assume testcolumnfamily validator as ascii; 
[default@testkeyspace] assume testcolumnfamily comparator as ascii; 
[default@testkeyspace] get testcolumnfamily['thekey'];            
=> (column=thecolumn, value=avalue, timestamp=1298580528208000)

How it works...

The CLI is a helpful interactive facade on top of the Cassandra API. After connecting, users can carry out administrative or troubleshooting tasks.

Running multiple instances on a single machine

Cassandra is typically deployed on clusters of multiple servers. While it can be run on a single node, simulating a production cluster of multiple nodes is best done by running multiple instances of Cassandra. This recipe is similar to A simple single node Cassandra installation earlier in this chapter. However, in order to run multiple instances on a single machine, we create a separate set of directories and a modified set of configuration files for each node.

How to do it...

Ensure your system has proper loopback address support. Each system should have the entire range of 127.0.0.1-127.255.255.255 configured as localhost for loopback. Confirm this by pinging 127.0.0.1 and 127.0.0.2:

$ ping -c 1 127.0.0.1 
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. 
64 bytes from 127.0.0.1: icmp_req=1 ttl=64 time=0.051 ms 
$ ping -c 1 127.0.0.2 
PING 127.0.0.2 (127.0.0.2) 56(84) bytes of data. 
64 bytes from 127.0.0.2: icmp_req=1 ttl=64 time=0.083 ms

Use the echo command to display the path to your home directory. You will need this when editing the configuration file:
```
$ echo $HOME
/home/edward
```

Create a hpcas directory in your home directory. Inside the hpcas directory, create commitlog, log, saved_caches, and data subdirectories:

$ mkdir $HOME/hpcas/
$ mkdir $HOME/hpcas/{commitlog,log,data,saved_caches}
$ cd $HOME/hpcas/
$ cp $HOME/downloads/apache-cassandra-0.7.2-bin.tar.gz .

$ tar -xf apache-cassandra-0.7.2-bin.tar.gz

With the binary distribution of Cassandra downloaded and extracted, rename the directory by appending '1' to the end of its name:

$ mv apache-cassandra-0.7.2 apache-cassandra-0.7.2-1

Open apache-cassandra-0.7.2-1/conf/cassandra.yaml in a text editor. Change the default storage locations and IP addresses so that multiple instances on the same machine do not clash with each other:
```
  data_file_directories:
      - /home/edward/hpcas/data/1
  commitlog_directory: /home/edward/hpcas/commitlog/1
  saved_caches_directory: /home/edward/hpcas/saved_caches/1
  listen_address: 127.0.0.1
  rpc_address: 127.0.0.1
```
Each instance will have a separate logfile. This will aid in troubleshooting. Edit conf/log4j-server.properties:
```
log4j.appender.R.File=/home/edward/hpcas/log/system1.log
```
Cassandra uses JMX (Java Management Extensions), which allows you to configure an explicit port but always binds to all interfaces on the system. As a result, each instance will require its own management port. Edit cassandra-env.sh:
```
JMX_PORT=8001
```

Start this instance:

$ ~/hpcas/apache-cassandra-0.7.2-1/bin/cassandra

 INFO 17:59:26,699 Binding thrift service to /127.0.0.101:9160 
 INFO 17:59:26,702 Using TFramedTransport with a max frame size of 15728640 bytes. 
$ bin/nodetool --host 127.0.0.1 --port 8001 ring

 Address       Status State   Load            Token                
127.0.0.1      Up     Normal  385 bytes       398856952452...

At this point your cluster consists of a single node. To join other nodes to the cluster, extract the tar file again and carry out the preceding steps, replacing '1' with '2', '3', '4', and so on:

$ mv apache-cassandra-0.7.2 apache-cassandra-0.7.2-2

Open ~/hpcas/apache-cassandra-0.7.2-2/conf/cassandra.yaml in a text editor:

  data_file_directories:
      - /home/edward/hpcas/data/2
  commitlog_directory: /home/edward/hpcas/commitlog/2
  saved_caches_directory: /home/edward/hpcas/saved_caches/2
  listen_address: 127.0.0.2
  rpc_address: 127.0.0.2

Edit ~/hpcas/apache-cassandra-0.7.2-2/conf/log4j-server.properties:
```
log4j.appender.R.File=/home/edward/hpcas/log/system2.log
```
Edit ~/hpcas/apache-cassandra-0.7.2-2/conf/cassandra-env.sh:
```
JMX_PORT=8002
```

Start this instance:

$ ~/hpcas/apache-cassandra-0.7.2-2/bin/cassandra

How it works...

The Thrift port has to be the same for all instances in a cluster. Thus, it is impossible to run multiple nodes in the same cluster on one IP address. However, computers have multiple loopback addresses: 127.0.0.1, 127.0.0.2, and so on. These addresses do not usually need to be configured explicitly. Each instance also needs its own storage directories. Following this recipe you can run as many instances on your computer as you wish, or even multiple distinct clusters. You are only limited by resources such as memory, CPU time, and hard disk space.
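
The per-instance settings used in this recipe follow a simple pattern, which the sketch below writes out. The class and method names are invented for illustration. The address scheme 127.0.0.N and the JMX ports 8001, 8002, and so on are taken from the steps above. The loopback check relies on the standard InetAddress class, which parses a literal IP address without a name lookup.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class LoopbackAddresses {
    /** Address for instance n (1-based), following the 127.0.0.n scheme of this recipe. */
    static String addressFor(int instance) {
        return "127.0.0." + instance;
    }

    /** JMX port for instance n, following the 8001, 8002, ... scheme of this recipe. */
    static int jmxPortFor(int instance) {
        return 8000 + instance;
    }

    /** True if the literal IP address lies in the loopback range. */
    static boolean isLoopback(String address) {
        try {
            return InetAddress.getByName(address).isLoopbackAddress();
        } catch (UnknownHostException e) {
            return false; // not a valid address
        }
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 3; i++) {
            String address = addressFor(i);
            System.out.println("instance " + i + ": listen_address=" + address
                + " JMX_PORT=" + jmxPortFor(i) + " loopback=" + isLoopback(address));
        }
    }
}
```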

Scripting a multiple instance installation

Cassandra is an active open source project. Setting up a multiple-node test environment is not complex, but it involves several steps, and small errors are easy to make. Each time you wish to try a new release, the installation process has to be repeated. This recipe achieves the same result as the Running multiple instances on a single machine recipe, but only involves running a single script.

How to do it...

Create a shell script hpcbuild/scripts/ch1/multiple_instances.sh with this content:

#!/bin/bash
# pushd and popd below are bash built-ins, so run this script with bash.
CASSANDRA_TAR=apache-cassandra-0.7.3-bin.tar.gz
TAR_EXTRACTS_TO=apache-cassandra-0.7.3
HIGH_PERF_CAS=${HOME}/hpcas
mkdir ${HIGH_PERF_CAS}
mkdir ${HIGH_PERF_CAS}/commit/
mkdir ${HIGH_PERF_CAS}/data/
mkdir ${HIGH_PERF_CAS}/saved_caches/

Copy the tar to the base directory and then use pushd to change to that directory. The body of the loop runs five times, once per instance:

cp  ${CASSANDRA_TAR} ${HIGH_PERF_CAS}
pushd ${HIGH_PERF_CAS}
for i in 1 2 3 4 5 ; do
  tar -xf ${CASSANDRA_TAR}
  mv ${TAR_EXTRACTS_TO} ${TAR_EXTRACTS_TO}-${i}

Cassandra attempts to auto detect your memory settings based on your system memory. When running multiple instances on a single machine, the memory settings need to be lower:

  sed -i '1 i MAX_HEAP_SIZE="256M"' ${TAR_EXTRACTS_TO}-${i}/conf/cassandra-env.sh
  sed -i '1 i HEAP_NEWSIZE="100M"' ${TAR_EXTRACTS_TO}-${i}/conf/cassandra-env.sh

Replace listen_address and rpc_address with a specific IP, but do not change the seed from 127.0.0.1:

  sed -i "/listen_address\|rpc_address/s/localhost/127.0.0.${i}/g" ${TAR_EXTRACTS_TO}-${i}/conf/cassandra.yaml

Set the data, commit log, and saved_caches directory for this instance:

  sed -i "s|/var/lib/cassandra/data|${HIGH_PERF_CAS}/data/${i}|g" ${TAR_EXTRACTS_TO}-${i}/conf/cassandra.yaml
  sed -i "s|/var/lib/cassandra/commitlog|${HIGH_PERF_CAS}/commit/${i}|g" ${TAR_EXTRACTS_TO}-${i}/conf/cassandra.yaml
  sed -i "s|/var/lib/cassandra/saved_caches|${HIGH_PERF_CAS}/saved_caches/${i}|g" ${TAR_EXTRACTS_TO}-${i}/conf/cassandra.yaml

Change the JMX port for each instance:

  sed -i "s|8080|800${i}|g" ${TAR_EXTRACTS_TO}-${i}/conf/cassandra-env.sh
done
popd

Change the mode of the script to executable and run it:

$ chmod a+x multiple_instances.sh
$ ./multiple_instances.sh

How it works...

This script accomplishes the same tasks as the Running multiple instances on a single machine recipe. It uses Bourne-style shell scripting to handle tasks such as creating directories and extracting tars, and uses the sed utility to locate the sections of each configuration file that need to be modified to correspond to the directories created.
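
To make the substitutions easier to follow, here is a sketch of the same text replacements written in Java and applied to a configuration held in a string. It is only an illustration of what the sed expressions do, not a replacement for the script. The class name YamlRewrite and its method are hypothetical, and the sample input in main is a made-up fragment, not the full cassandra.yaml.

```java
class YamlRewrite {
    /** Applies the same substitutions as the sed commands for instance i. */
    static String rewrite(String yaml, String baseDir, int i) {
        String[] lines = yaml.split("\n", -1);
        for (int n = 0; n < lines.length; n++) {
            String line = lines[n];
            // Like the sed address /listen_address\|rpc_address/: only these lines
            // have localhost replaced, so the seed setting is left untouched.
            if (line.contains("listen_address") || line.contains("rpc_address")) {
                line = line.replace("localhost", "127.0.0." + i);
            }
            // Point the default storage locations at per-instance directories.
            line = line.replace("/var/lib/cassandra/data", baseDir + "/data/" + i)
                       .replace("/var/lib/cassandra/commitlog", baseDir + "/commit/" + i)
                       .replace("/var/lib/cassandra/saved_caches", baseDir + "/saved_caches/" + i);
            lines[n] = line;
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        String sample = "listen_address: localhost\n"
                      + "rpc_address: localhost\n"
                      + "commitlog_directory: /var/lib/cassandra/commitlog";
        System.out.println(rewrite(sample, "/home/edward/hpcas", 2));
    }
}
```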

Setting up a build and test environment for tasks in this book

Cassandra does not have a standardized data access language such as SQL or XPATH. Access to Cassandra is done through the Application Programmer Interface (API). Cassandra has support for Thrift, which generates bindings for a variety of languages. Since Cassandra is written in Java, the Java bindings are well established, part of the Cassandra distribution, and stable. Thus, it makes sense to have a build environment capable of compiling and running Java applications to access Cassandra. This recipe shows you how to set up this environment. Other recipes in the book that involve coding will assume you have this environment set up.

Getting ready

You will need:

The apache-ant build tool (http://ant.apache.org)
Java SDK (http://www.oracle.com/technetwork/java/index.html)
JUnit jar (http://www.junit.org/)

How to do it...

Create a top-level folder and several sub folders for this project:

$ mkdir ~/hpcbuild
$ cd ~/hpcbuild
$ mkdir -p src/{java,test}
$ mkdir lib

Copy JAR files from your Cassandra distribution into the lib directory:
```
$ cp <cassandra-home>/lib/*.jar ~/hpcbuild/lib 
```
From the JUnit installation, copy the junit.jar into your library path. Java applications can use JUnit tests for better code coverage:
```
$ cp <junit-home>/junit*.jar ~/hpcbuild/lib 
```

Create a build.xml file for use with Ant. A build.xml file is similar to a Makefile. By convention, properties that represent critical paths to the build are typically specified at the top of the file:

<project name="hpcas" default="dist" basedir="."> 
  <property name="src" location="src/java"/> 
  <property name="test.src" location="src/test"/> 
  <property name="build" location="build"/> 
  <property name="build.classes" location="build/classes"/> 
  <property name="test.build" location="build/test"/> 
  <property name="dist"  location="dist"/> 
  <property name="lib" location="lib"/>

Ant has tags that help build paths. This is useful for a project that requires multiple JAR files in its classpath to run:

  <path id="hpcas.classpath"> 
    <pathelement location="${build.classes}"/> 
    <fileset dir="${lib}" includes="*.jar"/> 
  </path>

We want to exclude test case classes from the final JAR we produce. Create a separate source and build path for the test cases:

  <path id="hpcas.test.classpath"> 
    <pathelement location="${test.build}"/> 
    <path refid="hpcas.classpath"/> 
  </path>

An Ant target does a unit of work such as compile or run. The init target creates directories that are used in other parts of the build:

  <target name="init"> 
    <mkdir dir="${build}"/> 
    <mkdir dir="${build.classes}"/> 
    <mkdir dir="${test.build}"/> 
  </target>

The compile target builds your code using the javac compiler. If you have any syntax errors, they will be reported at this stage:

  <target name="compile" depends="init"> 
    <javac srcdir="${src}" destdir="${build.classes}"> 
       <classpath refid="hpcas.classpath"/> 
    </javac> 
  </target> 
  <target name="compile-test" depends="init"> 
    <javac srcdir="${test.src}" destdir="${test.build}"> 
      <classpath refid="hpcas.test.classpath"/> 
    </javac> 
  </target>

The test target looks for filenames that match certain naming conventions and executes them as a batch of JUnit tests. In this case, the convention is any file that starts with Test and ends in .class:

  <target name="test" depends="compile,compile-test" > 
    <junit printsummary="yes" showoutput="true" > 
      <classpath refid="hpcas.test.classpath" /> 
      <batchtest> 
        <fileset dir="${test.build}" includes="**/Test*.class" /> 
      </batchtest> 
    </junit> 
  </target>

If the build step succeeds, the dist target creates a final JAR hpcas.jar:

  <target name="dist" depends="compile" > 
    <mkdir dir="${dist}/lib"/> 
    <jar jarfile="${dist}/lib/hpcas.jar" basedir="${build.classes}"/> 
  </target>

The run target will allow us to execute classes we build:

 <target name="run" depends="dist"> 
    <java classname="${classToRun}" > 
      <classpath refid="hpcas.classpath"/> 
    </java> 
  </target>

The clean target is used to remove files left behind from older builds:

  <target name="clean" > 
    <delete dir="${build}"/> 
    <delete dir="${dist}"/> 
  </target> 
</project>

Now that the build.xml file is constructed, we must verify it works as expected. Create small Java applications in both the build and test source paths. The first is a JUnit test in src/test/Test.java:

import junit.framework.*; 
public class Test extends TestCase { 
    public void test() { 
        assertEquals( "Equality Test", 0, 0 ); 
    }
}

Next, write a simple "yo cassandra" program hpcbuild/src/java/A.java:

public class A { 
  public static void main(String [] args){ 
    System.out.println("yo cassandra"); 
  }
}

Call the test target:

$ ant test

Buildfile: /home/edward/hpcbuild/build.xml 
...
    [junit] Running Test 
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.012 sec 
BUILD SUCCESSFUL 
Total time: 5 seconds

Call the dist target. This will compile source code and build a JAR file:
```
$ ant dist 
compile: 
dist: 
      [jar] Building jar: /home/edward/hpcbuild/dist/lib/hpcas.jar 
BUILD SUCCESSFUL 
Total time: 3 seconds 
```
The jar command will build empty JAR files with no indication that you had specified the wrong path. You can use the -tf arguments to verify that the JAR file holds the content you believe it should:
```
$ jar -tf /home/edward/hpcbuild/dist/lib/hpcas.jar 
```
```
META-INF/ 
META-INF/MANIFEST.MF 
A.class 
```

Use the run target to run the A class:

$ ant -DclassToRun=A run

run: 
     [java] yo cassandra
BUILD SUCCESSFUL 
Total time: 2 seconds

How it works...

Ant is a build system popular with Java projects. An Ant script has one or more targets. A target can be a task such as compiling code, testing code, or producing a final JAR. Targets can depend on other targets. As a result, you do not have to run a list of targets sequentially; the dist target will run its dependents such as compile and init and their dependencies in proper order.
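
The way a build tool can turn such dependencies into a run order can be sketched as a depth-first walk. The code below is an illustration of the idea only and is not how Ant itself is implemented. The class TargetOrder and its methods are hypothetical names, and the dependency map in main mirrors the init, compile, dist, and run targets from the build.xml in this recipe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TargetOrder {
    /** Returns the targets to run for the requested target, dependencies first. */
    static List<String> resolve(Map<String, List<String>> depends, String target) {
        Set<String> ordered = new LinkedHashSet<>();
        visit(depends, target, ordered, new HashSet<>());
        return new ArrayList<>(ordered);
    }

    private static void visit(Map<String, List<String>> depends, String target,
                              Set<String> ordered, Set<String> inProgress) {
        if (ordered.contains(target)) {
            return; // already scheduled, so each target runs only once
        }
        if (!inProgress.add(target)) {
            throw new IllegalStateException("Circular dependency at " + target);
        }
        for (String dep : depends.getOrDefault(target, Collections.emptyList())) {
            visit(depends, dep, ordered, inProgress);
        }
        inProgress.remove(target);
        ordered.add(target); // added only after everything it depends on
    }

    public static void main(String[] args) {
        Map<String, List<String>> depends = new HashMap<>();
        depends.put("compile", Arrays.asList("init"));
        depends.put("dist", Arrays.asList("compile"));
        depends.put("run", Arrays.asList("dist"));
        System.out.println(resolve(depends, "run"));
    }
}
```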

There's more...

If you want to work with an IDE, the NetBeans IDE has a type of project called Free-Form project. You can use the preceding build.xml with the Free-Form project type.

Running in the foreground with full debugging

When working with new software or troubleshooting an issue, every piece of information can be valuable. Cassandra has the capability to both run in the foreground and to run with specific debugging levels. This recipe will show you how to run in the foreground with the highest possible debugging level.

How to do it...

Edit conf/log4j-server.properties:
```
log4j.rootLogger=DEBUG,stdout,R 
```
Start the instance in the foreground using -f:
```
$ bin/cassandra -f
```

How it works...

Without the -f option, Cassandra disassociates itself from the starting console and runs like a system daemon. With the -f option, Cassandra runs as a standard Java application.

Log4J has a concept of log levels DEBUG, INFO, WARN, ERROR, and FATAL. Cassandra normally runs at the INFO level.
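
The effect of a level threshold can be sketched without Log4J itself. The enum below simply lists the five levels named above from least to most severe, and the class name LogLevels is made up for this example. A message is written only when its level is at least as severe as the configured level, which is why switching from INFO to DEBUG produces so much more output.

```java
class LogLevels {
    // Declared from least to most severe, so ordinal() reflects severity.
    enum Level { DEBUG, INFO, WARN, ERROR, FATAL }

    /** A message is written when it is at least as severe as the threshold. */
    static boolean isLogged(Level threshold, Level message) {
        return message.ordinal() >= threshold.ordinal();
    }

    public static void main(String[] args) {
        for (Level message : Level.values()) {
            System.out.println(message + " at INFO threshold: " + isLogged(Level.INFO, message)
                + ", at DEBUG threshold: " + isLogged(Level.DEBUG, message));
        }
    }
}
```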

There's more...

Setting a global DEBUG level is only appropriate for testing and troubleshooting because of the overhead incurred by writing many events to a single file. If you have to enable debug in production, try to do it for the smallest set of classes possible, not all org.apache.cassandra classes.

Calculating ideal Initial Tokens for use with Random Partitioner

Cassandra uses Consistent Hashing to divide data across the ring. Each node has an Initial Token, which represents the node's logical position in the ring. Initial Tokens should divide the token range evenly. Using the row key of data, the partitioner calculates a token. The data is stored on the node whose Initial Token is the first one equal to or larger than the data's token, wrapping around to the node with the lowest Initial Token when no such node exists. The other replicas are stored on additional nodes.

Initial Tokens decide who is "responsible for" data.
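
The lookup of the responsible node can be sketched with a sorted map. This is a simplified model and not Cassandra's own code: the class TokenRing and its methods are hypothetical names, and the sketch assumes each node owns the range from its predecessor's token (exclusive) up to its own Initial Token (inclusive), with the range wrapping around the end of the ring. Replica placement is left out.

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

class TokenRing {
    // Initial Token -> node name, kept sorted by token.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(BigInteger initialToken, String node) {
        ring.put(initialToken, node);
    }

    /** Returns the node responsible for the given data token. */
    String primaryFor(BigInteger dataToken) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        // First node whose token is equal to or larger than the data token.
        Map.Entry<BigInteger, String> owner = ring.ceilingEntry(dataToken);
        if (owner == null) {
            owner = ring.firstEntry(); // past the highest token: wrap around
        }
        return owner.getValue();
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode(BigInteger.valueOf(0), "node1");
        ring.addNode(BigInteger.valueOf(100), "node2");
        ring.addNode(BigInteger.valueOf(200), "node3");
        System.out.println(ring.primaryFor(BigInteger.valueOf(150)));
    }
}
```

The small token values in main are chosen for readability; real Initial Tokens for the Random Partitioner are far larger numbers.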

The formula to calculate the ideal Initial Tokens is:

Initial_Token= Zero_Indexed_Node_Number  * ((2^127) / Number_Of_Nodes)

For a five node cluster, the initial token for the 3rd node would be:

initial token=2 * ((2^127) / 5) 
initial token=68056473384187692692674921486353642290

Initial Tokens can be very large numbers. For larger clusters of 20 or more nodes, determining the ideal Initial Token for each node in a cluster is a time consuming process. The following Java program calculates the Initial Tokens for each node in the cluster.

Getting ready

You can easily build and run this example following Setting up a build and test environment earlier in this chapter.

How to do it...

Create a file src/java/hpcas/c01/InitialTokens.java:

package hpcas.c01;

import java.math.BigInteger;

public class InitialTokens {
  public static void main(String[] args) {
    // The number of nodes is read from the "tokens" environment variable.
    if (System.getenv("tokens") == null) {
      System.err.println("Usage: tokens=5 ant -DclassToRun=hpcas.c01.InitialTokens run");
      System.exit(1);
    }
    int nodes = Integer.parseInt(System.getenv("tokens"));
    // Size of each node's share of the 2^127 token range (integer division).
    BigInteger share = BigInteger.valueOf(2).pow(127).divide(BigInteger.valueOf(nodes));
    for (int i = 0; i < nodes; i++) {
      // Initial Token for the zero-indexed node i.
      System.out.println(share.multiply(BigInteger.valueOf(i)));
    }
  }
}

Set the environment variable tokens to the number of nodes in the cluster. Then, call the run target, passing the full class name hpcas.c01.InitialTokens as a Java property:

$ tokens=5 ant -DclassToRun=hpcas.c01.InitialTokens run

run:
     [java] 0
     [java] 34028236692093846346337460743176821145
     [java] 68056473384187692692674921486353642290
     [java] 102084710076281539039012382229530463435
     [java] 136112946768375385385349842972707284580

How it works

Generating numbers equidistant from each other helps keep the amount of data on each node in the cluster balanced. This also keeps the requests per node balanced. When initializing systems running the server for the first time, use these numbers in the initial_token field of the conf/cassandra.yaml file, one value per node.
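One way to check that a set of tokens is balanced is to compute how much of the ring lies between each token and its predecessor. The following is a minimal sketch under that assumption; the class and method names (RingOwnershipSketch, ownership) are hypothetical, and the tokens are expected in ascending order.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

public class RingOwnershipSketch {
  // Size of the Random Partitioner token range used by the formula above
  static final BigInteger RING = BigInteger.valueOf(2).pow(127);

  // Hypothetical helper: for tokens sorted in ascending order, returns the
  // percentage of the ring between each token and the token before it.
  static double[] ownership(BigInteger[] tokens) {
    double[] percent = new double[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      // The predecessor of the first token is the last token (wrap-around)
      BigInteger previous = tokens[(i + tokens.length - 1) % tokens.length];
      BigInteger width = tokens[i].subtract(previous).mod(RING);
      if (width.signum() == 0) {
        width = RING; // a single node covers the whole ring
      }
      percent[i] = new BigDecimal(width)
          .multiply(BigDecimal.valueOf(100))
          .divide(new BigDecimal(RING), 4, RoundingMode.HALF_UP)
          .doubleValue();
    }
    return percent;
  }

  public static void main(String[] args) {
    int nodes = 5;
    BigInteger step = RING.divide(BigInteger.valueOf(nodes));
    BigInteger[] tokens = new BigInteger[nodes];
    for (int i = 0; i < nodes; i++) {
      tokens[i] = step.multiply(BigInteger.valueOf(i));
    }
    for (double share : ownership(tokens)) {
      System.out.println(share + "%");
    }
  }
}
```

With evenly spaced tokens every share comes out at roughly the same value; a noticeably larger share points to a node that will hold more data than its peers.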

There's more...

This technique for calculating Initial Tokens is ideal for the Random Partitioner, which is the default partitioner. When using the Order Preserving Partitioner, imbalances in key distribution may require adjustments to the Initial Tokens to balance out the load.

Choosing Initial Tokens for use with Partitioners that preserve ordering

Some partitioners in Cassandra preserve the ordering of keys. Examples of these partitioners include ByteOrderedPartitioner and OrderPreservingPartitioner. If the distribution of keys is uneven, some nodes will have more data than others. This recipe shows how to choose Initial Tokens for a phonebook dataset while using OrderPreservingPartitioner.

How to do it...

In the conf/cassandra.yaml file, set the partitioner attribute.

org.apache.cassandra.dht.OrderPreservingPartitioner

Determine the approximate distribution of your keys. For names from a phonebook, some letters may be more common than others. Names such as Smith are very common while names such as Capriolo are very rare. For a cluster of eight nodes, choose initial tokens that will divide the list roughly evenly.

A, Ek, J, Mf, Nh, Sf, Su, Tf

Tip

Calculating Distributions

Information on calculating distributions using spreadsheets can be found online: http://www.wisc-online.com/objects/ViewObject.aspx?ID=TMH4604.
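If a representative sample of keys is available, candidate tokens can also be picked programmatically by sorting the sample and taking evenly spaced entries. This is only a sketch of that idea: the class and method names (OrderedTokenSketch, chooseTokens) are hypothetical, the sample is assumed to contain at least as many keys as there are nodes, and the resulting keys can be shortened to prefixes such as those listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OrderedTokenSketch {
  // Hypothetical helper: sorts a sample of row keys and returns one key per
  // node, taken at evenly spaced positions in the sorted sample.
  static List<String> chooseTokens(List<String> sampleKeys, int nodes) {
    List<String> sorted = new ArrayList<>(sampleKeys);
    Collections.sort(sorted);
    List<String> tokens = new ArrayList<>();
    for (int i = 0; i < nodes; i++) {
      // Index of the first key in the i-th equally sized slice of the sample
      int index = (int) ((long) i * sorted.size() / nodes);
      tokens.add(sorted.get(index));
    }
    return tokens;
  }

  public static void main(String[] args) {
    List<String> sample = Arrays.asList(
        "Smith", "Adams", "Young", "Clark", "Smith", "Baker", "Evans", "Davis");
    System.out.println(chooseTokens(sample, 4));
  }
}
```

Common names such as Smith appear many times in a real sample, so the chosen tokens cluster around them, which is the desired effect: busy parts of the key range are split across more nodes.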

How it works...

Partitioners that preserve order can range scan across keys and return data in a natural order. The trade-off is that users and administrators have to plan for and track the distribution of data.

There's more...

If a Cassandra node has already joined the cluster, see the recipe Nodetool Move: Move a node to a specific ring location in Chapter 7, Administration, to learn how to move the node to a new token.

Insight into Cassandra with JConsole

The Java Virtual Machine has an integrated system for interactive monitoring of JVM internals called JMX (Java Management Extensions). In addition to JVM internals, applications can maintain their own counters and provide operations that the user can trigger remotely. Cassandra has numerous counters and the ability to trigger operations such as clearing the Key Cache or disabling compaction over JMX. This recipe shows how to connect to Cassandra instances using JConsole.

Getting ready

JConsole ships with the Java Development Kit (JDK). It needs a windowing system such as X11 on the machine where you start JConsole, not on the server it connects to.

How to do it...

Start JConsole:
```
$ /usr/java/latest/bin/jconsole
```
In the Remote Process box, enter the host and port of your instance.
Click on the Memory tab to view information about the virtual memory being used by the JVM.

How it works...

JConsole can connect to local processes running as your user without host and port information by selecting the process in the Local Process list. Connecting to processes on other machines requires you to enter host and port information in the Remote Process box.
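The values JConsole displays can also be read from code through the standard javax.management API. The sketch below reads the heap usage attribute of the JVM's memory MBean, which is the kind of data shown on the Memory tab. The class and helper names (JmxHeapSketch, serviceUrl, heapUsed) are hypothetical. When started with a host and port it attempts a remote connection, assuming the instance exposes JMX at the usual RMI service URL; without arguments it inspects its own JVM.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxHeapSketch {
  // Hypothetical helper: builds the RMI-based JMX service URL for host:port
  static String serviceUrl(String host, int port) {
    return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
  }

  // Reads the "used" field of the HeapMemoryUsage attribute, in bytes
  static long heapUsed(MBeanServerConnection connection) throws Exception {
    ObjectName memory = new ObjectName("java.lang:type=Memory");
    CompositeData usage =
        (CompositeData) connection.getAttribute(memory, "HeapMemoryUsage");
    return (Long) usage.get("used");
  }

  public static void main(String[] args) throws Exception {
    if (args.length == 2) {
      // Remote case: host and port of the JMX endpoint are given as arguments
      JMXServiceURL url =
          new JMXServiceURL(serviceUrl(args[0], Integer.parseInt(args[1])));
      try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
        System.out.println(heapUsed(connector.getMBeanServerConnection()));
      }
    } else {
      // Local case: query the MBean server of this JVM
      System.out.println(heapUsed(ManagementFactory.getPlatformMBeanServer()));
    }
  }
}
```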

Connecting with JConsole over a SOCKS proxy

Often, you would like to run JConsole on your desktop and connect to a server on a remote network. JMX uses Remote Method Invocation (RMI) to communicate between systems. RMI has an initial connection port, but the server allocates dynamic ports for further communication. Because those ports are not known in advance, applications that use RMI typically have trouble crossing firewalls on more secure networks. This recipe shows how to create a dynamic proxy over SSH and how to have JConsole use the proxy instead of direct connections.

Getting ready

On your management system you will need an SSH client from OpenSSH. This comes standard with almost any Unix/Linux system. Windows users can try Cygwin to get an OpenSSH client.

How to do it...

Start an SSH tunnel to your login server, for example login.domain.com. The -D option allocates the SOCKS proxy:
```
$ ssh -f -D9998 edward@login.domain.com 'while true; do sleep 1; done'
```

Start up JConsole by passing it command-line instructions to use the proxy you created in the last step:

$ jconsole -J-DsocksProxyHost=localhost -J-DsocksProxyPort=9998 \
  service:jmx:rmi:///jndi/rmi://cas1.domain.com:8080/jmxrmi

How it works...

The -D option makes the SSH client listen as a dynamic SOCKS proxy on a local port of your workstation (9998 in this example) and forward connections through the tunnel to the login server, which opens them on your behalf. JConsole is started and configured to use this proxy, so every connection it attempts to open goes through the proxy. Destination hosts see the login server, not your local desktop, as the source of the traffic.
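The socksProxyHost and socksProxyPort options are standard Java networking system properties, so the same tunnel can serve any Java program. As a sketch of the idea, a single socket can also be routed through the proxy explicitly with java.net.Proxy. The class and helper names (SocksProxySketch, socksProxy) are hypothetical, and the host names and ports are the example values from this recipe.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Socket;

public class SocksProxySketch {
  // Hypothetical helper: describes a SOCKS proxy listening on host:port
  static Proxy socksProxy(String host, int port) {
    return new Proxy(Proxy.Type.SOCKS, new InetSocketAddress(host, port));
  }

  public static void main(String[] args) throws Exception {
    // Local end of the SSH tunnel started with: ssh -f -D9998 ...
    Proxy proxy = socksProxy("localhost", 9998);
    try (Socket socket = new Socket(proxy)) {
      // Unresolved address: the intent is to let the far end of the tunnel
      // look up the name, since the desktop may not be able to resolve it
      socket.connect(
          InetSocketAddress.createUnresolved("cas1.domain.com", 8080), 5000);
      System.out.println("Connected through " + proxy.address());
    }
  }
}
```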

Connecting to Cassandra with Java and Thrift

Cassandra clients communicate with servers through API classes generated by Thrift. The API allows clients to perform data manipulation operations as well as gain information about the cluster. This recipe shows how to connect from client to server and call methods that return cluster information.

Getting ready

This recipe is designed to work with the build environment from the recipe Setting up a build and test environment. You also need to have a system running Cassandra, as in the Simulating multiple node clusters recipe.

How to do it...

Create a file src/hpcas/c01/ShowKeyspaces.java:

package hpcas.c01;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.*;
import org.apache.thrift.transport.*;
public class ShowKeyspaces {
  public static void main(String[] args) throws Exception {
    String host = System.getenv("host");
    int port = Integer.parseInt(System.getenv("port"));

The objective is to create a Cassandra.Client instance that can communicate with Cassandra. The Thrift framework requires several steps to instantiate one:

    TSocket socket = new TSocket(host, port);
    TTransport transport = new TFramedTransport(socket);
    TProtocol proto = new TBinaryProtocol(transport);
    transport.open();
    Cassandra.Client client = new Cassandra.Client(proto);

We call methods from the Cassandra.Client that allow the user to inspect the server, such as describing the cluster name and the version:

    System.out.println("version "+client.describe_version());
    System.out.println("partitioner "
    +client.describe_partitioner());
    System.out.println("cluster name "   
    +client.describe_cluster_name());
    for ( String keyspace: client.describe_keyspaces() ){
        System.out.println("keyspace " +keyspace);
    }
    transport.close();
  }
}

Run this application by providing host and port environment variables:

# host=127.0.0.1 port=9160 ant -DclassToRun=hpcas.c01.ShowKeyspaces run 
run: 
     [java] version 10.0.0 
     [java] partitioner org.apache.cassandra.dht.RandomPartitioner 
     [java] cluster name Test Cluster 
     [java] keyspace Keyspace1 
     [java] keyspace system

How it works...

Cassandra clusters are symmetric in that you can connect to any node in the cluster and perform operations. Thrift has a multi-step connection process. After choosing the correct transports and other connection settings, users can instantiate a Cassandra.Client instance. With an instance of the Cassandra.Client, users can call multiple methods without having to reconnect. We called methods such as describe_cluster_name() that return information about the cluster and then disconnected.
