The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together a fully distributed design and a ColumnFamily-based data model. This article contains recipes that allow users to hit the ground running with Cassandra. Several recipes show how to set up Cassandra and include cursory explanations of the key configuration files. Others cover connecting to Cassandra and executing commands from both the application programming interface and the command-line interface. Java profiling tools such as JConsole are also described. Together, the recipes should help the user understand the basics of running and working with Cassandra.
Cassandra is a highly scalable distributed database. While it is designed to run on multiple production-class servers, it can be installed on desktop computers for functional testing and experimentation. This recipe shows how to set up a single instance of Cassandra.
Visit http://cassandra.apache.org in your web browser and find a link to the latest binary release. New releases happen often. For reference, this recipe will assume apache-cassandra-0.7.2-bin.tar.gz was the name of the downloaded file.
$ mkdir $HOME/downloads
$ cd $HOME/downloads
$ wget <url_from_getting_ready>/apache-cassandra-0.7.2-bin.tar.gz
Default Cassandra storage locations
Cassandra defaults to wanting to save data in /var/lib/cassandra and logs in /var/log/cassandra. These locations will likely not exist and will require root-level privileges to create. To avoid permission issues, carry out the installation in user-writable directories.
$ mkdir $HOME/cassandra/
$ mkdir $HOME/cassandra/{commitlog,log,data,saved_caches}
$ cd $HOME/cassandra/
$ cp $HOME/downloads/apache-cassandra-0.7.2-bin.tar.gz .
$ tar -xf apache-cassandra-0.7.2-bin.tar.gz
$ echo $HOME
/home/edward
This tar file extracts to the apache-cassandra-0.7.2 directory. Open the conf/cassandra.yaml file inside that directory in your text editor and change the following sections:
data_file_directories:
- /home/edward/cassandra/data
commitlog_directory: /home/edward/cassandra/commitlog
saved_caches_directory: /home/edward/cassandra/saved_caches
Edit conf/log4j-server.properties to change where the log file is written:
log4j.appender.R.File=/home/edward/cassandra/log/system.log
$ $HOME/cassandra/apache-cassandra-0.7.2/bin/cassandra
INFO 17:59:26,699 Binding thrift service to /127.0.0.1:9160
INFO 17:59:26,702 Using TFramedTransport with a max frame size of
15728640 bytes.
$ $HOME/cassandra/apache-cassandra-0.7.2/bin/nodetool --host 127.0.0.1 ring
Address Status State Load Token
127.0.0.1 Up Normal 385 bytes 398856952452...
Cassandra comes as a compiled Java application in a tar file. By default, it is configured to store data inside /var. Changing the options in the cassandra.yaml configuration file makes Cassandra use the directories created in this recipe instead.
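The directory setup in this recipe can be collected into one small helper. The following is a minimal sketch, not part of the Cassandra distribution: the function name prepare_storage and its base-directory argument are assumptions made for illustration. It creates the four directories and prints the cassandra.yaml entries that would point at them, so they can be compared with the edited file.

```shell
#!/bin/sh
# prepare_storage is a hypothetical helper, not a Cassandra tool.
# Argument 1: base directory (the recipe uses $HOME/cassandra).
prepare_storage() {
    base="$1"
    # Create the same four directories the recipe creates by hand.
    for dir in commitlog log data saved_caches; do
        mkdir -p "$base/$dir" || return 1
    done
    # Print the matching cassandra.yaml entries for review.
    printf 'data_file_directories:\n'
    printf '    - %s/data\n' "$base"
    printf 'commitlog_directory: %s/commitlog\n' "$base"
    printf 'saved_caches_directory: %s/saved_caches\n' "$base"
}
```

Calling prepare_storage "$HOME/cassandra" reproduces the layout used in this recipe. The function only prints the settings; it does not modify cassandra.yaml.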
YAML: YAML Ain't Markup Language
YAML™ (rhymes with "camel") is a human-friendly, cross-language, Unicode-based data serialization language designed around the common native data types of agile programming languages. It is broadly useful for programming needs ranging from configuration files and Internet messaging to object persistence and data auditing. See http://www.yaml.org for more information.
After startup, Cassandra detaches from the console and runs as a daemon. It opens several ports, including the Thrift port 9160 and the JMX port 8080. In Cassandra 0.8 and later, the default JMX port is 7199. The nodetool program communicates with the JMX port to confirm that the server is alive.
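Because the default JMX port depends on the release, a script that calls nodetool may need to choose the port from the version string. The sketch below is an illustration only: default_jmx_port is a hypothetical function, and it assumes version strings of the form major.minor.patch.

```shell
#!/bin/sh
# default_jmx_port is a hypothetical helper.
# Argument 1: Cassandra version string, for example 0.7.2
default_jmx_port() {
    version="$1"
    major="${version%%.*}"   # text before the first dot
    rest="${version#*.}"     # text after the first dot
    minor="${rest%%.*}"      # second version component
    if [ "$major" -eq 0 ] && [ "$minor" -lt 8 ]; then
        echo 8080            # releases before 0.8
    else
        echo 7199            # 0.8 and later
    fi
}

# Assumed usage with the nodetool command from this recipe:
#   nodetool --host 127.0.0.1 --port "$(default_jmx_port 0.7.2)" ring
```

If the port has been changed in the configuration, the configured value takes precedence over this default.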
Because of the distributed design, many features can only be exercised with multiple instances of Cassandra running. For example, you cannot experiment with a Replication Factor, the setting that controls how many nodes each piece of data is stored on, larger than one. Replication Factor dictates which Consistency Level settings can be used. With one node, the highest Consistency Level is ONE.
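The relationship between Replication Factor and Consistency Level can be made concrete with a little arithmetic. The sketch below assumes the usual definition of QUORUM as a majority of replicas, that is, the integer part of RF / 2 plus 1. The function name replicas_required is hypothetical and covers only three levels.

```shell
#!/bin/sh
# replicas_required is a hypothetical helper.
# Argument 1: consistency level (ONE, QUORUM or ALL)
# Argument 2: replication factor (RF)
replicas_required() {
    level="$1"
    rf="$2"
    case "$level" in
        ONE)    echo 1 ;;
        QUORUM) echo $(( rf / 2 + 1 )) ;;   # majority of the replicas
        ALL)    echo "$rf" ;;
        *)      echo "unknown consistency level: $level" >&2
                return 1 ;;
    esac
}
```

With a Replication Factor of 3, QUORUM needs 2 replicas and ALL needs 3. With a Replication Factor of 1, every level needs exactly 1 replica, which is why a single node cannot show the difference between them.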
The command-line interface (CLI) gives users an interactive tool to communicate with the Cassandra server and execute the same operations that can be done from client code. This recipe takes you through the steps required to insert and read data.
$ <cassandra_home>/bin/cassandra-cli
[default@unknown] connect 127.0.0.1/9160;
Connected to: "Test Cluster" on 127.0.0.1/9160
[default@unknown] create keyspace testkeyspace;
[default@unknown] use testkeyspace;
Authenticated to keyspace: testkeyspace
[default@testkeyspace] create column family testcolumnfamily;
[default@testkeyspace] set testcolumnfamily['thekey']['thecolumn']='avalue';
Value inserted.
[default@testkeyspace] assume testcolumnfamily validator as ascii;
[default@testkeyspace] assume testcolumnfamily comparator as ascii;
[default@testkeyspace] get testcolumnfamily['thekey'];
=> (column=thecolumn, value=avalue, timestamp=1298580528208000)
The CLI is a helpful interactive facade on top of the Cassandra API. After connecting, users can carry out administrative or troubleshooting tasks.
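The statements from this recipe can also be saved to a file and replayed instead of typed one by one. The sketch below only writes that file. The helper name write_cli_script is hypothetical, and the --host, --port, and --file options in the usage comment are assumptions that should be checked against the usage text of your cassandra-cli version.

```shell
#!/bin/sh
# write_cli_script is a hypothetical helper: it saves the statements
# from this recipe to a file so they can be replayed later.
# Argument 1: path of the file to write.
write_cli_script() {
    cat > "$1" <<'EOF'
create keyspace testkeyspace;
use testkeyspace;
create column family testcolumnfamily;
set testcolumnfamily['thekey']['thecolumn']='avalue';
assume testcolumnfamily validator as ascii;
assume testcolumnfamily comparator as ascii;
get testcolumnfamily['thekey'];
EOF
}

# Assumed usage (option names not confirmed by this article):
#   write_cli_script /tmp/testkeyspace.cli
#   <cassandra_home>/bin/cassandra-cli --host 127.0.0.1 --port 9160 --file /tmp/testkeyspace.cli
```

Keeping the statements in a file makes it easy to recreate the same keyspace and column family on a fresh test instance.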
Cassandra is typically deployed on clusters of multiple servers. While it can be run on a single node, simulating a production cluster of multiple nodes is best done by running multiple instances of Cassandra. This recipe is similar to the recipe "A simple single node Cassandra installation" earlier in this article. However, in order to run multiple instances on a single machine, we create a separate set of directories and a modified set of configuration files for each node.
$ ping -c 1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_req=1 ttl=64 time=0.051 ms
$ ping -c 1 127.0.0.2
PING 127.0.0.2 (127.0.0.2) 56(84) bytes of data.
64 bytes from 127.0.0.2: icmp_req=1 ttl=64 time=0.083 ms
$ echo $HOME
/home/edward
$ mkdir $HOME/hpcas/
$ mkdir $HOME/hpcas/{commitlog,log,data,saved_caches}
$ cd $HOME/hpcas/
$ cp $HOME/downloads/apache-cassandra-0.7.2-bin.tar.gz .
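$ tar -xf apache-cassandra-0.7.2-bin.tar.gz
$ mv apache-cassandra-0.7.2 apache-cassandra-0.7.2-1
$ cd apache-cassandra-0.7.2-1
Edit conf/cassandra.yaml: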
data_file_directories:
- /home/edward/hpcas/data/1
commitlog_directory: /home/edward/hpcas/commitlog/1
saved_caches_directory: /home/edward/hpcas/saved_caches/1
listen_address: 127.0.0.1
rpc_address: 127.0.0.1
Each instance will have a separate logfile. This will aid in troubleshooting. Edit conf/log4j-server.properties:
log4j.appender.R.File=/home/edward/hpcas/log/system1.log
Cassandra uses JMX (Java Management Extensions), which allows you to configure an explicit port but always binds to all interfaces on the system. As a result, each instance requires its own management port. Edit conf/cassandra-env.sh:
JMX_PORT=8001
$ ~/hpcas/apache-cassandra-0.7.2-1/bin/cassandra
INFO 17:59:26,699 Binding thrift service to /127.0.0.1:9160
INFO 17:59:26,702 Using TFramedTransport with a max frame size of
15728640 bytes.
$ bin/nodetool --host 127.0.0.1 --port 8001 ring
Address Status State Load Token
127.0.0.1 Up Normal 385 bytes 398856952452...
At this point your cluster consists of a single node. To join other nodes to the cluster, carry out the preceding steps, replacing '1' with '2', '3', '4', and so on:
$ cd $HOME/hpcas/
$ tar -xf apache-cassandra-0.7.2-bin.tar.gz
$ mv apache-cassandra-0.7.2 apache-cassandra-0.7.2-2
data_file_directories:
- /home/edward/hpcas/data/2
commitlog_directory: /home/edward/hpcas/commitlog/2
saved_caches_directory: /home/edward/hpcas/saved_caches/2
listen_address: 127.0.0.2
rpc_address: 127.0.0.2
log4j.appender.R.File=/home/edward/hpcas/log/system2.log
JMX_PORT=8002
$ ~/hpcas/apache-cassandra-0.7.2-2/bin/cassandra
The Thrift port has to be the same for all instances in a cluster. Thus, it is impossible to run multiple nodes in the same cluster on one IP address. However, computers have multiple loopback addresses: 127.0.0.1, 127.0.0.2, and so on. These addresses do not usually need to be configured explicitly. Each instance also needs its own storage directories. Following this recipe you can run as many instances on your computer as you wish, or even multiple distinct clusters. You are only limited by resources such as memory, CPU time, and hard disk space.
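Since every node differs only by its number, the per-node values can be generated rather than typed. The sketch below is one way to do that. The function node_settings is hypothetical, and the JMX port rule (8000 plus the node number) is an assumption extrapolated from the 8001 and 8002 values used in this recipe.

```shell
#!/bin/sh
# node_settings is a hypothetical helper that prints the per-node
# values this recipe edits by hand.
# Argument 1: node number (1, 2, 3, ...)
# Argument 2: base directory (the recipe uses /home/edward/hpcas)
node_settings() {
    n="$1"
    base="$2"
    echo "# conf/cassandra.yaml"
    echo "data_file_directories:"
    echo "    - $base/data/$n"
    echo "commitlog_directory: $base/commitlog/$n"
    echo "saved_caches_directory: $base/saved_caches/$n"
    echo "listen_address: 127.0.0.$n"
    echo "rpc_address: 127.0.0.$n"
    echo "# conf/log4j-server.properties"
    echo "log4j.appender.R.File=$base/log/system$n.log"
    echo "# cassandra-env.sh"
    echo "JMX_PORT=$((8000 + n))"
}
```

For example, node_settings 3 /home/edward/hpcas prints the values for a third instance listening on 127.0.0.3. The output is meant to be copied into the configuration files by hand; the function does not edit them.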