Cassandra High Performance Cookbook

eBook: $26.99
Formats: PDF, PacktLib, ePub and Mobi formats
save 15%!
Print + free eBook + free PacktLib access to the book: $71.98    Print cover: $44.99
save 37%!
Free Shipping!
UK, US, Europe and selected countries in Asia.
  • Get the best out of Cassandra using this efficient recipe bank
  • Configure and tune Cassandra components to enhance performance
  • Deploy Cassandra in various environments and monitor its performance
  • Well illustrated, step-by-step recipes to make all tasks look easy!

Book Details

Language : English
Paperback : 310 pages [ 235mm x 191mm ]
Release Date : July 2011
ISBN : 1849515123
ISBN 13 : 9781849515122
Author(s) : Edward Capriolo
Topics and Technologies : All Books, Big Data and Business Intelligence, Data, Cookbooks, Open Source

Table of Contents

Chapter 1: Getting Started
Chapter 2: The Command-line Interface
Chapter 3: Application Programmer Interface
Chapter 4: Performance Tuning
Chapter 5: Consistency, Availability, and Partition Tolerance with Cassandra
Chapter 6: Schema Design
Chapter 7: Administration
Chapter 8: Multiple Datacenter Deployments
Chapter 9: Coding and Internals
Chapter 10: Libraries and Applications
Chapter 11: Hadoop and Cassandra
Chapter 12: Collecting and Analyzing Performance Statistics
Chapter 13: Monitoring Cassandra Servers
  • Chapter 1: Getting Started
    • Introduction
    • A simple single node Cassandra installation
    • Reading and writing test data using the command-line interface
    • Running multiple instances on a single machine
    • Scripting a multiple instance installation
    • Setting up a build and test environment for tasks in this book
    • Running in the foreground with full debugging
    • Calculating ideal Initial Tokens for use with Random Partitioner
    • Choosing Initial Tokens for use with Partitioners that preserve ordering
    • Insight into Cassandra with JConsole
    • Connecting with JConsole over a SOCKS proxy
    • Connecting to Cassandra with Java and Thrift
  • Chapter 2: The Command-line Interface
    • Connecting to Cassandra with the CLI
    • Creating a keyspace from the CLI
    • Creating a column family with the CLI
    • Describing a keyspace
    • Writing data with the CLI
    • Reading data with the CLI
    • Deleting rows and columns from the CLI
    • Listing and paginating all rows in a column family
    • Dropping a keyspace or a column family
    • CLI operations with super columns
    • Using the assume keyword to decode column names or column values
    • Supplying time to live information when inserting columns
    • Using built-in CLI functions
    • Using column metadata and comparators for type enforcement
    • Changing the consistency level of the CLI
    • Getting help from the CLI
    • Loading CLI statements from a file
  • Chapter 3: Application Programmer Interface
    • Introduction
    • Connecting to a Cassandra server
    • Creating a keyspace and column family from the client
    • Using MultiGet to limit round trips and overhead
    • Writing unit tests with an embedded Cassandra server
    • Cleaning up data directories before unit tests
    • Generating Thrift bindings for other languages (C++, PHP, and others)
    • Using the Cassandra Storage Proxy "Fat Client"
    • Using range scans to find and remove old data
    • Iterating all the columns of a large key
    • Slicing columns in reverse
    • Batch mutations to improve insert performance and code robustness
    • Using TTL to create columns with self-deletion times
    • Working with secondary indexes
  • Chapter 4: Performance Tuning
    • Introduction
    • Choosing an operating system and distribution
    • Choosing a Java Virtual Machine
    • Using a dedicated Commit Log disk
    • Choosing a high performing RAID level
    • File system optimization for hard disk performance
    • Boosting read performance with the Key Cache
    • Boosting read performance with the Row Cache
    • Disabling Swap Memory for predictable performance
    • Stopping Cassandra from using swap without disabling it system-wide
    • Enabling Memory Mapped Disk modes
    • Tuning Memtables for write-heavy workloads
    • Saving memory on 64 bit architectures with compressed pointers
    • Tuning concurrent readers and writers for throughput
    • Setting compaction thresholds
    • Garbage collection tuning to avoid JVM pauses
    • Raising the open file limit to deal with many clients
    • Increasing performance by scaling up
  • Chapter 5: Consistency, Availability, and Partition Tolerance with Cassandra
    • Introduction
    • Working with the formula for strong consistency
    • Supplying the timestamp value with write requests
    • Disabling the hinted handoff mechanism
    • Adjusting read repair chance for less intensive data reads
    • Confirming schema agreement across the cluster
    • Adjusting replication factor to work with quorum
    • Using write consistency ONE, read consistency ONE for low latency operations
    • Using write consistency QUORUM, read consistency QUORUM for strong consistency
    • Mixing levels write consistency QUORUM, read consistency ONE
    • Choosing consistency over availability consistency ALL
    • Choosing availability over consistency with write consistency ANY
    • Demonstrating how consistency is not a lock or a transaction
  • Chapter 6: Schema Design
    • Introduction
    • Saving disk space by using small column names
    • Serializing data into large columns for smaller index sizes
    • Storing time series data effectively
    • Using Super Columns for nested maps
    • Using a lower Replication Factor for disk space saving and performance enhancements
    • Hybrid Random Partitioner using Order Preserving Partitioner
    • Storing large objects
    • Using Cassandra for distributed caching
    • Storing large or infrequently accessed data in a separate column family
    • Storing and searching edge graph data in Cassandra
    • Developing secondary data orderings or indexes
  • Chapter 7: Administration
    • Defining seed nodes for Gossip Communication
    • Nodetool Move: Moving a node to a specific ring location
    • Nodetool Remove: Removing a downed node
    • Nodetool Decommission: Removing a live node
    • Joining nodes quickly with auto_bootstrap set to false
    • Generating SSH keys for password-less interaction
    • Copying the data directory to new hardware
    • A node join using external data copy methods
    • Nodetool Repair: When to use anti-entropy repair
    • Nodetool Drain: Stable files on upgrade
    • Lowering gc_grace for faster tombstone cleanup
    • Scheduling Major Compaction
    • Using nodetool snapshot for backups
    • Clearing snapshots with nodetool clearsnapshot
    • Restoring from a snapshot
    • Exporting data to JSON with sstable2json
    • Nodetool cleanup: Removing excess data
    • Nodetool Compact: Defragment data and remove deleted data from disk
  • Chapter 8: Multiple Datacenter Deployments
    • Changing debugging to determine where read operations are being routed
    • Using IPTables to simulate complex network scenarios in a local environment
    • Choosing IP addresses to work with RackInferringSnitch
    • Scripting a multiple datacenter installation
    • Determining natural endpoints, datacenter, and rack for a given key
    • Manually specifying Rack and Datacenter configuration with a property file snitch
    • Troubleshooting dynamic snitch using JConsole
    • Quorum operations in multi-datacenter environments
    • Using traceroute to troubleshoot latency between network devices
    • Ensuring bandwidth between switches in multiple rack environments
    • Increasing rpc_timeout for dealing with latency across datacenters
    • Changing consistency level from the CLI to test various consistency levels with multiple datacenter deployments
    • Using the consistency levels TWO and THREE
    • Calculating Ideal Initial Tokens for use with Network Topology Strategy and Random Partitioner
  • Chapter 9: Coding and Internals
    • Introduction
    • Installing common development tools
    • Building Cassandra from source
    • Creating your own type by sub classing abstract type
    • Using the validation to check data on insertion
    • Communicating with the Cassandra developers and users through IRC and e-mail
    • Generating a diff using subversion's diff feature
    • Applying a diff using the patch command
    • Using strings and od to quickly search through data files
    • Customizing the sstable2json export utility
    • Configure index interval ratio for lower memory usage
    • Increasing phi_convict_threshold for less reliable networks
    • Using the Cassandra maven plugin
  • Chapter 10: Libraries and Applications
    • Introduction
    • Building the contrib stress tool for benchmarking
    • Inserting and reading data with the stress tool
    • Running the Yahoo! Cloud Serving Benchmark
    • Hector, a high-level client for Cassandra
    • Doing batch mutations with Hector
    • Cassandra with Java Persistence Architecture (JPA)
    • Setting up Solandra for full text indexing with a Cassandra backend
    • Setting up Zookeeper to support Cages for transactional locking
    • Using Cages to implement an atomic read and set
    • Using Groovandra as a CLI alternative
    • Searchable log storage with Logsandra
  • Chapter 11: Hadoop and Cassandra
    • Introduction
    • A pseudo-distributed Hadoop setup
    • A Map-only program that reads from Cassandra using the ColumnFamilyInputFormat
    • A Map-only program that writes to Cassandra using the CassandraOutputFormat
    • Using MapReduce to do grouping and counting with Cassandra input and output
    • Setting up Hive with Cassandra Storage Handler support
    • Defining a Hive table over a Cassandra Column Family
    • Joining two Column Families with Hive
    • Grouping and counting column values with Hive
    • Co-locating Hadoop Task Trackers on Cassandra nodes
    • Setting up a "Shadow" data center for running only MapReduce jobs
    • Setting up DataStax Brisk the combined stack of Cassandra, Hadoop, and Hive
  • Chapter 12: Collecting and Analyzing Performance Statistics
    • Finding bottlenecks with nodetool tpstats
    • Using nodetool cfstats to retrieve column family statistics
    • Monitoring CPU utilization
    • Adding read/write graphs to find active column families
    • Using Memtable graphs to profile when and why they flush
    • Graphing SSTable count
    • Monitoring disk utilization and having a performance baseline
    • Monitoring compaction by graphing its activity
    • Using nodetool compaction stats to check the progress of compaction
    • Graphing column family statistics to track average/max row sizes
    • Using latency graphs to profile time to seek keys
    • Tracking the physical disk size of each column family over time
    • Using nodetool cfhistograms to see the distribution of query latencies
    • Tracking open networking connections
  • Chapter 13: Monitoring Cassandra Servers
    • Introduction
    • Forwarding Log4j logs to a central server
    • Using top to understand overall performance
    • Using iostat to monitor current disk performance
    • Using sar to review performance over time
    • Using JMXTerm to access Cassandra JMX
    • Monitoring the garbage collection events
    • Using tpstats to find bottlenecks
    • Creating a Nagios Check Script for Cassandra
    • Keep an eye out for large rows with compaction limits
    • Reviewing network traffic with IPTraf
    • Keep on the lookout for dropped messages
    • Inspecting column families for dangerous conditions
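
Several recipes listed above concern choosing initial tokens for the Random Partitioner. The following is a minimal sketch of the commonly used even-spacing calculation, not code taken from the book. It assumes the Random Partitioner's token range of 0 to 2**127, and the function name initial_tokens is hypothetical.

```python
# Sketch: evenly spaced initial tokens for a ring using Random Partitioner.
# Assumes the token space runs from 0 up to 2**127 (an assumption stated
# in the lead-in); initial_tokens is an illustrative name only.

RING_SIZE = 2 ** 127


def initial_tokens(node_count):
    """Return one token per node, spaced evenly around the ring."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Integer arithmetic keeps the large token values exact.
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    for node, token in enumerate(initial_tokens(4)):
        print(f"node {node}: initial_token = {token}")
```

Each printed value would be assigned to one node so that every node owns an equal share of the ring.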

                            Edward Capriolo

                            Edward Capriolo, author of Cassandra High Performance Cookbook, is currently a system administrator at Media6degrees, where he helps design and maintain distributed data storage systems for the Internet advertising industry. Edward is a member of the Apache Software Foundation and a committer for the Hadoop-Hive project. He has experience as a developer as well as a Linux and network administrator, and enjoys the rich world of open source software.

                            Submit Errata

                            Please let us know if you have found any errors not listed on this list by completing our errata submission form. Our editors will check them and add them to this list. Thank you.


                            - 4 submitted: last submission 12 Nov 2012

                            Errata type: Typo | Page number: 20

                            The word 'Consistent' is spelled 'Consistant' in the Consistent Hashing illustration at the bottom of the page.

                            Errata type: Others | Page number: 2

                            Chapter 5, Consistency, Availability, and Partition Tolerance with Cassandra: Cassandra is designed from the ground up to store and replicate data across multiple nodes. This chapter has recipes that utilize tunable consistency levels and configure features like read repair. These recipes demonstrate how to use features of Cassandra that make available even in the case of failures or network partitions.

                            should be:

                            Chapter 5, Consistency, Availability, and Partition Tolerance with Cassandra: Cassandra is designed from the ground up to store and replicate data across multiple nodes. This chapter has recipes that demonstrate how features like tunable consistency levels, read repair, and replication factor make data available even in the case of failures or network partitions.


                            Errata type: code | Page number: 89

                            The recipe "Stopping Cassandra from using swap..." mentions the configuration parameter "memory_locking_policy" in the file cassandra.yaml. This property does not exist. The solution is given at


                            What you will learn from this book

                            • Interact with Cassandra using the command line interface
                            • Write programs that access data in Cassandra
                            • Configure and tune Cassandra components to enhance performance
                            • Model data to optimize storage and access
                            • Use tunable consistency to optimize data access
                            • Deploy Cassandra in single and multiple data center environments
                            • Monitor the performance of Cassandra
                            • Manage a cluster by joining and removing nodes
                            • Use libraries and third party applications with Cassandra
                            • Integrate Cassandra with Hadoop
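
The tunable consistency mentioned in the list above is often summarised by a rule of thumb: a read is guaranteed to overlap the latest acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N). The sketch below illustrates that arithmetic only. It is not code from the book, and the function names are hypothetical.

```python
# Sketch of the R + W > N rule for strong consistency.
# quorum() and is_strongly_consistent() are illustrative names only.


def quorum(replication_factor):
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    print("QUORUM/QUORUM:", is_strongly_consistent(q, q, n))
    print("ONE/ONE:", is_strongly_consistent(1, 1, n))
    print("QUORUM write, ONE read:", is_strongly_consistent(1, q, n))
```

With a replication factor of 3, only the QUORUM/QUORUM pairing satisfies the rule. The other two combinations trade that guarantee for lower latency.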

                            In Detail

                            Apache Cassandra is a fault-tolerant, distributed data store that offers linear scalability, making it a suitable storage platform for large, high-volume websites.

                            This book provides detailed recipes that describe how to use the features of Cassandra and improve its performance. Recipes cover topics ranging from setting up Cassandra for the first time to complex multiple data center installations. The recipe format presents the information in a concise, actionable form.

                            The book describes in detail how features of Cassandra can be tuned and what the possible effects of tuning can be. Recipes include how to access data stored in Cassandra and how to use third party tools to help you out. The book also describes how to monitor Cassandra and plan capacity so that it keeps performing at a high level. Towards the end, it takes you through the use of libraries and third party applications with Cassandra, and the integration of Cassandra with Hadoop.


                            This is a cookbook and all tasks are approached as recipes. A recipe describes a task and outlines the steps necessary to complete this task.

                            Some recipes in the book are examples of writing code, such as a recipe that stores and accesses the entries of a phone book in Cassandra. Such a recipe describes the program, gives a full code example, runs the example, and displays the output. Finally, the 'How it works' section describes the process or code in greater detail.

                            Other recipes in the book describe a task, such as taking a snapshot backup of data in Cassandra. This recipe describes the process, shows how to run the snapshot command and confirm that it worked, and then explains what the snapshot command does behind the scenes. Finally, the 'See also' section references related recipes, such as the recipe to restore a snapshot.

                            Who this book is for

                            This book is designed for administrators, developers, and data architects who are interested in Apache Cassandra for redundant, high-performing, and scalable data storage. Readers should typically have experience working with a database technology, multiple-node computer clusters, and high availability solutions.
