Cassandra High Performance Cookbook

You can mine deep into the full capabilities of Apache Cassandra using the 150+ recipes in this indispensable Cookbook. From configuring and tuning to using third-party applications, this is the ultimate guide.

Edward Capriolo

Book Details

ISBN 13: 9781849515122
Paperback: 310 pages

About This Book

  • Get the best out of Cassandra using this efficient recipe bank
  • Configure and tune Cassandra components to enhance performance
  • Deploy Cassandra in various environments and monitor its performance
  • Well-illustrated, step-by-step recipes to make all tasks look easy!

Who This Book Is For

This book is designed for administrators, developers, and data architects who are interested in Apache Cassandra for redundant, high-performing, and scalable data storage. Typically, these users should have experience working with a database technology, multi-node computer clusters, and high availability solutions.

Table of Contents

Chapter 1: Getting Started
Introduction
A simple single node Cassandra installation
Reading and writing test data using the command-line interface
Running multiple instances on a single machine
Scripting a multiple instance installation
Setting up a build and test environment for tasks in this book
Running in the foreground with full debugging
Calculating ideal Initial Tokens for use with Random Partitioner
Choosing Initial Tokens for use with Partitioners that preserve ordering
Insight into Cassandra with JConsole
Connecting with JConsole over a SOCKS proxy
Connecting to Cassandra with Java and Thrift
Chapter 2: The Command-line Interface
Connecting to Cassandra with the CLI
Creating a keyspace from the CLI
Creating a column family with the CLI
Describing a keyspace
Writing data with the CLI
Reading data with the CLI
Deleting rows and columns from the CLI
Listing and paginating all rows in a column family
Dropping a keyspace or a column family
CLI operations with super columns
Using the assume keyword to decode column names or column values
Supplying time to live information when inserting columns
Using built-in CLI functions
Using column metadata and comparators for type enforcement
Changing the consistency level of the CLI
Getting help from the CLI
Loading CLI statements from a file
Chapter 3: Application Programmer Interface
Introduction
Connecting to a Cassandra server
Creating a keyspace and column family from the client
Using MultiGet to limit round trips and overhead
Writing unit tests with an embedded Cassandra server
Cleaning up data directories before unit tests
Generating Thrift bindings for other languages (C++, PHP, and others)
Using the Cassandra Storage Proxy "Fat Client"
Using range scans to find and remove old data
Iterating all the columns of a large key
Slicing columns in reverse
Batch mutations to improve insert performance and code robustness
Using TTL to create columns with self-deletion times
Working with secondary indexes
Chapter 4: Performance Tuning
Introduction
Choosing an operating system and distribution
Choosing a Java Virtual Machine
Using a dedicated Commit Log disk
Choosing a high performing RAID level
File system optimization for hard disk performance
Boosting read performance with the Key Cache
Boosting read performance with the Row Cache
Disabling Swap Memory for predictable performance
Stopping Cassandra from using swap without disabling it system-wide
Enabling Memory Mapped Disk modes
Tuning Memtables for write-heavy workloads
Saving memory on 64 bit architectures with compressed pointers
Tuning concurrent readers and writers for throughput
Setting compaction thresholds
Garbage collection tuning to avoid JVM pauses
Raising the open file limit to deal with many clients
Increasing performance by scaling up
Chapter 5: Consistency, Availability, and Partition Tolerance with Cassandra
Introduction
Working with the formula for strong consistency
Supplying the timestamp value with write requests
Disabling the hinted handoff mechanism
Adjusting read repair chance for less intensive data reads
Confirming schema agreement across the cluster
Adjusting replication factor to work with quorum
Using write consistency ONE, read consistency ONE for low latency operations
Using write consistency QUORUM, read consistency QUORUM for strong consistency
Mixing levels: write consistency QUORUM, read consistency ONE
Choosing consistency over availability with consistency ALL
Choosing availability over consistency with write consistency ANY
Demonstrating how consistency is not a lock or a transaction
Chapter 6: Schema Design
Introduction
Saving disk space by using small column names
Serializing data into large columns for smaller index sizes
Storing time series data effectively
Using Super Columns for nested maps
Using a lower Replication Factor for disk space saving and performance enhancements
Hybrid Random Partitioner using Order Preserving Partitioner
Storing large objects
Using Cassandra for distributed caching
Storing large or infrequently accessed data in a separate column family
Storing and searching edge graph data in Cassandra
Developing secondary data orderings or indexes
Chapter 7: Administration
Defining seed nodes for Gossip Communication
Nodetool Move: Moving a node to a specific ring location
Nodetool Remove: Removing a downed node
Nodetool Decommission: Removing a live node
Joining nodes quickly with auto_bootstrap set to false
Generating SSH keys for password-less interaction
Copying the data directory to new hardware
A node join using external data copy methods
Nodetool Repair: When to use anti-entropy repair
Nodetool Drain: Stable files on upgrade
Lowering gc_grace for faster tombstone cleanup
Scheduling Major Compaction
Using nodetool snapshot for backups
Clearing snapshots with nodetool clearsnapshot
Restoring from a snapshot
Exporting data to JSON with sstable2json
Nodetool cleanup: Removing excess data
Nodetool Compact: Defragment data and remove deleted data from disk
Chapter 8: Multiple Datacenter Deployments
Changing debugging to determine where read operations are being routed
Using IPTables to simulate complex network scenarios in a local environment
Choosing IP addresses to work with RackInferringSnitch
Scripting a multiple datacenter installation
Determining natural endpoints, datacenter, and rack for a given key
Manually specifying Rack and Datacenter configuration with a property file snitch
Troubleshooting dynamic snitch using JConsole
Quorum operations in multi-datacenter environments
Using traceroute to troubleshoot latency between network devices
Ensuring bandwidth between switches in multiple rack environments
Increasing rpc_timeout for dealing with latency across datacenters
Changing consistency level from the CLI to test various consistency levels with multiple datacenter deployments
Using the consistency levels TWO and THREE
Calculating Ideal Initial Tokens for use with Network Topology Strategy and Random Partitioner
Chapter 9: Coding and Internals
Introduction
Installing common development tools
Building Cassandra from source
Creating your own type by subclassing abstract type
Using the validation to check data on insertion
Communicating with the Cassandra developers and users through IRC and e-mail
Generating a diff using subversion's diff feature
Applying a diff using the patch command
Using strings and od to quickly search through data files
Customizing the sstable2json export utility
Configure index interval ratio for lower memory usage
Increasing phi_convict_threshold for less reliable networks
Using the Cassandra maven plugin
Chapter 10: Libraries and Applications
Introduction
Building the contrib stress tool for benchmarking
Inserting and reading data with the stress tool
Running the Yahoo! Cloud Serving Benchmark
Hector, a high-level client for Cassandra
Doing batch mutations with Hector
Cassandra with Java Persistence Architecture (JPA)
Setting up Solandra for full text indexing with a Cassandra backend
Setting up Zookeeper to support Cages for transactional locking
Using Cages to implement an atomic read and set
Using Groovandra as a CLI alternative
Searchable log storage with Logsandra
Chapter 11: Hadoop and Cassandra
Introduction
A pseudo-distributed Hadoop setup
A Map-only program that reads from Cassandra using the ColumnFamilyInputFormat
A Map-only program that writes to Cassandra using the CassandraOutputFormat
Using MapReduce to do grouping and counting with Cassandra input and output
Setting up Hive with Cassandra Storage Handler support
Defining a Hive table over a Cassandra Column Family
Joining two Column Families with Hive
Grouping and counting column values with Hive
Co-locating Hadoop Task Trackers on Cassandra nodes
Setting up a "Shadow" data center for running only MapReduce jobs
Setting up DataStax Brisk, the combined stack of Cassandra, Hadoop, and Hive
Chapter 12: Collecting and Analyzing Performance Statistics
Finding bottlenecks with nodetool tpstats
Using nodetool cfstats to retrieve column family statistics
Monitoring CPU utilization
Adding read/write graphs to find active column families
Using Memtable graphs to profile when and why they flush
Graphing SSTable count
Monitoring disk utilization and having a performance baseline
Monitoring compaction by graphing its activity
Using nodetool compactionstats to check the progress of compaction
Graphing column family statistics to track average/max row sizes
Using latency graphs to profile time to seek keys
Tracking the physical disk size of each column family over time
Using nodetool cfhistograms to see the distribution of query latencies
Tracking open networking connections
Chapter 13: Monitoring Cassandra Servers
Introduction
Forwarding Log4j logs to a central server
Using top to understand overall performance
Using iostat to monitor current disk performance
Using sar to review performance over time
Using JMXTerm to access Cassandra JMX
Monitoring the garbage collection events
Using tpstats to find bottlenecks
Creating a Nagios Check Script for Cassandra
Keep an eye out for large rows with compaction limits
Reviewing network traffic with IPTraf
Keep on the lookout for dropped messages
Inspecting column families for dangerous conditions

What You Will Learn

  • Interact with Cassandra using the command-line interface
  • Write programs that access data in Cassandra
  • Configure and tune Cassandra components to enhance performance
  • Model data to optimize storage and access
  • Use tunable consistency to optimize data access
  • Deploy Cassandra in single and multiple data center environments
  • Monitor the performance of Cassandra
  • Manage a cluster by joining and removing nodes
  • Use libraries and third-party applications with Cassandra
  • Integrate Cassandra with Hadoop

In Detail

Apache Cassandra is a fault-tolerant, distributed data store that offers linear scalability, making it suitable as a storage platform for large, high-volume websites.

This book provides detailed recipes that describe how to use the features of Cassandra and improve its performance. Recipes cover topics ranging from setting up Cassandra for the first time to complex multiple data center installations. The recipe format presents the information in a concise, actionable form.

The book describes in detail how features of Cassandra can be tuned and what the possible effects of tuning can be. Recipes include how to access data stored in Cassandra and how to use third-party tools to help you out. The book also describes how to monitor Cassandra and plan capacity to ensure it is performing at a high level. Towards the end, it takes you through the use of libraries and third-party applications with Cassandra, and Cassandra integration with Hadoop.
