You're reading from  Fast Data Processing Systems with SMACK Stack

Product type: Book
Published in: Dec 2016
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781786467201
Edition: 1st Edition
Author: Raúl Estrada

Raúl Estrada has been a programmer since 1996 and a Java developer since 2001. He loves all topics related to computer science. With more than 15 years of experience in high-availability and enterprise software, he has been designing and implementing architectures since 2003. His specialization is in systems integration, and he mainly participates in projects related to the financial sector. He has been an enterprise architect for BEA Systems and Oracle Inc., but he also enjoys web, mobile, and game programming. Raúl is a supporter of free software and enjoys experimenting with new technologies, frameworks, languages, and methods. Raúl is the author of other Packt Publishing titles, such as Fast Data Processing Systems with SMACK and Apache Kafka Cookbook.

Chapter 4.  The Storage - Apache Cassandra

We have reached the part where we talk about storage. The C in the SMACK stack refers to Cassandra. The reader may wonder, why not use a conventional database? The answer is that Cassandra is the database that propels giants such as Walmart, CERN, Cisco, Facebook, Netflix, and Twitter. Spark uses a lot of Cassandra's power. Application efficiency is greatly increased using the Spark Cassandra Connector.

This chapter has the following sections:

  • A bit of history
  • NoSQL
  • Apache Cassandra installation
  • Authentication and authorization (roles)
  • Backup
  • Recovery
  • Spark-Cassandra connector

A bit of history


In Greek mythology, there was a priestess who was chastised for her treason against the god Apollo. She asked for the power of prophecy in exchange for a carnal meeting; however, she failed to fulfill her part of the deal. As punishment, she would have the power of prophecy, but no one would ever believe her forecasts. This priestess's name was Cassandra.

Moving to more recent times, say the last 50 years or so, the world of computing has seen big changes. In 1960, the HDD (Hard Disk Drive) took precedence over magnetic tape, facilitating data handling. In 1966, IBM created the Information Management System (IMS), built on a hierarchical model, for the Apollo space program; IBM DB2 was developed later. In the 1970s, a model appeared that fundamentally changed the existing data storage methods: the relational data model. Codd devised it as an alternative to IBM's IMS and its way of organizing and storing data; in 1985, his work presented 12 rules...

NoSQL


In this book, we will read NoSQL as Not only SQL (SQL, Structured Query Language). NoSQL refers to distributed databases with an emphasis on scalability, high availability, and ease of administration, in contrast to established relational databases. Don't think of it as a direct replacement for an RDBMS, but rather as an alternative or a complement. The focus is on avoiding unnecessary complexity and fixed schemas, and on providing data storage suited to today's needs. Due to its distributed nature, cloud computing is a great NoSQL sponsor.

A NoSQL database model can be:

  • Key-value/tuple based

For example, Redis, Oracle NoSQL (ACID-compliant), Riak, Tokyo Cabinet/Tyrant, Voldemort, Amazon Dynamo, and Memcached; used by LinkedIn, Amazon, BestBuy, GitHub, and AOL.

  • Wide row/column-oriented

For example, Google BigTable, Apache Cassandra, HBase/Hypertable, and Amazon SimpleDB; used by Amazon, Google, Facebook, and RealNetworks.

  • Document-based

For example, CouchDB (ACID-compliant), MongoDB, TerraStore...
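To make the models listed above more concrete, here is a minimal Python sketch that stores similar data in each shape. Plain dictionaries and lists stand in for the databases; no real database API is used, and the field names (dept, car, roles) are hypothetical.

```python
# Illustrative only: plain Python structures standing in for three NoSQL models.
# All field names and values are hypothetical.

# Key-value: an opaque value looked up by a single key.
key_value_store = {
    "user:restrada": '{"dept": "hr"}',
}

# Wide row / column-oriented: row key -> columns; rows may have different columns.
wide_row_store = {
    "restrada": {"dept": "hr"},
    "raparicio": {"car": "sedan", "dept": "hr"},
}

# Document: a nested, self-describing structure that can be queried by field.
document_store = [
    {"_id": "restrada", "roles": [{"keyspace": "hr", "perm": "rw"}]},
]


def columns_for(row_key):
    """Return the column names stored for one wide row, in sorted order."""
    return sorted(wide_row_store[row_key])


def find_documents(field, value):
    """Return the documents whose top-level field equals value."""
    return [doc for doc in document_store if doc.get(field) == value]
```

The key-value store can only fetch the whole value by key, the wide-row store lets each row carry its own set of columns, and the document store keeps nested structure that a query can reach into.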

Apache Cassandra installation


Cassandra was born in the Facebook laboratories, where new software is developed out of public view; in this case, it joined two concepts that came from the development departments of Google and Amazon. In short, Cassandra is defined as a distributed database. From the start, the authors set out to create a massively scalable, decentralized database, optimized for read operations when possible, that allows data structures to be modified painlessly and, for all this, is not difficult to manage. The solution was found by combining two existing technologies: Google's BigTable and Amazon's Dynamo. One of the two authors, A. Lakshman, had earlier worked on Dynamo, which contributed the overall distributed architecture, while the data model layout was borrowed from BigTable.

Cassandra is written in Java, and for good performance it requires the latest possible JDK version. Cassandra 1.0 used another open source project, Thrift, for client access; Thrift also came from Facebook...

Authentication and authorization (roles)


In Cassandra, authentication and authorization must be configured in the cassandra.yaml file and two additional files. The first file assigns rights to users over the key space and column family, while the second assigns passwords to users. These files are called access.properties and passwd.properties, and are located in the Cassandra installation directory. They can be edited with any text editor.

Setting up a simple authentication and authorization

Perform the following steps:

  1. In the access.properties file, we add the users' access rights: the permissions to read and write certain key spaces and column families. Syntax:
            keyspace.columnfamily.permits = users 
     
            Example 1: 
            hr <rw> = restrada 
     
            Example 2: 
            hr.cars <ro> = restrada, raparicio 
    
    • In example 1, we give full rights in the Key Space hr to restrada while in example 2 we...
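The rule syntax shown above can be read mechanically. The following Python sketch parses the two example lines into a small structure; it is only an illustration of the format as printed here, not Cassandra's own parser, and the names AccessRule, parse_rule, and can_write are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Matches the two example shapes from the text, e.g. "hr <rw> = restrada".
_RULE = re.compile(r"^\s*([\w.]+)\s*<(rw|ro)>\s*=\s*(.+?)\s*$")


@dataclass
class AccessRule:
    keyspace: str
    column_family: Optional[str]  # None means the rule covers the whole key space
    permission: str               # "rw" (read-write) or "ro" (read-only)
    users: List[str]


def parse_rule(line: str) -> AccessRule:
    """Parse one rule line; raise ValueError if it does not match the pattern."""
    match = _RULE.match(line)
    if match is None:
        raise ValueError(f"unrecognized access rule: {line!r}")
    target, permission, users = match.groups()
    # "hr.cars" -> key space "hr", column family "cars"; "hr" -> key space only.
    keyspace, _, column_family = target.partition(".")
    return AccessRule(
        keyspace=keyspace,
        column_family=column_family or None,
        permission=permission,
        users=[u.strip() for u in users.split(",") if u.strip()],
    )


def can_write(rule: AccessRule, user: str) -> bool:
    """Return True when the rule grants the user read-write access."""
    return rule.permission == "rw" and user in rule.users
```

Parsing example 1 yields a key space-wide rw rule for restrada, while example 2 yields a ro rule on the cars column family for both users, so can_write returns False for it.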

Backup


One reason Cassandra is built as a distributed NoSQL database is that the data written to a single node is also copied to other nodes. How the data is copied, and the exact number of copies, depends on the replication factor established when we create a new key space.
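As a rough illustration of what the replication factor controls, the sketch below places copies of a row on a ring of nodes: the first copy goes to the node that owns the row's token, and the remaining copies go to the next nodes clockwise. This is a simplified model under stated assumptions, not Cassandra's implementation: the MD5-based token_for function, the 32-bit ring size, and the node names are all hypothetical, and real clusters use a configurable partitioner and replication strategy.

```python
import hashlib
from bisect import bisect_left
from typing import List

RING_SIZE = 2 ** 32  # assumed token space for this sketch only


def token_for(value: str) -> int:
    """Map a string to a position on the ring (MD5-based, for illustration)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


def replicas_for(row_key: str, nodes: List[str], replication_factor: int) -> List[str]:
    """Return the nodes that hold copies of row_key.

    The first replica is the node whose token is the first at or after the
    key's token (wrapping around); the other copies go to the next nodes
    clockwise on the ring.
    """
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication factor must be between 1 and the node count")
    ring = sorted((token_for(node), node) for node in nodes)
    tokens = [token for token, _ in ring]
    start = bisect_left(tokens, token_for(row_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]
```

With four nodes and a replication factor of 3, every row ends up on three distinct nodes, so any single node can fail without losing data.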

As with any standard SQL database, Cassandra can also create a backup on the local machine. Cassandra creates a copy of the database using a snapshot. It is possible to make a snapshot of all the key spaces, or of just one column family. It is also possible to make a snapshot of the entire cluster using the parallel SSH tool (pssh).

If the user snapshots the entire cluster, the cluster can later be restored from that snapshot, and incremental backups can be used on each node.

Incremental backups are configured separately on each node by setting the incremental_backups flag to true in cassandra.yaml.

When incremental backups are enabled, Cassandra hard-links each flushed SSTable to a backups directory under the key space data directory...
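The reason hard links make this kind of backup cheap can be shown with a small Python sketch. It creates a stand-in file, hard-links it into a backups directory, and checks that the backup survives removal of the original. This is an illustration of hard links in general, not Cassandra code; the file name cars-1-Data.db and the placement of the backups directory next to the file are simplifying assumptions.

```python
import os
import tempfile


def link_into_backups(sstable_path: str) -> str:
    """Hard-link one file into a 'backups' directory beside it; return the link path."""
    backups_dir = os.path.join(os.path.dirname(sstable_path), "backups")
    os.makedirs(backups_dir, exist_ok=True)
    link_path = os.path.join(backups_dir, os.path.basename(sstable_path))
    os.link(sstable_path, link_path)  # new directory entry, same data on disk
    return link_path


def demo() -> bool:
    """Create a stand-in SSTable, link it, delete the original, check the backup."""
    with tempfile.TemporaryDirectory() as data_dir:
        sstable = os.path.join(data_dir, "cars-1-Data.db")  # hypothetical file name
        with open(sstable, "wb") as handle:
            handle.write(b"flushed rows")
        backup = link_into_backups(sstable)
        same_file = os.path.samefile(sstable, backup)  # both names, one file
        os.remove(sstable)  # the original name goes away
        with open(backup, "rb") as handle:
            return same_file and handle.read() == b"flushed rows"
```

Because a hard link is only a second name for the same data, creating it copies nothing, and the data stays on disk as long as either name exists.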

Recovery


Recovering a key space snapshot requires all the snapshots made for a certain column family. If you use incremental backups, it is also necessary to provide the incremental backups created after the snapshot. There are multiple ways to perform a recovery from a snapshot: we can use the SSTable loader tool (used exclusively on the Linux distribution) or follow the node restart method.

Restart node

If the recovery is running on one node, we must first shut down that node. If the recovery is for the entire cluster, it is necessary to restart each node in the cluster. Here is the procedure:

  1. Shut down the node.
  2. Delete all the log files in: C:\Program Files\DataStax Community\logs.
  3. Delete all .db files within a specified key space and column family: C:\Program Files\DataStax Community\data\data\en\cars.
  4. Locate all snapshots related to the column family: C:\Program Files\DataStax Community\data\data\en\cars\snapshots\1351279613842.
  5. Copy them to: C:\Program Files\DataStax Community...
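The file-handling part of the steps above can be sketched in Python. The function below deletes the current .db files of one column family and copies the files of a chosen snapshot back in. It assumes the node is already shut down and that the snapshot files belong directly in the column family's data directory; the destination in step 5 is cut off in the text, so that target is an assumption, and restore_snapshot is a hypothetical helper rather than a Cassandra tool.

```python
import glob
import os
import shutil
from typing import List


def restore_snapshot(table_dir: str, snapshot_name: str) -> List[str]:
    """Replace the .db files in table_dir with the files of one snapshot.

    Assumes the node is stopped and that snapshot files are restored
    directly into table_dir (an assumption, see the note above).
    """
    snapshot_dir = os.path.join(table_dir, "snapshots", snapshot_name)
    if not os.path.isdir(snapshot_dir):
        raise FileNotFoundError(snapshot_dir)
    # Delete the current .db files (top level only; snapshots stay untouched).
    for path in glob.glob(os.path.join(table_dir, "*.db")):
        os.remove(path)
    # Copy every regular file of the snapshot back into the data directory.
    restored = []
    for name in sorted(os.listdir(snapshot_dir)):
        source = os.path.join(snapshot_dir, name)
        if os.path.isfile(source):
            shutil.copy2(source, os.path.join(table_dir, name))
            restored.append(name)
    return restored
```

The snapshot directory itself is left in place, so the same snapshot can be used again if the restore has to be repeated.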

Spark-Cassandra connector


Now that it is clear how a connection to a Cassandra server is made, we can talk about a very special client. Everything we have seen previously has been directed at reaching this point. We have seen what Spark can do; now we know Cassandra, and we know we can use it as a storage layer to improve Spark performance.

We need a client to achieve this connection, but this client is special because it has been designed specifically for Spark and not for a specific language. This special client is called the Spark Cassandra connector.

Installing the connector

The Spark-Cassandra connector has its own GitHub repository. The latest stable version is in master, but we can access a specific version through its branch.

The Cassandra connector project home page is: https://github.com/datastax/spark-cassandra-connector .

At the time of writing, the most stable connector version is 1.6.0.

The connector is basically a .jar file loaded when Spark starts. So, if you prefer to access...
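Since the connector is a .jar loaded at Spark startup, one common way to load it is to pass it on the command line. The Python sketch below only builds such a command as an argument list; it does not start Spark. The --packages and --conf options and the spark.cassandra.connection.host property are real Spark and connector settings, but the Scala suffix 2.10, the host 127.0.0.1, and the helper names are assumptions chosen for illustration.

```python
from typing import List


def connector_package(version: str = "1.6.0", scala_version: str = "2.10") -> str:
    """Build the Maven coordinate of the connector artifact.

    The Scala suffix is an assumption; it must match the Scala version
    that the Spark installation was built with.
    """
    return f"com.datastax.spark:spark-cassandra-connector_{scala_version}:{version}"


def spark_shell_command(cassandra_host: str = "127.0.0.1") -> List[str]:
    """Return an argument list that starts spark-shell with the connector loaded."""
    return [
        "spark-shell",
        "--packages", connector_package(),  # resolved from a Maven repository
        "--conf", f"spark.cassandra.connection.host={cassandra_host}",
    ]
```

Printing " ".join(spark_shell_command()) gives a line that can be pasted into a terminal. With a locally built .jar, the --jars option followed by the file path would take the place of --packages.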

Summary


NoSQL is not just hype, or a young technology; it is an alternative, with known limitations and capabilities. It is not an RDBMS killer. It's more like a younger brother who is slowly growing up and taking some of the burden. Acceptance is increasing and it will be even better as NoSQL solutions mature. Skepticism may be justified, but only for concrete reasons.

Since Cassandra is an easy and free working environment, suitable for application development, we recommend it, especially with additional utilities that ease and accelerate database administration.

Cassandra has some faults (for example, user authentication and authorization are still insufficiently supported in Windows environments) and should preferably be used when there is a need to store large amounts of data.

For start-up companies that need to manipulate large amounts of data while reducing costs, implementing Cassandra in a Linux environment is a must-have.

In the next chapter, we will explore Kafka, an...
