You're reading from Fast Data Processing Systems with SMACK Stack

Product type: Book
Published in: Dec 2016
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781786467201
Edition: 1st Edition
Languages: Scala
Tools: Mesos, Apache Spark
Concepts: Data Processing

Author (1)

Raúl Estrada

Chapter 4. The Storage - Apache Cassandra

We have reached the part where we talk about storage. The C in the SMACK stack refers to Cassandra. The reader may wonder, why not use a conventional database? The answer is that Cassandra is the database that propels giants such as Walmart, CERN, Cisco, Facebook, Netflix, and Twitter. Spark uses a lot of Cassandra's power. Application efficiency is greatly increased using the Spark Cassandra Connector.

This chapter has the following sections:

A bit of history
NoSQL
Apache Cassandra installation
Authentication and authorization (roles)
Backup
Recovery
Spark-Cassandra connector

A bit of history

In Greek mythology, there was a priestess who was punished for her treachery toward the god Apollo. She asked for the power of prophecy in exchange for a carnal meeting; however, she failed to fulfill her part of the deal. As punishment, she kept the power of prophecy, but no one would ever believe her forecasts. This priestess's name was Cassandra.

Moving to more recent times, say the last 50 years or so, the world of computing has seen big changes. In 1960, the HDD (Hard Disk Drive) took precedence over magnetic strips, facilitating data handling. In 1966, IBM created the Information Management System (IMS) for the Apollo space program, a hierarchical database that preceded IBM DB2. In the 1970s, a model that fundamentally changed existing data storage methods appeared: the relational data model. It was devised by Codd as an alternative to IBM's IMS and its way of organizing and storing data; in 1985, his work presented 12 rules...

NoSQL

In this book, we will read NoSQL as Not only SQL (SQL, Structured Query Language). NoSQL refers to distributed databases with an emphasis on scalability, high availability, and ease of administration, in contrast to established relational databases. Don't think of it as a direct replacement for an RDBMS but rather as an alternative or a complement. The focus is on avoiding unnecessary complexity and fixed schemas, and on solving data storage according to today's needs. Due to its distributed nature, cloud computing is a great NoSQL sponsor.

A NoSQL database model can be:

Key-value/tuple based

Examples include Redis, Oracle NoSQL (ACID-compliant), Riak, Tokyo Cabinet / Tyrant, Voldemort, Amazon Dynamo, and Memcached; these are used by LinkedIn, Amazon, BestBuy, GitHub, and AOL.

Wide Row/column-oriented-based

Examples include Google BigTable, Apache Cassandra, HBase/Hypertable, and Amazon SimpleDB; these are used by Amazon, Google, Facebook, and RealNetworks.

Document-based

For example, CouchDB (ACID-compliant), MongoDB, TerraStore...

Apache Cassandra installation

New software is developed in Facebook's laboratories out of public view; one example is the joining of two concepts that came from the development departments of Google and Amazon. In short, Cassandra is defined as a distributed database. From the start, its authors set out to create a massively decentralized, scalable database, optimized for read operations when possible, allowing painless modification of data structures, and, for all this, not difficult to manage. The solution was found by combining two existing technologies: Google's BigTable and Amazon's Dynamo. One of the two authors, A. Lakshman, had earlier worked on Dynamo; BigTable supplied the data model layout, while Dynamo contributed the overall distributed architecture.

Cassandra is written in Java, and for good performance it requires the latest possible JDK version. Cassandra 1.0 used another open source project, Thrift, for client access, which also came from Facebook...

Authentication and authorization (roles)

In Cassandra, authentication and authorization are configured in the cassandra.yaml file and two additional files. The first file assigns users rights over key spaces and column families, while the second assigns passwords to users. These files are called access.properties and passwd.properties, and are located in the Cassandra installation directory. They can be edited with any text editor.

Setting up a simple authentication and authorization

Perform the following steps:

In the access.properties file, we add the access rights for users and the permissions to read and write certain key spaces and column families. Syntax:
```
# Syntax:
# keyspace.columnfamily.permits = users

# Example 1:
hr.<rw> = restrada

# Example 2:
hr.cars.<ro> = restrada, raparicio
```
- In example 1, we give full rights in the Key Space hr to restrada while in example 2 we...
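To make the rule format more concrete, here is a minimal Python sketch that parses rules like the two examples above into a lookup table. It is only an illustration of the syntax, not Cassandra's own parser: the function names `parse_access_rules` and `is_allowed` are hypothetical, and the assumption that `<rw>` implies read access is ours.

```python
import re

# Matches "<resource> <ro|rw> = user1, user2"; the resource may be joined to
# the permission marker by a dot or a space (both spellings are accepted here).
RULE = re.compile(r"^([\w.]+?)[.\s]*<(ro|rw)>\s*=\s*(.+)$")


def parse_access_rules(text):
    """Parse access rules into a dict keyed by (resource, mode)."""
    rules = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = RULE.match(line)
        if match is None:
            raise ValueError(f"unrecognized rule: {line!r}")
        resource, mode, users = match.groups()
        rules[(resource, mode)] = [u.strip() for u in users.split(",") if u.strip()]
    return rules


def is_allowed(rules, user, resource, write=False):
    """Check a user against exactly one resource (assumption: rw implies read)."""
    modes = ("rw",) if write else ("ro", "rw")
    return any(user in rules.get((resource, mode), []) for mode in modes)


rules = parse_access_rules("hr <rw> = restrada\nhr.cars <ro> = restrada, raparicio")
print(is_allowed(rules, "raparicio", "hr.cars"))
print(is_allowed(rules, "raparicio", "hr.cars", write=True))
```

The sketch checks each resource in isolation; whether a key space rule also covers its column families is left out on purpose, since the text does not say.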

Backup

Because Cassandra is a distributed NoSQL database, the data written to a node is also copied to other nodes. How the data is copied and the exact number of copies depend on the replication factor established when we create a new key space.
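The effect of the replication factor can be sketched with a toy model. The Python below places nodes on a hash ring and walks clockwise from a key's position until it has collected as many distinct nodes as the replication factor asks for. This is a simplified illustration under our own assumptions (MD5 as the hash, one token per node, hypothetical names `token` and `replicas_for`); it is not Cassandra's actual partitioner or replication strategy.

```python
import hashlib
from bisect import bisect_right


def token(value):
    """Map a string to a ring position (illustrative hash, not Cassandra's)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(key, nodes, replication_factor):
    """Return the nodes that would hold a copy of `key` in this toy ring."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # First node clockwise from the key's position, wrapping around the ring.
    start = bisect_right(tokens, token(key)) % len(ring)
    # There can never be more copies than nodes.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node1", "node2", "node3", "node4"]
print(replicas_for("restrada", nodes, 3))
```

With a replication factor of 3 on four nodes, each key ends up on three distinct nodes, so one node can be lost without losing data.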

As with any standard SQL database, Cassandra can also create a backup on the local machine. Cassandra creates a copy of the database using a snapshot. It is possible to take a snapshot of all the key spaces, or of just one column family. It is also possible to take a snapshot of the entire cluster using the parallel SSH tool (pssh).

If the user decides to snapshot the entire cluster, the cluster can later be restored from that snapshot together with the incremental backups on each node.

Incremental backups are configured separately on each node by setting the incremental_backups flag to true in cassandra.yaml.

When incremental backups are enabled, Cassandra hard-links each flushed SSTable to a backups directory under the key space data directory...
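A hard link is a second directory entry for the same file contents, so this kind of backup takes no extra disk space at the moment it is created. The following Python sketch shows the idea with `os.link`; the directory layout, the file name, and the helper name `link_into_backups` are hypothetical and only mimic what the paragraph above describes.

```python
import os
import tempfile


def link_into_backups(sstable_path):
    """Hard-link a flushed file into a sibling 'backups' directory."""
    data_dir = os.path.dirname(sstable_path)
    backups_dir = os.path.join(data_dir, "backups")
    os.makedirs(backups_dir, exist_ok=True)
    link_path = os.path.join(backups_dir, os.path.basename(sstable_path))
    os.link(sstable_path, link_path)  # new name, same underlying file
    return link_path


with tempfile.TemporaryDirectory() as data_dir:
    # Hypothetical file standing in for a freshly flushed SSTable.
    sstable = os.path.join(data_dir, "cars-1-Data.db")
    with open(sstable, "wb") as handle:
        handle.write(b"flushed rows")
    backup = link_into_backups(sstable)
    print(os.path.samefile(sstable, backup))
```

Because both names point at the same data, deleting the original file later leaves the backup copy intact.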

Recovery

Recovering a key space snapshot requires all the snapshots made for a certain column family. If you use incremental backups, it is also necessary to provide the incremental backups created after the snapshot. There are multiple ways to perform a recovery from a snapshot. We can use the SSTable loader tool (used exclusively on the Linux distribution) or recreate the installation.

Restart node

If the recovery is running on one node, we must first shut down the node. If the recovery is for the entire cluster, it is necessary to restart each node in the cluster. Here is the procedure:

Shut down the node.
Delete all the log files in: C:\Program Files\DataStax Community\logs.
Delete all .db files within a specified key space and column family: C:\Program Files\DataStax Community\data\data\en\cars.
Locate all snapshots related to the column family: C:\Program Files\DataStax Community\data\data\en\cars\snapshots\1351279613842.
Copy them to: C:\Program Files\DataStax Community...
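The file-handling part of the steps above can be sketched in Python as follows. This is a hedged illustration, not a supported restore tool: the function name `restore_from_snapshot` is hypothetical, the node is assumed to be shut down already, and the copy destination is passed in as a parameter because the target path is not fully given in the text.

```python
import shutil
import tempfile
from pathlib import Path


def restore_from_snapshot(table_dir, snapshot_dir, restore_dir):
    """Delete the live .db files, then copy the snapshot files to restore_dir."""
    table_dir = Path(table_dir)
    snapshot_dir = Path(snapshot_dir)
    restore_dir = Path(restore_dir)

    removed = []
    # Non-recursive glob: files inside the snapshots subdirectory are untouched.
    for db_file in list(table_dir.glob("*.db")):
        db_file.unlink()
        removed.append(db_file.name)

    restore_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(snapshot_dir.iterdir()):
        if item.is_file():
            shutil.copy2(item, restore_dir / item.name)
            copied.append(item.name)
    return sorted(removed), copied


with tempfile.TemporaryDirectory() as root:
    # Hypothetical directories standing in for a column family and a snapshot.
    table = Path(root) / "cars"
    snapshot = table / "snapshots" / "demo"
    snapshot.mkdir(parents=True)
    (table / "stale-Data.db").write_text("stale")
    (snapshot / "cars-Data.db").write_text("snapshot")
    print(restore_from_snapshot(table, snapshot, Path(root) / "restored"))
```

Returning the lists of removed and copied file names makes it easy to log or review what a restore run actually touched.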

Spark-Cassandra connector

Now that we are clear on how a connection to a Cassandra server is made, we can talk about a very special client. Everything we have seen previously has been directed at reaching this point. We have seen what Spark can do; now we know Cassandra and we know we can use it as a storage layer to improve Spark performance.

We need a client to make this connection, but this client is special because it was designed specifically for Spark rather than for a specific language. This client is called the Spark-Cassandra connector.

Installing the connector

The Spark-Cassandra connector has its own GitHub repository. The latest stable version is in master, but we can access a specific version through its branch.

The Cassandra connector project home page is: https://github.com/datastax/spark-cassandra-connector .

At the time of writing, the most stable connector version is 1.6.0.

The connector is basically a .jar file loaded when Spark starts. So, if you prefer to access...

Summary

NoSQL is not just hype, or a young technology; it is an alternative, with known limitations and capabilities. It is not an RDBMS killer. It's more like a younger brother who is slowly growing up and taking some of the burden. Acceptance is increasing and it will be even better as NoSQL solutions mature. Skepticism may be justified, but only for concrete reasons.

Since Cassandra is an easy and free working environment, suitable for application development, we recommend it, especially with the additional utilities that ease and accelerate database administration.

Cassandra has some faults (for example, user authentication and authorization are still insufficiently supported in Windows environments) and should preferably be used when there is a need to store large amounts of data.

For start-up companies that need to manipulate large amounts of data while reducing costs, implementing Cassandra in a Linux environment is a must-have.

In the next chapter, we will explore Kafka, an...

Author (1)

Raúl Estrada

Raúl Estrada has been a programmer since 1996 and a Java developer since 2001. He loves all topics related to computer science. With more than 15 years of experience in high-availability and enterprise software, he has been designing and implementing architectures since 2003. His specialization is in systems integration, and he mainly participates in projects related to the financial sector. He has been an enterprise architect for BEA Systems and Oracle Inc., but he also enjoys web, mobile, and game programming. Raúl is a supporter of free software and enjoys experimenting with new technologies, frameworks, languages, and methods. Raúl is the author of other Packt Publishing titles, such as Fast Data Processing Systems with SMACK and Apache Kafka Cookbook.