The data-processing pipeline architecture


Ask several people from the information technology world and they will agree on few things, except that we are always looking for a new acronym; the year 2015 was no exception.

As this book's title says, SMACK stands for Spark, Mesos, Akka, Cassandra, and Kafka. All of these technologies are open source and, with the exception of Akka, all are Apache Software Foundation projects. The acronym was coined by Mesosphere, a company that bundles these technologies together in a product called Infinity, designed in collaboration with Cisco to solve pipeline data challenges where the speed of response is fundamental, such as in fraud detection engines.

SMACK exists because one technology doesn't make an architecture. SMACK is a pipelined architecture model for data processing. A data pipeline is software that consolidates data from multiple sources and makes it available to be used strategically.

It is called a pipeline because each technology contributes its own characteristics to a processing line, similar to a traditional industrial assembly line. In this context, our canonical reference architecture has four parts: the storage, the message broker, the engine, and the hardware abstraction.

For example, Apache Cassandra alone solves many of the problems that a modern database solves, but its characteristics make it the natural leader for the storage task in our reference architecture.

Similarly, Apache Kafka was designed to be a message broker and, by itself, solves many problems in specific businesses; however, its integration with other tools is what earns it a special place in our reference architecture, ahead of its competitors.

The NoETL manifesto

The acronym ETL stands for Extract, Transform, Load. In its Database Data Warehousing Guide, Oracle says:

Designing and maintaining the ETL process is often considered one of the most difficult and resource intensive portions of a data warehouse project.

For more information, refer to http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm.

Contrary to what many companies' daily operations suggest, ETL is not a goal; it is a step, and in fact a series of unnecessary steps:

  • Each ETL step can introduce errors and risk
  • It can duplicate data after failover
  • Tools can cost millions of dollars
  • It decreases throughput
  • It increases complexity
  • It writes intermediary files
  • It parses and re-parses plain text
  • It duplicates the pattern over all our data centers

NoETL pipelines fit naturally on the SMACK stack: Spark, Mesos, Akka, Cassandra, and Kafka. And if you use SMACK, make sure your pipeline is highly available, resilient, and distributed.

A good sign that you are suffering from ETL-itis is writing intermediary files. Files are useful in day-to-day work, but as a data interchange format they are difficult to handle. Some programmers even advocate replacing the file system with a better API.

Removing the E in ETL: instead of text dumps that must be parsed by multiple systems, technologies such as Scala and Parquet can work with binary data that remains strongly typed; they represent a return to strong typing in the data ecosystem.
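
As a minimal sketch (not from the book) of what this looks like with Spark and Scala: a case class is written to and read back from Parquet as strongly typed, columnar binary data, with no text parsing anywhere. The Purchase type, the values, and the /tmp path are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// A strongly typed record: the schema travels with the binary Parquet
// file, so downstream systems never re-parse plain text.
case class Purchase(userId: Long, amount: Double, ts: Long)

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("no-etl-parquet")
      .master("local[*]") // local mode, for the sketch only
      .getOrCreate()
    import spark.implicits._

    val purchases = Seq(
      Purchase(1L, 19.99, 1480550400000L),
      Purchase(2L, 5.50, 1480550460000L)
    ).toDS()

    // Write once as typed, compressed, columnar binary data...
    purchases.write.parquet("/tmp/purchases.parquet")

    // ...and read it back as the same strong type, no parsing involved.
    val restored = spark.read.parquet("/tmp/purchases.parquet").as[Purchase]
    restored.show()

    spark.stop()
  }
}
```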

Removing the L in ETL: if data collection is backed by a distributed messaging system (Kafka, for example), you can do a real-time fan-out of the ingested data to all consumers. There is no need to batch-load.
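
To illustrate that fan-out, the following sketch uses the plain Apache Kafka Java client from Scala to publish events once to a topic; any number of consumer groups can then subscribe, and each receives the complete stream. The broker address, topic name, and payload are assumptions for the example.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Minimal producer: events are published once to a topic; every consumer
// group that subscribes receives its own complete copy of the stream,
// so there is no separate batch-load step.
object IngestProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed local broker
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord("events", "user-1", """{"amount":19.99}"""))
    producer.flush()
    producer.close()
  }
}
```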

The T in ETL: in this architecture, each consumer performs its own transformations.
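
Continuing the sketch above, each consumer group owns its transformation. The following hypothetical consumer reads the same events topic under its own group.id and applies a placeholder scoring transformation on read; a second group could subscribe to the same topic and transform the data completely differently.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

// Each consumer group sees the full "events" stream and applies its own
// transformation as it reads; no shared, upfront transform step exists.
object FraudScoringConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "fraud-scoring") // a different group.id = a different view
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("events"))

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala) {
        // The "transform" step lives here, owned by this consumer alone.
        val score = record.value().length % 10 // placeholder scoring logic
        println(s"${record.key()} -> fraud score $score")
      }
    }
  }
}
```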

So, the modern tendency is: no more Greek-letter architectures, no more ETL.

Lambda architecture

The academic definition of lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. The problem arises when we need to process data streams in real time.

Here, a special mention goes to two open source projects that allow batch processing and real-time stream processing in the same application: Apache Spark and Apache Flink. There is a battle between the two: Apache Spark is the solution led by Databricks, and Apache Flink is the solution led by data Artisans.

For example, the combination of Apache Spark and Apache Cassandra meets the two modern requirements described previously:

  • It handles a massive data stream in real time
  • It handles multiple and different data models from multiple data sources

Most lambda solutions, as mentioned, cannot meet these two needs at the same time. As a demonstration of this power, in an architecture based only on these two technologies, Apache Spark is responsible for the real-time analysis of both historical data and recent data obtained from the massive information torrent. All this information and the analysis results are persisted in Apache Cassandra, so in the case of failure we can recover the real-time data from any point in time, something that is not always possible with a lambda architecture.
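
Here is a minimal sketch of this pairing, assuming the DataStax spark-cassandra-connector and a locally running Cassandra node with an analytics.event_counts table already created: each Spark Streaming micro-batch is analyzed and the results are persisted to Cassandra, from which they can be recovered after a failure. The socket source, keyspace, and table names are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStreams

// Column names in the assumed analytics.event_counts table match the fields.
case class EventCount(word: String, count: Long)

object StreamToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("stream-to-cassandra")
      .setMaster("local[2]")
      .set("spark.cassandra.connection.host", "localhost") // assumed local node

    val ssc = new StreamingContext(conf, Seconds(5))

    // A socket source keeps the sketch dependency-free; in a full SMACK
    // deployment this stream would normally come from Kafka.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .map { case (word, count) => EventCount(word, count) }
      .saveToCassandra("analytics", "event_counts") // persisted every micro-batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```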

Hadoop

Hadoop was designed to move processing closer to the data in order to minimize the amount of data shuffled across the network. It was designed with data warehouse and batch problems in mind; it fits into the slow data category, where the size, scope, and completeness of the data are more important than the speed of response.

The analogy is the sea versus the waterfall. In a sea of information, you have a huge amount of data, but it is a static, contained, motionless sea, perfect for batch processing without time pressure. In a waterfall, you also have a huge amount of data, but it is dynamic, uncontained, and in motion. In this context, your data often has an expiration date; once that time passes, it is useless.

Some Hadoop adopters have been left questioning the true return on investment of their projects after running them for a while; this is not a fault of the technology itself, but a question of whether it was the right tool for the application. SMACK has to be analyzed in the same way.
