Fast Data Processing Systems with SMACK Stack

Fast Data Processing Systems with SMACK Stack: Combine the incredible powers of Spark, Mesos, Akka, Cassandra, and Kafka to build data processing platforms that can take on even the hardest of your data troubles!

By Raúl Estrada
Book Dec 2016 376 pages 1st Edition


Product Details


Publication date : Dec 22, 2016
Length : 376 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781786467201

Fast Data Processing Systems with SMACK Stack

Chapter 1.  An Introduction to SMACK

The goal of this chapter is to present the data problems and scenarios that the SMACK architecture solves. It explains how each technology contributes to the stack and how this modern pipeline architecture addresses many of the problems found in today's data-processing environments. We will see when to use SMACK and when it is not suitable, and we will also touch on the new professional profiles created in this new data management era.

In this chapter we will cover the following topics:

  • Modern data-processing challenges
  • The data-processing pipeline architecture
  • SMACK technologies
  • Changing the data center operations
  • Data expert profiles
  • Is SMACK for me?

Modern data-processing challenges


We can enumerate four modern data-processing problems as follows:

  • Size matters: In modern times, data is getting bigger or, more accurately, the number of available data sources is increasing. In the previous decade, we could precisely identify our company's internal data sources: Customer Relationship Management (CRM), Point of Sale (POS), Enterprise Resource Planning (ERP), Supply Chain Management (SCM), and all our databases and legacy systems. Put simply, a system that is not internal is external. Today it is exactly the same, except that not only do the data sources multiply over time, the amount of information flowing from external systems is also growing at an ever-increasing rate. New data sources include social networks, banking systems, stock systems, tracking and geolocation systems, monitoring systems, sensors, and the Internet of Things; if a company's architecture is incapable of handling these use cases, it cannot respond to upcoming challenges.
  • Sample data: Obtaining a sample of production data is becoming more difficult. In the past, data analysts could have a fresh copy of production data on their desks almost daily. Today, this is increasingly difficult, either because of the amount of data to be moved or because of its short shelf life; in many modern business models, data from an hour ago is practically obsolete.
  • Data validity: Analyses become obsolete faster. Assuming that the fresh-copy problem is solved, how often is new data needed? Looking for a trend over the last year is different from looking for one over the last few hours. If samples from a year ago are needed, what is the frequency of these samples? Many modern businesses don't even have this information or, worse, they have it but it is only stored.
  • Data Return on Investment (ROI): Data analysis can become too slow to generate any return on investment from the information. Now, suppose you have solved the problems of sample data and data validity. The challenge is to analyze the information in a timely manner so that the return on all our efforts is profitable. Many companies invest in data but never turn the analysis into increased income.

We can enumerate the modern data needs as follows:

  • Scalable infrastructure: Companies constantly have to weigh the time and money they spend. Scalability in a data center means that the data center should grow in proportion to business growth. Vertical scalability means adding more resources (processing power, memory, storage) to existing machines. Horizontal scalability means that when a layer faces more demand and requires more infrastructure, machines can be added so that processing needs are met. One modern requirement is horizontal scaling with low-cost hardware.
  • Geographically dispersed data centers: Geographically centralized data centers are being displaced. This is because companies need to have multiple data centers in multiple locations for several reasons: cost, ease of administration, or access to users. This implies a huge challenge for data center management. On the other hand, data center unification is a complex task.
  • Allow data volumes to scale as the business needs: The volume of data must scale dynamically according to business demand. Just as demand can spike at certain times of day, it can also spike in certain geographic regions. Scaling should be possible dynamically in both time and space, especially horizontally.
  • Faster processing: Today, being able to work in real time is fundamental. We live in an age where data freshness often matters more than the amount or size of data. If data is not processed fast enough, it becomes stale quickly. Fresh information not only needs to be obtained quickly, it also has to be processed quickly.
  • Complex processing: In the past, data was smaller and simpler. Raw data doesn't help us much on its own; the information must be processed efficiently through several layers. The first layers are usually purely technical and the last layers mainly business-oriented. Processing complexity can kill off even the best business ideas.
  • Constant data flow: For cost reasons, the number of data warehouses is decreasing. The era when data warehouses served merely to store data is dying; today, no one can afford a data warehouse that only stores information, and such warehouses are becoming very expensive and meaningless. The trend is towards flows or streams of data: data no longer stagnates, it moves like a large river. Performing data analysis on these big information torrents is one of the objectives of modern businesses.
  • Visible, reproducible analysis: If we cannot reproduce phenomena, we cannot call ourselves scientists. Modern data science requires producing reports and graphs in real time so that timely decisions can be taken. The aim of data science is to make effective predictions based on observation; the process should be visible and reproducible.

The data-processing pipeline architecture


If you ask several people from the information technology world, they will agree on few things, except that we are always looking for a new acronym, and the year 2015 was no exception.

As the book title says, SMACK stands for Spark, Mesos, Akka, Cassandra, and Kafka. All of these technologies are open source and, with the exception of Akka, all are Apache Software Foundation projects. The acronym was coined by Mesosphere, a company that bundles these technologies together in a product called Infinity, designed in collaboration with Cisco to solve pipeline data challenges where the speed of response is fundamental, such as fraud detection engines.

SMACK exists because one technology doesn't make an architecture. SMACK is a pipelined architecture model for data processing. A data pipeline is software that consolidates data from multiple sources and makes it available to be used strategically.

It is called a pipeline because each technology contributes with its characteristics to a processing line similar to a traditional industrial assembly line. In this context, our canonical reference architecture has four parts: storage, the message broker, the engine, and the hardware abstraction.

For example, Apache Cassandra on its own solves some of the problems that a modern database can solve but, given its characteristics, it takes the lead in the storage role of our reference architecture.

Similarly, Apache Kafka was designed to be a message broker and by itself solves many problems in specific businesses; however, its integration with the other tools earns it a special place in our reference architecture over its competitors.

The NoETL manifesto

The acronym ETL stands for Extract, Transform, Load. In its Database Data Warehousing Guide, Oracle says:

Designing and maintaining the ETL process is often considered one of the most difficult and resource intensive portions of a data warehouse project.

For more information, refer to http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm.

Contrary to how many companies treat it in their daily operations, ETL is not a goal; it is a series of unnecessary steps:

  • Each ETL step can introduce errors and risk
  • It can duplicate data after failover
  • Tools can cost millions of dollars
  • It decreases throughput
  • It increases complexity
  • It writes intermediary files
  • It parses and re-parses plain text
  • It duplicates the pattern over all our data centers

NoETL pipelines fit naturally on the SMACK stack: Spark, Mesos, Akka, Cassandra, and Kafka. And if you use SMACK, make sure it is highly available, resilient, and distributed.

A good sign that you are suffering from ETL-itis is writing intermediary files. Files are useful in day-to-day work but, as data types, they are difficult to handle. Some programmers advocate replacing the file system with a better API.

Removing the E in ETL: Instead of text dumps that you need to parse over multiple systems, technologies such as Scala and Parquet, for example, can work with binary data that remains strongly typed; they represent a return to strong typing in the data ecosystem.

Removing the L in ETL: If data collection is backed by a distributed messaging system (Kafka, for example), you can fan out the ingested data to all consumers in real time. There is no need to batch-load.

The T in ETL: In this architecture, each consumer can perform its own transformations.
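
To make the fan-out idea concrete, here is a minimal Scala sketch of a Kafka consumer that applies its own transformation to the ingested stream, using the standard Kafka client API. The broker address, consumer group, and topic name (ingest-events) are hypothetical placeholders.

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.collection.JavaConverters._

    object FanOutConsumer extends App {
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")   // hypothetical broker address
      props.put("group.id", "reporting-consumers")       // each consumer group receives its own copy of the stream
      props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

      val consumer = new KafkaConsumer[String, String](props)
      consumer.subscribe(Collections.singletonList("ingest-events")) // hypothetical topic

      while (true) {
        val records = consumer.poll(1000)
        for (record <- records.asScala) {
          // This consumer applies its own transformation instead of relying on a central ETL step
          println(s"${record.key} -> ${record.value.toUpperCase}")
        }
      }
    }

Any number of consumer groups can subscribe to the same topic, each receiving its own copy of the stream, which is what removes the need for a separate batch-load step.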

So, the modern tendency is: no more Greek letter architectures, no more ETL.

Lambda architecture

The academic definition of the Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. The problem arises when we need to process data streams in real time.

Here, a special mention for two open source projects that allow batch processing and real-time stream processing in the same application: Apache Spark and Apache Flink. There is a battle between these two: Apache Spark is the solution led by Databricks, and Apache Flink is the solution led by data Artisans.

For example, Apache Spark and Apache Cassandra together meet two of the modern requirements described previously:

  • They handle a massive data stream in real time
  • They handle multiple and different data models from multiple data sources

Most Lambda solutions, as mentioned, cannot meet these two needs at the same time. As a demonstration of power, in an architecture based only on these two technologies, Apache Spark is responsible for the real-time analysis of both historical data and recent data obtained from the massive information torrent. All of this information and the analysis results are persisted in Apache Cassandra, so in the case of failure we can recover the real-time data from any point in time; with a Lambda architecture this is not always possible.
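
As a sketch of this Spark-plus-Cassandra pairing, the snippet below reads a table, aggregates it, and writes the result back, assuming the open source spark-cassandra-connector is on the classpath. The keyspace, table, and column names are hypothetical.

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    object SparkCassandraSketch extends App {
      val conf = new SparkConf()
        .setAppName("stream-and-history-analytics")
        .setMaster("local[2]")                               // local master only for testing
        .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical Cassandra node

      val sc = new SparkContext(conf)

      // Read events persisted by the streaming layer (hypothetical keyspace and table)
      val events = sc.cassandraTable("analytics", "events")

      // Aggregate with ordinary RDD combinators
      val perUser = events
        .map(row => (row.getString("user_id"), 1))
        .reduceByKey(_ + _)

      // Persist the results back to Cassandra so they survive failures
      perUser.saveToCassandra("analytics", "events_per_user", SomeColumns("user_id", "total"))

      sc.stop()
    }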

Hadoop

Hadoop was designed to transfer processing closer to the data to minimize the amount of data shuffled across the network. It was designed with data warehouse and batch problems in mind; it fits into the slow data category, where size, scope, and completeness of data are more important than the speed of response.

The analogy is the sea versus the waterfall. In a sea of information you have a huge amount of data, but it is static, contained, and motionless, perfect for batch processing without time pressure. In a waterfall you also have a huge amount of data, but it is dynamic, uncontained, and in motion. In this context your data often has an expiration date; after that time passes, it is useless.

Some Hadoop adopters have been left questioning the true return on investment of their projects after running for a while; this is not a technological fault itself, but a case of whether it is the right application. SMACK has to be analyzed in the same way.

SMACK technologies


SMACK is a full stack for pipeline data architecture: Spark, Mesos, Akka, Cassandra, and Kafka. Further on in the book, we will also talk about the most important factor: the integration of these technologies.

Pipeline data architecture is required for online data stream processing. There are plenty of books about each technology separately; this book is a compendium of how to integrate these technologies in a pipeline data architecture.

We talk about the five main concepts of pipeline data architecture and how to integrate, replace, and reinforce every layer:

  • The engine: Apache Spark
  • The actor model: Akka
  • The storage: Apache Cassandra
  • The message broker: Apache Kafka
  • The hardware scheduler: Apache Mesos:

Figure 1.1 The SMACK pipeline architecture

Apache Spark

Spark is a fast and general engine for data processing on a large scale.

The Spark goals are:

  • Fast data processing
  • Ease of use
  • Supporting multiple languages
  • Supporting sophisticated analytics
  • Real-time stream processing
  • The ability to integrate with existing Hadoop data
  • An active and expanding community

Here is some chronology:

  • 2009: Spark was initially started by Matei Zaharia at UC Berkeley AMPLab
  • 2010: Spark was open-sourced under a BSD license
  • 2013: Spark was donated to the Apache Software Foundation and its license changed to Apache 2.0
  • 2014: Spark became a top-level Apache Project
  • 2014: The engineering team at Databricks used Spark and set a new world record in large-scale sorting

As you are reading this book, you probably know all the Spark advantages already, but here we mention the most important ones:

  • Spark is faster than Hadoop: Spark makes efficient use of memory and is able to execute equivalent jobs 10 to 100 times faster than Hadoop's MapReduce.
  • Spark is easier to use than Hadoop: You can develop in four languages: Scala, Java, Python, and recently R. Spark is implemented in Scala and Akka. When you work with collections in Spark, it feels as if you are working with local Java, Scala, or Python collections (see the sketch after this list). For practical reasons, in this book we only provide examples in Scala.
  • Spark scales differently than Hadoop: In Hadoop, you require experts and specialized hardware to run monolithic software. In Spark, you can easily grow your cluster horizontally by adding nodes of inexpensive, non-specialized hardware, and Spark provides plenty of tools to manage the cluster.
  • Spark has it all in a single framework: coarse-grained transformations, real-time data-processing functions, SQL-like handling of structured data, graph algorithms, and machine learning.
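
As a taste of that collection-like feel, here is a minimal word-count sketch in Scala; the input path is a placeholder and the local master is only for testing.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount extends App {
      val conf = new SparkConf().setAppName("word-count").setMaster("local[2]") // local master only for testing
      val sc   = new SparkContext(conf)

      // A distributed collection (RDD) manipulated with the same combinators
      // you would use on a local Scala collection
      val counts = sc.textFile("data/events.log")            // hypothetical input path
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      counts.take(10).foreach(println)
      sc.stop()
    }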

It is important to mention that Spark was made with Online Analytical Processing (OLAP) in mind, that is, batch jobs and data mining. Spark was not designed for Online Transaction Processing (OLTP), that is, fast and numerous atomic transactions; for this type of processing, we strongly advise the reader to consider the use of Erlang/Elixir.

Apache Spark has these main components:

  • Spark Core
  • Spark SQL
  • Spark Streaming
  • Spark MLlib
  • Spark GraphX

The reader will find that entire books are normally devoted to each Spark component. In this book, we just cover the essentials of Apache Spark needed for the SMACK stack.

In the SMACK stack, Apache Spark is the data-processing engine; it provides near real-time analysis of data (note the word near, because today processing petabytes of data cannot be done in real time).
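
To illustrate the near real-time point, here is a minimal Spark Streaming sketch that processes data in small micro-batches; the socket source and port are placeholders, and in a SMACK pipeline the input would typically come from Kafka.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingCounts extends App {
      val conf = new SparkConf().setAppName("streaming-counts").setMaster("local[2]")
      val ssc  = new StreamingContext(conf, Seconds(5))      // 5-second micro-batches

      // Placeholder text source; in a SMACK pipeline this would typically be a Kafka stream
      val lines  = ssc.socketTextStream("localhost", 9999)
      val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      counts.print()                                         // emit the counts of each micro-batch

      ssc.start()
      ssc.awaitTermination()
    }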

Akka

Akka is an actor model implementation for the JVM: a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications.

The open source Akka toolkit was first released in 2009. It simplifies the construction of concurrent and distributed Java applications. Language bindings exist for both Java and Scala.

It is message-based and asynchronous; typically no mutable data is shared. It is primarily designed for actor-based concurrency:

  • Actors are arranged hierarchically
  • Each actor is created and supervised by its parent actor
  • Program failures are treated as events and handled by an actor's supervisor
  • It is fault-tolerant
  • It has hierarchical supervision
  • Customizable failure strategies and detection
  • Asynchronous data passing
  • Parallelized
  • Adaptive and predictive
  • Load-balanced
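
To make these characteristics concrete, here is a minimal sketch using Akka's classic actor API; the actor and system names are illustrative only.

    import akka.actor.{Actor, ActorSystem, Props}

    // A child actor: its failures are handled as events by its parent (supervisor)
    class Greeter extends Actor {
      def receive: Receive = {
        case name: String => println(s"Hello, $name")         // one message at a time, no locks needed
      }
    }

    object AkkaSketch extends App {
      val system  = ActorSystem("smack-demo")
      val greeter = system.actorOf(Props[Greeter], "greeter") // created and supervised by the user guardian

      greeter ! "SMACK"                                       // asynchronous, fire-and-forget message passing
      greeter ! "Akka"

      Thread.sleep(500)                                       // give the actor time to process before shutdown
      system.terminate()
    }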

Apache Cassandra

Apache Cassandra is a database with the scalability, availability, and performance necessary to compete with any database system in its class. We know that there are better database systems; however, Apache Cassandra is chosen because of its performance and its connectors built for Spark and Mesos.

In SMACK, Cassandra is the data layer: Akka, Spark, and Kafka can all store data in it. Cassandra also handles operational data and can serve data back to the application layer.

Cassandra is an open source distributed database that handles large amounts of data; originally developed at Facebook and open-sourced in 2008, it became a top-level Apache project in 2010.
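
As a minimal illustration of Cassandra in the storage role, the sketch below uses the DataStax Java driver (3.x era) from Scala; the contact point, keyspace, and table are hypothetical.

    import com.datastax.driver.core.Cluster
    import scala.collection.JavaConverters._

    object CassandraSketch extends App {
      val cluster = Cluster.builder().addContactPoint("127.0.0.1").build() // hypothetical contact point
      val session = cluster.connect()

      session.execute(
        "CREATE KEYSPACE IF NOT EXISTS analytics " +
        "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
      session.execute(
        "CREATE TABLE IF NOT EXISTS analytics.events (id uuid PRIMARY KEY, payload text)")

      session.execute("INSERT INTO analytics.events (id, payload) VALUES (uuid(), 'hello SMACK')")

      for (row <- session.execute("SELECT id, payload FROM analytics.events").asScala)
        println(s"${row.getUUID("id")} -> ${row.getString("payload")}")

      cluster.close()
    }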

Here are some Apache Cassandra features:

  • Extremely fast
  • Extremely scalable
  • Multi-data center support
  • No single point of failure
  • Can survive regional faults
  • Easy to operate
  • Automatic and configurable replication
  • Flexible data modeling
  • Perfect for real-time ingestion
  • Great community

Apache Kafka

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

In SMACK, Kafka is the ingestion point for data, typically at the application layer: it takes data from one or more applications and streams it to the next points in the stack.

Kafka is a high-throughput distributed messaging system that handles massive data loads and absorbs floods of incoming data without the need for a separate back-pressure mechanism. Incoming data volumes are distributed and partitioned across the nodes in the cluster.
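
Here is a minimal producer-side sketch of that ingestion role, using the standard Kafka client API; the broker address and topic name are hypothetical.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object IngestProducer extends App {
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")   // hypothetical broker address
      props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

      val producer = new KafkaProducer[String, String](props)

      // Application-layer events are appended to a topic (name is hypothetical);
      // the record key determines the partition, spreading load across the cluster
      for (i <- 1 to 100)
        producer.send(new ProducerRecord[String, String]("ingest-events", s"user-$i", s"event-$i"))

      producer.close()
    }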

Some Apache Kafka features:

  • High-performance distributed messaging
  • Decouples data pipelines
  • Massive data load handling
  • Supports a massive number of consumers
  • Distribution and partitioning between cluster nodes
  • Broker automatic failover

Apache Mesos

Mesos is a distributed systems kernel. Mesos abstracts all the computer resources (CPU, memory, storage) away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively.

Mesos was built using the same principles as the Linux kernel and was first presented in 2009 (under the name Nexus). Later, in 2011, it was presented by Matei Zaharia.

Mesos is the foundation of several frameworks; the main three are:

  • Apache Aurora
  • Chronos
  • Marathon

In SMACK, Mesos' task is to orchestrate the components and manage resources used.
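
As a small sketch of how a framework hands its work to Mesos, this is how a Spark job can be pointed at a Mesos master instead of a standalone cluster; the ZooKeeper addresses and the executor URI are hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}

    object SmackOnMesos extends App {
      // Hypothetical Mesos master (via ZooKeeper) and Spark distribution URI;
      // spark.executor.uri tells Mesos agents where to fetch Spark to launch executors
      val conf = new SparkConf()
        .setAppName("smack-on-mesos")
        .setMaster("mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos")
        .set("spark.executor.uri", "http://repo.example.com/spark-2.0.2-bin-hadoop2.7.tgz")

      val sc = new SparkContext(conf)
      println(s"Connected to cluster manager: ${sc.master}")
      sc.stop()
    }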

Changing the data center operations


This is the point where data processing changes data center operations.

From scale-up to scale-out

Across businesses, we are moving from specialized, proprietary, and typically expensive supercomputers to clusters of commodity machines connected by a low-cost network.

The Total Cost of Ownership (TCO) determines the fate, quality, and size of a data center. If the business is small, the data center should be small; as business demand changes, the data center grows or shrinks accordingly.

Currently, one common practice is to create a dedicated cluster for each technology: a Spark cluster, a Kafka cluster, a Storm cluster, a Cassandra cluster, and so on, which causes the overall TCO to increase.

The open-source predominance

Modern organizations adopt open source to avoid two old and annoying dependencies: vendor lock-in and external entity bug fixing.

In the past, the rules were dictated from the classically large high-tech enterprises or monopolies. Today, the rules come from the people, for the people; transparency is ensured through community-defined APIs and various bodies, such as the Apache Software Foundation or the Eclipse Foundation, which provide guidelines, infrastructure, and tooling for the sustainable and fair advancement of these technologies.

There is no such thing as a free lunch. In the past, larger enterprises used to hire big companies in order to be able to blame and sue someone in the case of failure. Modern industries should take the risk and invest in training their people in open technologies.

Data store diversification

The dominant and omnipotent era of the relational database is being challenged by the proliferation of NoSQL data stores.

You have to deal with the consequences: determining the system of record, synchronizing different stores, and selecting the correct data store.

Data gravity and data locality

Data gravity refers to the overall cost associated with moving data, in terms of volume and tooling; think, for example, of restoring hundreds of terabytes in a disaster recovery scenario.

Data locality is the idea of bringing the computation to the data rather than the data to the computation. As a rule of thumb, the more services you can run on the same nodes as the data, the better prepared you are.

DevOps rules

DevOps refers to the best practices for collaboration between the software development and operational sides of a company.

The development team should have the same environment for local testing as is used in production. For example, Spark allows you to go from local testing to cluster submission without changing your code.

The tendency is to containerize the entire production pipeline.

Data expert profiles


Well, first we will classify people into four groups based on skill: data architect, data analyst, data engineer, and data scientist.

Usually, data skills are separated into two broad categories:

  1. Engineering skills: All the DevOps (yes, DevOps is the new black): setting up servers and clusters, operating systems, writing/optimizing/distributing queries, network protocol knowledge, programming, and everything related to computer science.
  2. Analytical skills: All mathematical knowledge: statistics, multivariable analysis, matrix algebra, data mining, machine learning, and so on.

Data analysts and data scientists have different skills but usually have the same mission in the enterprise.

Data engineers and data architects have the same skills but usually different work profiles.

Data architects

Large enterprises collect and generate a lot of data from different sources:

  1. Internal sources: Owned systems, for example, CRM, HRM, application servers, web server logs, databases, and so on.
  2. External sources: For example, social network platforms (WhatsApp, Twitter, Facebook, Instagram), stock market feeds, GPS clients, and so on.

Data architects:

  • Understand all these data sources and develop a plan for collecting, integrating, centralizing, and maintaining all the data
  • Know the relationship between data and current operations, and understand the effects that any process change has on the data used in the organization
  • Have an end-to-end vision of the processes, see how a logical design maps to a physical design, and understand how the data flows through every stage in the organization
  • Design data models (for example, relational databases)
  • Develop strategies for all data lifecycles: acquisition, storage, recovery, cleaning, and so on

Data engineers

A data engineer is a hardcore engineer who knows the internals of the data engines (for example, database software).

Data engineers:

  • Can install all the infrastructure (database systems, file systems)
  • Write complex queries (SQL and NoSQL)
  • Scale horizontally to multiple machines and clusters
  • Ensure backups and design and execute disaster recovery plans
  • Usually have low-level expertise in different data engines and database software

Data analysts

Their primary tasks are the compilation and analysis of numerical information.

Data analysts:

  • Have computer science and business knowledge
  • Have analytical insights into all the organization's data
  • Know which information makes sense to the enterprise
  • Translate all this into decent reports so the non-technical people can understand and make decisions
  • Do not usually work with statistics
  • Are present (in specialized roles) in mid-sized organizations, for example, sales analysts, marketing analysts, quality analysts, and so on
  • Can figure out new strategies and report to the decision makers

Data scientists

This is a modern phenomenon and is usually associated with modern data. Their mission is the same as that of a data analyst but, when the frequency, velocity, or volume of data crosses a certain level, this position has specific and sophisticated skills to get those insights out.

Data scientists:

  • Have overlapping skills, including but not limited to: Database system engineering (DB engines, SQL, NoSQL), big data systems handling (Hadoop, Spark), computer language knowledge (R, Python, Scala), mathematics (statistics, multivariable analysis, matrix algebra), data mining, machine learning, and so on
  • Explore and examine data from multiple heterogeneous data sources (unlike data analysts)
  • Can sift through all incoming data to discover a previously hidden insight
  • Can make inductions, deductions, and abductions from data to solve a business problem or find a business pattern (data analysts usually just make inductions from data)
  • The best don't just address known business problems, they find patterns to solve new problems and add value to the organization

As you can deduce, this book is mainly focused on the data architect and data engineer profiles. But if you're an enthusiastic data scientist looking for more wisdom, we hope to be useful to you, too.

Is SMACK for me?


Some large companies are using a variation of SMACK in production, particularly those looking at how to take their pipeline data projects forward.

Apache Spark is beginning to attract more large software vendors to support it as it fulfils different needs than Hadoop.

SMACK is becoming a new modern requirement for companies as they move from the initial pilot phases into relying on pipeline data for their revenues.

The point of this book is to give you alternatives.

One example involves replacing individual components. YARN could be used as the cluster scheduler instead of Mesos, while Apache Flink would be a suitable batch and stream processing alternative to Akka. There are many alternatives to SMACK.

The fundamental premise of SMACK is to build an end-to-end data-processing pipeline whose components interact in a way that makes integration simple and getting tasks up and running quick, rather than requiring huge amounts of effort to get the tools to play nicely with each other.

Summary


This chapter was full of theory. We reviewed the fundamental SMACK architecture. We also reviewed the differences between Spark and traditional big data technologies such as Hadoop and MapReduce.

We also reviewed every technology in SMACK and briefly exposed each tool's potential. Each technology is addressed in depth in the following chapters, where we will explore connectors and integration practices, as well as describe the technology alternatives in every case.


Key benefits

  • This highly practical guide shows you how to use the best of the big data technologies to solve your response-critical problems
  • Learn the art of making cheap-yet-effective big data architecture without using complex Greek-letter architectures
  • Use this easy-to-follow guide to build fast data processing systems for your organization

Description

SMACK is an open source full stack for big data architecture. It is a combination of Spark, Mesos, Akka, Cassandra, and Kafka. This stack is the newest technique developers have begun to use to tackle critical real-time analytics for big data. This highly practical guide will teach you how to integrate these technologies to create a highly efficient data analysis system for fast data processing. We’ll start off with an introduction to SMACK and show you when to use it. First you’ll get to grips with functional thinking and problem solving using Scala. Next you’ll come to understand the Akka architecture. Then you’ll get to know how to improve the data structure architecture and optimize resources using Apache Spark. Moving forward, you’ll learn how to perform linear scalability in databases with Apache Cassandra. You’ll grasp the high throughput distributed messaging systems using Apache Kafka. We’ll show you how to build a cheap but effective cluster infrastructure with Apache Mesos. Finally, you will deep dive into the different aspect of SMACK using a few case studies. By the end of the book, you will be able to integrate all the components of the SMACK stack and use them together to achieve highly effective and fast data processing.

What you will learn

  • Design and implement a fast data pipeline architecture
  • Think about and solve programming challenges in a functional way with Scala
  • Learn to use Akka, the actor model implementation for the JVM
  • Perform in-memory processing and data analysis with Spark to solve modern business demands
  • Build a powerful and effective cluster infrastructure with Mesos and Docker
  • Manage and consume unstructured and NoSQL data sources with Cassandra
  • Consume and produce messages in a massive way with Kafka



Table of Contents

15 Chapters
  • Fast Data Processing Systems with SMACK Stack
  • Credits
  • About the Author
  • About the Reviewers
  • www.PacktPub.com
  • Preface
  • An Introduction to SMACK
  • The Model - Scala and Akka
  • The Engine - Apache Spark
  • The Storage - Apache Cassandra
  • The Broker - Apache Kafka
  • The Manager - Apache Mesos
  • Study Case 1 - Spark and Cassandra
  • Study Case 2 - Connectors
  • Study Case 3 - Mesos and Docker
