
Chapter 12. Spark SQL in Large-Scale Application Architectures

In this book, we started with the basics of Spark SQL, its components, and its role in Spark applications. Later, we presented a series of chapters focusing on its usage in various types of applications. With the DataFrame/Dataset API and the Catalyst optimizer at the heart of Spark SQL, it is no surprise that it plays a key role in all applications based on the Spark technology stack. These applications include large-scale machine learning, large-scale graphs, and deep learning applications. Additionally, we presented Spark SQL-based Structured Streaming applications that operate in complex environments as continuous applications. In this chapter, we will explore application architectures that leverage Spark modules and Spark SQL in real-world applications.

More specifically, we will cover key architectural components and patterns in large-scale applications that architects and designers will find useful as a starting point...

Understanding Spark-based application architectures


Apache Spark is an emerging platform that leverages distributed storage and processing frameworks to support querying, reporting, analytics, and intelligent applications at scale. Spark SQL has the necessary features, and supports the key mechanisms required, to access data across a set of data sources and formats, and prepare it for downstream applications either with low-latency streaming data or high-throughput historical data stores. The following figure shows a high-level architecture that incorporates these requirements in typical Spark-based batch and streaming applications:

Additionally, as organizations start employing big data and NoSQL-based solutions across a number of projects, a data layer comprising RDBMSes alone is no longer considered the best fit for all the use cases in a modern enterprise application. RDBMS-only architectures, illustrated in the following figure, are rapidly disappearing across the industry, in order...

Understanding the Lambda architecture


The Lambda architectural pattern attempts to combine the best of both worlds: batch processing and stream processing. This pattern consists of several layers: the Batch Layer (ingests and processes data on persistent storage such as HDFS and S3), the Speed Layer (ingests and processes streaming data that has not been processed by the Batch Layer yet), and the Serving Layer (combines outputs from the Batch and Speed Layers to present merged results). This is a popular architecture in Spark environments because it can support both the Batch and Speed Layer implementations with minimal code differences between the two.
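As a minimal sketch of this code-sharing property (the paths, Kafka topic, schema, and the enrich function below are all hypothetical, and in practice the two layers would run as separate jobs), the same transformation can be applied to a batch DataFrame in the Batch Layer and to a streaming DataFrame in the Speed Layer:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object LambdaSketch {
  // Hypothetical event schema shared by both layers
  val eventSchema: StructType = new StructType()
    .add("timestamp", TimestampType)
    .add("event_type", StringType)

  // Shared transformation: identical code for the Batch and Speed Layers
  def enrich(events: DataFrame): DataFrame =
    events
      .withColumn("event_date", to_date(col("timestamp")))
      .groupBy("event_date", "event_type")
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LambdaSketch").getOrCreate()

    // Batch Layer: periodic recomputation over the master dataset on S3/HDFS
    val batchViews = enrich(spark.read.schema(eventSchema).json("s3a://bucket/master/events/"))
    batchViews.write.mode("overwrite").parquet("s3a://bucket/views/batch/")

    // Speed Layer: incremental processing of events the Batch Layer has not seen yet
    val streamEvents = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .select(from_json(col("value").cast("string"), eventSchema).alias("e"))
      .select("e.*")

    enrich(streamEvents).writeStream
      .outputMode("complete")      // Serving Layer merges these results with the batch views
      .format("memory")
      .queryName("speed_views")
      .start()
      .awaitTermination()
  }
}
```

Because both layers call the same enrich function, the risk of the batch and streaming results diverging is limited to the I/O code around them.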

The following figure depicts the Lambda architecture as a combination of batch processing and stream processing:

The next figure shows an implementation of the Lambda architecture using AWS services (Amazon Kinesis, Amazon S3 Storage, Amazon EMR, Amazon DynamoDB, and so on) and Spark:

Note

For more details on the AWS implementation of the Lambda architecture, refer to https:/...

Understanding the Kappa Architecture


The Kappa Architecture is simpler than the Lambda pattern, as it comprises only the Speed and Serving Layers. All computations occur as stream processing, and there are no batch recomputations over the full dataset. Recomputations are only done to support changes and new requirements.

Typically, the incoming real-time data stream is processed in memory and persisted in a database or HDFS to support queries, as illustrated in the following figure:

The Kappa Architecture can be realized by using Apache Spark combined with a queuing solution, such as Apache Kafka. If the required data retention period ranges from a few days to a few weeks, Kafka can also be used to retain the data for that limited period of time.
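A minimal sketch of this combination, assuming a hypothetical events topic, schema, and HDFS paths: a single Structured Streaming job reads from Kafka and persists its results for the Serving Layer, with recomputations handled by replaying the retained Kafka history rather than by a separate batch job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object KappaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KappaSketch").getOrCreate()

    // Hypothetical schema of the JSON records on the topic
    val schema = new StructType()
      .add("timestamp", TimestampType)
      .add("level", StringType)
      .add("message", StringType)

    // Single stream-processing path: everything is treated as a stream
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      // first run starts from the retained history; a recomputation replays it with a fresh checkpoint
      .option("startingOffsets", "earliest")
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("e"))
      .select("e.*")

    // Serving Layer: persist processed results to HDFS/S3 (or a database) for queries
    events.writeStream
      .format("parquet")
      .option("path", "hdfs:///serving/events/")
      .option("checkpointLocation", "hdfs:///checkpoints/kappa/")
      .start()
      .awaitTermination()
  }
}
```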

In the next few sections, we will introduce a few hands-on exercises using Apache Spark, Scala, and Apache Kafka that are very useful in the context of real-world application development. We will start by using Spark SQL and Structured Streaming to implement...

Design considerations for building scalable stream processing applications


Building robust stream processing applications is challenging. The typical challenges associated with stream processing include the following:

  • Complex data: Diverse data formats and the quality of the data create significant challenges for streaming applications. Typically, the data is available in various formats, such as JSON, CSV, AVRO, and binary. Additionally, dirty data, late-arriving data, and out-of-order data can make the design of such applications extremely complex.
  • Complex workloads: Streaming applications need to support a diverse set of application requirements, including interactive queries, machine learning pipelines, and so on.
  • Complex systems: With diverse systems, including Kafka, S3, Kinesis, and so on, system failures can lead to significant reprocessing or bad results.

Stream processing using Spark SQL can be fast, scalable, and fault-tolerant. It provides an extensive set of high-level APIs to deal with complex data and workloads...
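For instance, late-arriving and out-of-order records can be handled declaratively with event-time windows and watermarks in Structured Streaming. The following sketch assumes a hypothetical readings topic, schema, and window/watermark durations:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object LateDataSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LateDataSketch").getOrCreate()

    // Hypothetical schema of device readings arriving as JSON
    val schema = new StructType()
      .add("eventTime", TimestampType)
      .add("device", StringType)
      .add("reading", DoubleType)

    val readings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "readings")
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("r"))
      .select("r.*")

    // Event-time windows plus a watermark: records arriving up to 10 minutes late
    // are still folded into the correct window; anything later is dropped.
    val windowed = readings
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window(col("eventTime"), "5 minutes"), col("device"))
      .agg(avg("reading").alias("avg_reading"))

    windowed.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```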

Building robust ETL pipelines using Spark SQL


ETL pipelines execute a series of transformations on source data to produce cleansed, structured, and ready-for-use output for subsequent processing components. The transformations to be applied to the source data depend on the nature of that data. The input or source data can be structured (RDBMS, Parquet, and so on), semi-structured (CSV, JSON, and so on), or unstructured (text, audio, video, and so on). After being processed through such pipelines, the data is ready for downstream data processing, modeling, analytics, reporting, and so on.

The following figure illustrates an application architecture in which the input data from Kafka, and other sources such as application and server logs, is cleansed and transformed (using an ETL pipeline) before being stored in an enterprise data store. This data store can eventually feed other applications (via Kafka), support interactive queries, store subsets or views of the data in serving databases, train...
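A simplified batch variant of such a pipeline might look like the following sketch, in which the S3 paths and column names (timestamp, level, request_id) are hypothetical: semi-structured JSON logs are extracted, cleansed and normalized, and loaded as partitioned Parquet into the data store:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EtlPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("EtlPipelineSketch").getOrCreate()

    // Extract: semi-structured server logs landed on S3 as JSON
    val raw = spark.read.json("s3a://landing-zone/server-logs/2017/*/")

    // Transform: cleanse and normalize into a structured, analysis-ready form
    val cleansed = raw
      .filter(col("timestamp").isNotNull)                    // drop malformed records
      .withColumn("event_time", to_timestamp(col("timestamp")))
      .withColumn("level", upper(trim(col("level"))))
      .dropDuplicates("request_id")                          // de-duplicate retried events
      .drop("timestamp")

    // Load: write partitioned Parquet into the enterprise data store
    cleansed.write
      .mode("append")
      .partitionBy("level")
      .parquet("s3a://datalake/logs/cleansed/")
  }
}
```

The same transformation logic can be reused in a streaming version of the pipeline by swapping the batch read/write for readStream/writeStream against Kafka, as shown in the earlier sketches.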

Implementing a scalable monitoring solution


Building a scalable monitoring function for large-scale deployments can be challenging as there could be billions of data points captured each day. Additionally, the volume of data and the number of metrics can be difficult to manage without a suitable big data platform with streaming and visualization support.

Voluminous logs collected from applications, servers, network devices, and so on are processed to provide real-time monitoring that helps detect errors, warnings, failures, and other issues. Typically, various daemons, services, and tools are used to collect/send log records to the monitoring system. For example, log entries in the JSON format can be sent to Kafka queues or Amazon Kinesis. These JSON records can then be stored on S3 as files and/or streamed to be analyzed in real time (in a Lambda architecture implementation). Typically, an ETL pipeline is run to cleanse the log data, transform it into a more structured form, and then load it into...
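The real-time path of such a monitoring solution could be sketched as follows, assuming a hypothetical Kafka topic and log schema: JSON log records are parsed, and warning and error counts are aggregated per host over one-minute event-time windows:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object MonitoringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MonitoringSketch").getOrCreate()

    // Hypothetical schema of the JSON log entries
    val logSchema = new StructType()
      .add("ts", TimestampType)
      .add("host", StringType)
      .add("level", StringType)     // e.g. INFO, WARN, ERROR
      .add("message", StringType)

    // Speed path: JSON log records arriving on a Kafka topic
    val logs = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "monitoring-logs")
      .load()
      .select(from_json(col("value").cast("string"), logSchema).alias("log"))
      .select("log.*")

    // Count warnings and errors per host in 1-minute event-time windows
    val alerts = logs
      .filter(col("level").isin("WARN", "ERROR"))
      .withWatermark("ts", "5 minutes")
      .groupBy(window(col("ts"), "1 minute"), col("host"), col("level"))
      .count()

    // Push results to a dashboard or serving store; the console sink is used here for brevity
    alerts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```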

Deploying Spark machine learning pipelines


The following figure illustrates a machine learning pipeline at a conceptual level. However, real-life ML pipelines are a lot more complicated, with several models being trained, tuned, combined, and so on:

The next figure shows the core elements of a typical machine learning application split into two parts: the modeling, including model training, and the deployed model (used on streaming data to output the results):

Typically, data scientists experiment or do their modeling work in Python and/or R. Their work is then reimplemented in Java/Scala before deployment in a production environment. Enterprise production environments often consist of web servers, application servers, databases, middleware, and so on. The conversion of prototypical models to production-ready models results in additional design and development effort that leads to delays in rolling out updated models.

We can use Spark MLlib 2.x model serialization to directly use the models and pipelines...
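The sketch below illustrates the idea with a simple text-classification pipeline (the storage paths and the text/label column names are hypothetical): the fitted pipeline is persisted once and then loaded, unchanged, by the production scoring application:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object MlPipelineDeploySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MlPipelineDeploySketch").getOrCreate()

    // --- Modeling side (typically run by data scientists) ---
    // Hypothetical training data with "text" and "label" columns
    val training = spark.read.parquet("s3a://datalake/training/labeled_text/")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val model: PipelineModel = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
      .fit(training)

    // Persist the fitted pipeline so it can be reused without reimplementation
    model.write.overwrite().save("s3a://models/text-classifier/v1")

    // --- Deployment side (production scoring application) ---
    val deployed = PipelineModel.load("s3a://models/text-classifier/v1")
    val incoming = spark.read.parquet("s3a://datalake/incoming/text/")
    deployed.transform(incoming).select("text", "prediction").show(5)
  }
}
```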

Using cluster managers


In this section, we will briefly discuss cluster managers at a conceptual level. Spark can be deployed through Apache Mesos, YARN, Spark Standalone, or the Kubernetes cluster manager, as depicted:

Mesos can enable easy scalability and replication of data, and is a good unified cluster management solution for heterogeneous workloads.

To use Mesos from Spark, the Spark binaries should be accessible by Mesos and the Spark driver configured to connect to Mesos. Alternatively, you can also install Spark binaries on all the Mesos slaves. The driver creates a job and then issues the tasks for scheduling, while Mesos determines the machines to handle them.

Spark can run over Mesos in two modes: coarse-grained (the default) and fine-grained (deprecated in Spark 2.0.0). In the coarse-grained mode, each Spark executor runs as a single Mesos task. This mode has significantly lower startup overheads, but reserves Mesos resources for the duration of the application. Mesos also supports...
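A minimal configuration sketch for the coarse-grained mode (the Mesos master URL, executor URI, and core limit below are hypothetical) could look like this:

```scala
import org.apache.spark.sql.SparkSession

object MesosConfigSketch {
  def main(args: Array[String]): Unit = {
    // Coarse-grained mode (the default): each executor runs as a single Mesos task.
    // Cap the resources reserved for this application so Mesos can share the cluster.
    val spark = SparkSession.builder
      .appName("MesosConfigSketch")
      .master("mesos://zk://mesos-master:2181/mesos")            // Mesos master via ZooKeeper
      .config("spark.executor.uri",
        "hdfs:///frameworks/spark/spark-2.2.0-bin-hadoop2.7.tgz") // Spark binaries accessible to Mesos
      .config("spark.mesos.coarse", "true")
      .config("spark.cores.max", "8")                            // upper bound on reserved cores
      .getOrCreate()

    // Trivial job to verify the deployment
    spark.range(1000).selectExpr("sum(id)").show()
    spark.stop()
  }
}
```

Alternatively, the same settings can be passed on the command line with spark-submit --master and --conf options instead of being hardcoded in the application.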

Summary


In this chapter, we presented several Spark SQL-based application architectures for building highly-scalable applications. We explored the main concepts and challenges in batch processing and stream processing. We discussed the features of Spark SQL that can help in building robust ETL pipelines. We also presented some code towards building a scalable monitoring application. Additionally, we explored an efficient deployment technique for machine learning pipelines, and some basic concepts involved in using cluster managers such as Mesos and Kubernetes.

In conclusion, this book attempts to help you build a strong foundation in Spark SQL and Scala. However, there are still many areas that you can explore in greater depth to build deeper expertise. Depending on your specific domain, the nature of data and problems could vary widely and your approach to solving them would typically encompass one or more areas described in this book. However, in all cases EDA and data munging skills will...
