
Chapter 12. Spark SQL in Large-Scale Application Architectures

In this book, we started with the basics of Spark SQL, its components, and its role in Spark applications. Later, we presented a series of chapters focusing on its usage in various types of applications. With the DataFrame/Dataset API and the Catalyst optimizer at the heart of Spark SQL, it is no surprise that it plays a key role in all applications based on the Spark technology stack. These applications include large-scale machine learning, large-scale graphs, and deep learning applications. Additionally, we presented Spark SQL-based Structured Streaming applications that operate in complex environments as continuous applications. In this chapter, we will explore application architectures that leverage Spark modules and Spark SQL in real-world applications.

More specifically, we will cover key architectural components and patterns in large-scale applications that architects and designers will find useful as a starting point...

Understanding Spark-based application architectures


Apache Spark is an emerging platform that leverages distributed storage and processing frameworks to support querying, reporting, analytics, and intelligent applications at scale. Spark SQL has the necessary features, and supports the key mechanisms required, to access data across a set of data sources and formats, and prepare it for downstream applications either with low-latency streaming data or high-throughput historical data stores. The following figure shows a high-level architecture that incorporates these requirements in typical Spark-based batch and streaming applications:

Additionally, as organizations start employing big data and NoSQL-based solutions across a number of projects, a data layer comprising RDBMSes alone is no longer considered the best fit for all the use cases in a modern enterprise application. RDBMS-only architectures, illustrated in the following figure, are rapidly disappearing across the industry, in order...

Understanding the Lambda architecture


The Lambda architectural pattern attempts to combine the best of both worlds: batch processing and stream processing. This pattern consists of several layers: the Batch Layer (ingests and processes data on persistent storage such as HDFS and S3), the Speed Layer (ingests and processes streaming data that has not been processed by the Batch Layer yet), and the Serving Layer (combines outputs from the Batch and Speed Layers to present merged results). This is a popular architecture in Spark environments because it can support both the Batch and Speed Layer implementations with minimal code differences between the two.
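As a minimal sketch of this code-sharing property (the paths, Kafka topic, schema, and the enrich function below are all hypothetical, and in practice the two layers would run as separate jobs), the same transformation can be applied to a batch DataFrame in the Batch Layer and to a streaming DataFrame in the Speed Layer:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object LambdaSketch {
  // Hypothetical event schema shared by both layers
  val eventSchema: StructType = new StructType()
    .add("timestamp", TimestampType)
    .add("event_type", StringType)

  // Shared transformation: identical code for the Batch and Speed Layers
  def enrich(events: DataFrame): DataFrame =
    events
      .withColumn("event_date", to_date(col("timestamp")))
      .groupBy("event_date", "event_type")
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LambdaSketch").getOrCreate()

    // Batch Layer: periodic recomputation over the master dataset on S3/HDFS
    val batchViews = enrich(spark.read.schema(eventSchema).json("s3a://bucket/master/events/"))
    batchViews.write.mode("overwrite").parquet("s3a://bucket/views/batch/")

    // Speed Layer: incremental processing of events the Batch Layer has not seen yet
    val streamEvents = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .select(from_json(col("value").cast("string"), eventSchema).alias("e"))
      .select("e.*")

    enrich(streamEvents).writeStream
      .outputMode("complete")      // Serving Layer merges these results with the batch views
      .format("memory")
      .queryName("speed_views")
      .start()
      .awaitTermination()
  }
}
```

Because both layers call the same enrich function, the risk of the batch and streaming results diverging is limited to the I/O code around them.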

The following figure depicts the Lambda architecture as a combination of batch processing and stream processing:

The next figure shows an implementation of the Lambda architecture using AWS services (Amazon Kinesis, Amazon S3 Storage, Amazon EMR, Amazon DynamoDB, and so on) and Spark:

Note

For more details on the AWS implementation of the Lambda architecture, refer to https:/...

Understanding the Kappa Architecture


The Kappa Architecture is simpler than the Lambda pattern, as it comprises only the Speed and Serving Layers. All computations occur as stream processing, and there are no batch recomputations over the full dataset. Recomputations are only done to support changes and new requirements.

Typically, the incoming real-time data stream is processed in memory and persisted in a database or HDFS to support queries, as illustrated in the following figure:

The Kappa Architecture can be realized by using Apache Spark combined with a queuing solution, such as Apache Kafka. If the required data retention period ranges from a few days to a few weeks, Kafka can also be used to retain the data for that limited period of time.
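A minimal sketch of this combination, assuming a hypothetical events topic, schema, and HDFS paths: a single Structured Streaming job reads from Kafka and persists its results for the Serving Layer, with recomputations handled by replaying the retained Kafka history rather than by a separate batch job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object KappaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("KappaSketch").getOrCreate()

    // Hypothetical schema of the JSON records on the topic
    val schema = new StructType()
      .add("timestamp", TimestampType)
      .add("level", StringType)
      .add("message", StringType)

    // Single stream-processing path: everything is treated as a stream
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      // first run starts from the retained history; a recomputation replays it with a fresh checkpoint
      .option("startingOffsets", "earliest")
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("e"))
      .select("e.*")

    // Serving Layer: persist processed results to HDFS/S3 (or a database) for queries
    events.writeStream
      .format("parquet")
      .option("path", "hdfs:///serving/events/")
      .option("checkpointLocation", "hdfs:///checkpoints/kappa/")
      .start()
      .awaitTermination()
  }
}
```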

In the next few sections, we will introduce a few hands-on exercises using Apache Spark, Scala, and Apache Kafka that are very useful in the context of real-world application development. We will start by using Spark SQL and Structured Streaming to implement...

Design considerations for building scalable stream processing applications


Building robust stream processing applications is challenging. The typical challenges associated with stream processing include the following:

  • Complex data: Diverse data formats and the quality of the data create significant challenges for streaming applications. Typically, the data is available in various formats, such as JSON, CSV, AVRO, and binary. Additionally, dirty data, late-arriving data, and out-of-order data can make the design of such applications extremely complex.
  • Complex workloads: Streaming applications need to support a diverse set of application requirements, including interactive queries, machine learning pipelines, and so on.
  • Complex systems: With diverse systems, including Kafka, S3, Kinesis, and so on, system failures can lead to significant reprocessing or bad results.

Stream processing using Spark SQL can be fast, scalable, and fault-tolerant. It provides an extensive set of high-level APIs to deal with complex data and workloads...
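For instance, late-arriving and out-of-order records can be handled declaratively with event-time windows and watermarks in Structured Streaming. The following sketch assumes a hypothetical readings topic, schema, and window/watermark durations:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object LateDataSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("LateDataSketch").getOrCreate()

    // Hypothetical schema of device readings arriving as JSON
    val schema = new StructType()
      .add("eventTime", TimestampType)
      .add("device", StringType)
      .add("reading", DoubleType)

    val readings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "readings")
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("r"))
      .select("r.*")

    // Event-time windows plus a watermark: records arriving up to 10 minutes late
    // are still folded into the correct window; anything later is dropped.
    val windowed = readings
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window(col("eventTime"), "5 minutes"), col("device"))
      .agg(avg("reading").alias("avg_reading"))

    windowed.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```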

Building robust ETL pipelines using Spark SQL


ETL pipelines execute a series of transformations on source data to produce cleansed, structured, and ready-for-use output for subsequent processing components. The transformations to be applied to the source data depend on the nature of that data. The input or source data can be structured (RDBMS, Parquet, and so on), semi-structured (CSV, JSON, and so on), or unstructured (text, audio, video, and so on). After being processed through such pipelines, the data is ready for downstream data processing, modeling, analytics, reporting, and so on.

The following figure illustrates an application architecture in which the input data from Kafka, and other sources such as application and server logs, is cleansed and transformed (using an ETL pipeline) before being stored in an enterprise data store. This data store can eventually feed other applications (via Kafka), support interactive queries, store subsets or views of the data in serving databases, train...
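A simplified batch variant of such a pipeline might look like the following sketch, in which the S3 paths and column names (timestamp, level, request_id) are hypothetical: semi-structured JSON logs are extracted, cleansed and normalized, and loaded as partitioned Parquet into the data store:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EtlPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("EtlPipelineSketch").getOrCreate()

    // Extract: semi-structured server logs landed on S3 as JSON
    val raw = spark.read.json("s3a://landing-zone/server-logs/2017/*/")

    // Transform: cleanse and normalize into a structured, analysis-ready form
    val cleansed = raw
      .filter(col("timestamp").isNotNull)                    // drop malformed records
      .withColumn("event_time", to_timestamp(col("timestamp")))
      .withColumn("level", upper(trim(col("level"))))
      .dropDuplicates("request_id")                          // de-duplicate retried events
      .drop("timestamp")

    // Load: write partitioned Parquet into the enterprise data store
    cleansed.write
      .mode("append")
      .partitionBy("level")
      .parquet("s3a://datalake/logs/cleansed/")
  }
}
```

The same transformation logic can be reused in a streaming version of the pipeline by swapping the batch read/write for readStream/writeStream against Kafka, as shown in the earlier sketches.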

Implementing a scalable monitoring solution


Building a scalable monitoring function for large-scale deployments can be challenging as there could be billions of data points captured each day. Additionally, the volume of data and the number of metrics can be difficult to manage without a suitable big data platform with streaming and visualization support.

Voluminous logs collected from applications, servers, network devices, and so on are processed to provide real-time monitoring that helps detect errors, warnings, failures, and other issues. Typically, various daemons, services, and tools are used to collect/send log records to the monitoring system. For example, log entries in the JSON format can be sent to Kafka queues or Amazon Kinesis. These JSON records can then be stored on S3 as files and/or streamed to be analyzed in real time (in a Lambda architecture implementation). Typically, an ETL pipeline is run to cleanse the log data, transform it into a more structured form, and then load it into...
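The real-time path of such a monitoring solution could be sketched as follows, assuming a hypothetical Kafka topic and log schema: JSON log records are parsed, and warning and error counts are aggregated per host over one-minute event-time windows:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object MonitoringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MonitoringSketch").getOrCreate()

    // Hypothetical schema of the JSON log entries
    val logSchema = new StructType()
      .add("ts", TimestampType)
      .add("host", StringType)
      .add("level", StringType)     // e.g. INFO, WARN, ERROR
      .add("message", StringType)

    // Speed path: JSON log records arriving on a Kafka topic
    val logs = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "monitoring-logs")
      .load()
      .select(from_json(col("value").cast("string"), logSchema).alias("log"))
      .select("log.*")

    // Count warnings and errors per host in 1-minute event-time windows
    val alerts = logs
      .filter(col("level").isin("WARN", "ERROR"))
      .withWatermark("ts", "5 minutes")
      .groupBy(window(col("ts"), "1 minute"), col("host"), col("level"))
      .count()

    // Push results to a dashboard or serving store; the console sink is used here for brevity
    alerts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```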

Deploying Spark machine learning pipelines


The following figure illustrates a machine learning pipeline at a conceptual level. However, real-life ML pipelines are a lot more complicated, with several models being trained, tuned, combined, and so on:

The next figure shows the core elements of a typical machine learning application split into two parts: the modeling, including model training, and the deployed model (used on streaming data to output the results):

Typically, data scientists experiment or do their modeling work in Python and/or R. Their work is then reimplemented in Java/Scala before deployment in a production environment. Enterprise production environments often consist of web servers, application servers, databases, middleware, and so on. The conversion of prototypical models to production-ready models results in additional design and development effort that leads to delays in rolling out updated models.

We can use Spark MLlib 2.x model serialization to directly use the models and pipelines...
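The sketch below illustrates the idea with a simple text-classification pipeline (the storage paths and the text/label column names are hypothetical): the fitted pipeline is persisted once and then loaded, unchanged, by the production scoring application:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object MlPipelineDeploySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MlPipelineDeploySketch").getOrCreate()

    // --- Modeling side (typically run by data scientists) ---
    // Hypothetical training data with "text" and "label" columns
    val training = spark.read.parquet("s3a://datalake/training/labeled_text/")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val model: PipelineModel = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
      .fit(training)

    // Persist the fitted pipeline so it can be reused without reimplementation
    model.write.overwrite().save("s3a://models/text-classifier/v1")

    // --- Deployment side (production scoring application) ---
    val deployed = PipelineModel.load("s3a://models/text-classifier/v1")
    val incoming = spark.read.parquet("s3a://datalake/incoming/text/")
    deployed.transform(incoming).select("text", "prediction").show(5)
  }
}
```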

Using cluster managers


In this section, we will briefly discuss cluster managers at a conceptual level. Spark can be deployed through Apache Mesos, YARN, Spark Standalone, or the Kubernetes cluster manager, as depicted:

Mesos can enable easy scalability and replication of data, and is a good unified cluster management solution for heterogeneous workloads.

To use Mesos from Spark, the Spark binaries should be accessible by Mesos and the Spark driver configured to connect to Mesos. Alternatively, you can also install Spark binaries on all the Mesos slaves. The driver creates a job and then issues the tasks for scheduling, while Mesos determines the machines to handle them.

Spark can run over Mesos in two modes: coarse-grained (the default) and fine-grained (deprecated in Spark 2.0.0). In the coarse-grained mode, each Spark executor runs as a single Mesos task. This mode has significantly lower startup overheads, but reserves Mesos resources for the duration of the application. Mesos also supports...
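A minimal configuration sketch for the coarse-grained mode (the Mesos master URL, executor URI, and core limit below are hypothetical) could look like this:

```scala
import org.apache.spark.sql.SparkSession

object MesosConfigSketch {
  def main(args: Array[String]): Unit = {
    // Coarse-grained mode (the default): each executor runs as a single Mesos task.
    // Cap the resources reserved for this application so Mesos can share the cluster.
    val spark = SparkSession.builder
      .appName("MesosConfigSketch")
      .master("mesos://zk://mesos-master:2181/mesos")            // Mesos master via ZooKeeper
      .config("spark.executor.uri",
        "hdfs:///frameworks/spark/spark-2.2.0-bin-hadoop2.7.tgz") // Spark binaries accessible to Mesos
      .config("spark.mesos.coarse", "true")
      .config("spark.cores.max", "8")                            // upper bound on reserved cores
      .getOrCreate()

    // Trivial job to verify the deployment
    spark.range(1000).selectExpr("sum(id)").show()
    spark.stop()
  }
}
```

Alternatively, the same settings can be passed on the command line with spark-submit --master and --conf options instead of being hardcoded in the application.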

Summary


In this chapter, we presented several Spark SQL-based application architectures for building highly-scalable applications. We explored the main concepts and challenges in batch processing and stream processing. We discussed the features of Spark SQL that can help in building robust ETL pipelines. We also presented some code towards building a scalable monitoring application. Additionally, we explored an efficient deployment technique for machine learning pipelines, and some basic concepts involved in using cluster managers such as Mesos and Kubernetes.

In conclusion, this book attempts to help you build a strong foundation in Spark SQL and Scala. However, there are still many areas that you can explore in greater depth to build deeper expertise. Depending on your specific domain, the nature of data and problems could vary widely and your approach to solving them would typically encompass one or more areas described in this book. However, in all cases EDA and data munging skills will...
