Learning Spark SQL

Design, implement, and deliver successful streaming applications, machine learning pipelines and graph applications using Spark SQL API

Aurobindo Sarkar



Book Details

ISBN 13: 9781785888359
Paperback: 452 pages

Book Description

In the past year, Apache Spark has been increasingly adopted for the development of distributed applications. Spark SQL APIs provide an optimized interface that helps developers build such applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. Understanding design and implementation best practices before you start your project will help you avoid common pitfalls.

This book offers insight into the engineering practices used to design and build real-world, Spark-based applications. Its hands-on examples will give you the confidence to take on future Spark SQL projects.

The book starts by familiarizing you with data exploration and data munging tasks using Spark SQL and Scala. Extensive code examples will help you understand the methods used to implement typical use cases for various types of applications. You will get a walkthrough of the key concepts and terms that are common to streaming, machine learning, and graph applications. You will also learn key performance-tuning details for Spark SQL applications, including cost-based optimization (Spark 2.2). Finally, you will learn how such systems are architected and deployed so that you can deliver your project successfully.

Table of Contents

Chapter 1: Getting Started with Spark SQL
  • What is Spark SQL?
  • Introducing SparkSession
  • Understanding Spark SQL concepts
  • Using Spark SQL in streaming applications
  • Summary
Chapter 2: Using Spark SQL for Processing Structured and Semistructured Data
  • Understanding data sources in Spark applications
  • Using Spark with relational databases
  • Using Spark with MongoDB (NoSQL database)
  • Using Spark with JSON data
  • Using Spark with Avro files
  • Using Spark with Parquet files
  • Defining and using custom data sources in Spark
  • Summary
Chapter 3: Using Spark SQL for Data Exploration
  • Introducing Exploratory Data Analysis (EDA)
  • Using Spark SQL for basic data analysis
  • Visualizing data with Apache Zeppelin
  • Sampling data with Spark SQL APIs
  • Using Spark SQL for creating pivot tables
  • Summary
Chapter 4: Using Spark SQL for Data Munging
  • Introducing data munging
  • Exploring data munging techniques
  • Munging textual data
  • Munging time series data
  • Dealing with variable length records
  • Preparing data for machine learning
  • Summary
Chapter 5: Using Spark SQL in Streaming Applications
  • Introducing streaming data applications
  • Building Spark streaming applications
  • Using Kafka with Spark Structured Streaming
  • Writing a receiver for a custom data source
  • Summary
Chapter 6: Using Spark SQL in Machine Learning Applications
  • Introducing machine learning applications
  • Introducing feature engineering
  • Implementing a Spark ML classification model
  • Introducing Spark ML tools and utilities
  • Implementing a Spark ML clustering model
  • Summary
Chapter 7: Using Spark SQL in Graph Applications
  • Introducing large-scale graph applications
  • Exploring graphs using GraphFrames
  • Analyzing JSON input modeled as a graph
  • Processing graphs containing multiple types of relationships
  • Understanding GraphFrame internals
  • Summary
Chapter 8: Using Spark SQL with SparkR
  • Introducing SparkR
  • Understanding the SparkR architecture
  • Understanding SparkR DataFrames
  • Using SparkR for EDA and data munging tasks
  • Using SparkR for computing summary statistics
  • Using SparkR for data visualization
  • Using SparkR for machine learning
  • Summary
Chapter 9: Developing Applications with Spark SQL
  • Introducing Spark SQL applications
  • Understanding text analysis applications
  • Understanding themes in document corpuses
  • Using Naive Bayes classifiers
  • Developing a machine learning application
  • Summary
Chapter 10: Using Spark SQL in Deep Learning Applications
  • Introducing neural networks
  • Introducing deep learning in Spark
  • Understanding Supervised learning
  • Using deep neural networks for language processing
  • Introducing autoencoders
  • Summary
Chapter 11: Tuning Spark SQL Components for Performance
  • Introducing performance tuning in Spark SQL
  • Understanding DataFrame/Dataset APIs
  • Understanding Catalyst optimizations
  • Visualizing Spark application execution
  • Cost-based optimizer in Apache Spark 2.2
  • Understanding multi-way JOIN ordering optimization
  • Understanding performance improvements using whole-stage code generation
  • Summary
Chapter 12: Spark SQL in Large-Scale Application Architectures
  • Understanding Spark-based application architectures
  • Understanding the Lambda architecture
  • Understanding the Kappa Architecture
  • Design considerations for building scalable stream processing applications
  • Building robust ETL pipelines using Spark SQL
  • Implementing a scalable monitoring solution
  • Deploying Spark machine learning pipelines
  • Using cluster managers
  • Summary

What You Will Learn

  • Familiarize yourself with Spark SQL programming, including working with DataFrame/Dataset API and SQL
  • Perform a series of hands-on exercises with different types of data sources, including CSV, JSON, Avro, MySQL, and MongoDB
  • Perform data quality checks, data visualization, and basic statistical analysis tasks
  • Perform data munging tasks on publicly available datasets
  • Learn how to use Spark SQL and Apache Kafka to build streaming applications
  • Learn key performance-tuning tips and tricks in Spark SQL applications
  • Learn key architectural components and patterns in large-scale Spark SQL applications
