Mastering Apache Spark 2.x - Second Edition

Advanced analytics on your Big Data with latest Apache Spark 2.x

Romeo Kienzler

Book Details

ISBN 13: 9781786462749
Paperback: 369 pages

Book Description

Apache Spark is an in-memory, cluster-based parallel processing system that provides a wide range of functionality, such as graph processing, machine learning, stream processing, and SQL. This book aims to take your basic knowledge of Spark to the next level by teaching you how to extend Spark's functionality and implement your data flows and machine/deep learning programs on top of the platform.

The book commences with an overview of the Spark ecosystem. It then introduces Project Tungsten and Catalyst, two of the major advancements in Apache Spark 2.x. You will see how memory management and binary processing, cache-aware computation, and code generation are used to speed up execution. The book goes on to show how to incorporate H2O, SystemML, and Deeplearning4j for machine learning, and Jupyter Notebooks and Kubernetes/Docker for cloud-based Spark. Along the way, you will learn about the latest enhancements in Apache Spark 2.x, such as interactive querying of live data and the unification of DataFrames and Datasets.

You will also learn about updates to the Accumulator API and the DataFrame-based ML API. You will learn to use Spark as a compiler and to implement Structured Streaming, and see how Spark can be used in day-to-day tasks.

Table of Contents

Chapter 1: A First Taste and What’s New in Apache Spark V2
Spark machine learning
Spark Streaming
Spark SQL
Spark graph processing
Extended ecosystem
What’s new in Apache Spark V2?
Cluster design
Cluster management
Cloud based deployments
Performance
Cloud
Summary
Chapter 2: Apache Spark SQL
The SparkSession - your gateway into structured data processing
Importing and saving data
Understanding the DataSource API
DataFrames
Using SQL
Using Datasets
User-defined functions
RDDs vs DataFrames vs Datasets
Summary
Chapter 3: The Catalyst Optimizer
Understanding the working of the Catalyst optimizer
Managing temporary views with the Catalog API
The SQL Abstract Syntax Tree (AST)
How to go from Unresolved Logical Execution Plan (ULEP) to Resolved Logical Execution Plan
Code generation
Summary
Chapter 4: Project Tungsten
Memory management beyond the Java Virtual Machine (JVM) Garbage Collector (GC)
Cache friendly layout of data in memory
Code generation
Summary

What You Will Learn

  • Advanced Machine Learning and Deep Learning with MLlib, SparkML, SystemML, H2O and DeepLearning4J
  • Highly optimised unified batch and real-time data processing using SparkSQL and Structured Streaming
  • Large-scale Graph Processing and Analysis using GraphX and GraphFrames
  • Elastic deployments using Jupyter and Zeppelin Notebooks, Docker, Kubernetes and the IBM Cloud
