Fast Data Processing with Spark 2 - Third Edition

Learn how to use Spark to process big data at speed and scale for sharper analytics. Put the principles into practice for faster, slicker big data projects.

Krishna Sankar

Book Details

ISBN 13: 9781785889271
Paperback: 274 pages

Book Description

When people want a way to process Big Data at speed, Spark is invariably the solution. With its ease of development (in comparison to the relative complexity of Hadoop), it’s unsurprising that it’s becoming popular with data analysts and engineers everywhere.

Beginning with the fundamentals, we’ll show you how to get set up with Spark with minimum fuss. You’ll then get to grips with some simple APIs before investigating machine learning and graph processing. Throughout, we’ll make sure you know exactly how to apply your knowledge.

You will also learn how to use the Spark shell and how to load data, before finding out how to build and run your own Spark applications. Discover how to manipulate your RDD and get stuck into a range of DataFrame APIs. You’ll also learn some useful machine learning algorithms with the help of Spark MLlib, and how to integrate Spark with R. We’ll also make sure you’re confident and prepared for graph processing as you learn more about the GraphX API.

Table of Contents

Chapter 1: Installing Spark and Setting Up Your Cluster
  • Directory organization and convention
  • Installing the prebuilt distribution
  • Building Spark from source
  • Spark topology
  • A single machine
  • Running Spark on EC2
  • Deploying Spark with Chef (Opscode)
  • Deploying Spark on Mesos
  • Spark on YARN
  • Spark standalone mode
  • References
  • Summary
Chapter 2: Using the Spark Shell
  • The Spark shell
  • Loading a simple text file
  • Interactively loading data from S3
  • Summary
Chapter 3: Building and Running a Spark Application
  • Building Spark applications
  • Data wrangling with iPython
  • Developing Spark with Eclipse
  • Developing Spark with other IDEs
  • Building your Spark job with Maven
  • Building your Spark job with something else
  • References
  • Summary
Chapter 4: Creating a SparkSession Object
  • SparkSession versus SparkContext
  • Building a SparkSession object
  • SparkContext - metadata
  • Shared Java and Scala APIs
  • Python
  • iPython
  • Reference
  • Summary
Chapter 5: Loading and Saving Data in Spark
  • Spark abstractions
  • Data modalities
  • Data modalities and Datasets/DataFrames/RDDs
  • Loading data into an RDD
  • Saving your data
  • References
  • Summary
Chapter 6: Manipulating Your RDD
  • Manipulating your RDD in Scala and Java
  • Manipulating your RDD in Python
  • References
  • Summary
Chapter 7: Spark 2.0 Concepts
  • Code and Datasets for the rest of the book
  • The data scientist and Spark features
  • Spark v2.0 and beyond
  • Apache Spark - evolution
  • Apache Spark - the full stack
  • The art of a big data store - Parquet
  • References
  • Summary
Chapter 8: Spark SQL
  • The Spark SQL architecture
  • Spark SQL how-to in a nutshell
  • Spark SQL programming
  • References
  • Summary
Chapter 9: Foundations of Datasets/DataFrames – The Proverbial Workhorse for Data Scientists
  • Datasets - a quick introduction
  • Dataset APIs - an overview
  • Dataset interfaces and functions
  • References
  • Summary
Chapter 10: Spark with Big Data
  • Parquet - an efficient and interoperable big data format
  • HBase
  • Reference
  • Summary
Chapter 11: Machine Learning with Spark ML Pipelines
  • Spark's machine learning algorithm table
  • Spark machine learning APIs - ML pipelines and MLlib
  • ML pipelines
  • Spark ML examples
  • The API organization
  • Basic statistics
  • Linear regression
  • Classification
  • Clustering
  • Recommendation
  • Hyper parameters
  • The final thing
  • References
  • Summary
Chapter 12: GraphX
  • Graphs and graph processing - an introduction
  • Spark GraphX
  • GraphX - computational model
  • The first example - graph
  • Building graphs
  • The GraphX API landscape
  • Structural APIs
  • Community, affiliation, and strengths
  • Algorithms
  • Partition strategy
  • Case study - AlphaGo tweets analytics
  • References
  • Summary

What You Will Learn

  • Install and set up Spark in your cluster
  • Prototype distributed applications with Spark's interactive shell
  • Perform data wrangling using the new DataFrame APIs
  • Get to know the different ways to interact with Spark's distributed representation of data (RDDs)
  • Query Spark with a SQL-like query syntax
  • See how Spark works with Big Data
  • Implement machine learning systems with highly scalable algorithms
  • Use R, the popular statistical language, to work with Spark
  • Apply interesting graph algorithms and graph processing with GraphX
