Apache Spark 2.x Cookbook

Over 70 recipes to help you use Apache Spark as your single big data computing platform and master its libraries

Rishi Yadav

Book Details

ISBN 13: 9781787127265
Paperback: 294 pages

Book Description

While Apache Spark 1.x gained a lot of traction and adoption in its early years, Spark 2.x delivers notable improvements in its API, schema awareness, performance, and Structured Streaming, and it simplifies the building blocks for creating better, faster, smarter, and more accessible big data applications. This book covers all these features in the form of structured recipes for analyzing large and complex sets of data.

Starting with installing and configuring Apache Spark with various cluster managers, you will learn to set up development environments. You will then be introduced to working with RDDs, DataFrames, and Datasets to operate on schema-aware data, and to real-time streaming with various sources such as Twitter Stream and Apache Kafka. You will also work through recipes on machine learning, including supervised learning, unsupervised learning, and recommendation engines in Spark.

Last but not least, the final few chapters delve deeper into the concepts of graph processing using GraphX, securing your implementations, cluster optimization, and troubleshooting.

Table of Contents

Chapter 1: Getting Started with Apache Spark
Introduction
Leveraging Databricks Cloud
Deploying Spark using Amazon EMR
Installing Spark from binaries
Building the Spark source code with Maven
Launching Spark on Amazon EC2
Deploying Spark on a cluster in standalone mode
Deploying Spark on a cluster with Mesos
Deploying Spark on a cluster with YARN
Understanding SparkContext and SparkSession
Understanding resilient distributed dataset - RDD
Chapter 2: Developing Applications with Spark
Introduction
Exploring the Spark shell
Developing a Spark application in Eclipse with Maven
Developing a Spark application in Eclipse with SBT
Developing a Spark application in IntelliJ IDEA with Maven
Developing a Spark application in IntelliJ IDEA with SBT
Developing applications using the Zeppelin notebook
Setting up Kerberos to do authentication
Enabling Kerberos authentication for Spark
Chapter 3: Spark SQL
Understanding the evolution of schema awareness
Understanding the Catalyst optimizer
Inferring schema using case classes
Programmatically specifying the schema
Understanding the Parquet format
Loading and saving data using the JSON format
Loading and saving data from relational databases
Loading and saving data from an arbitrary source
Understanding joins
Analyzing nested structures
Chapter 4: Working with External Data Sources
Introduction
Loading data from the local filesystem
Loading data from HDFS
Loading data from Amazon S3
Loading data from Apache Cassandra
Chapter 5: Spark Streaming
Introduction
WordCount using Structured Streaming
Taking a closer look at Structured Streaming
Streaming Twitter data
Streaming using Kafka
Understanding streaming challenges
Chapter 6: Getting Started with Machine Learning
Introduction
Creating vectors
Calculating correlation
Understanding feature engineering
Understanding Spark ML
Understanding hyperparameter tuning
Chapter 7: Supervised Learning with MLlib — Regression
Introduction
Using linear regression
Understanding the cost function
Doing linear regression with lasso
Doing ridge regression
Chapter 8: Supervised Learning with MLlib — Classification
Introduction
Doing classification using logistic regression
Doing binary classification using SVM
Doing classification using decision trees
Doing classification using random forest
Doing classification using gradient boosted trees
Doing classification with Naïve Bayes
Chapter 9: Unsupervised Learning
Introduction
Clustering using k-means
Dimensionality reduction with principal component analysis
Dimensionality reduction with singular value decomposition
Chapter 10: Recommendations Using Collaborative Filtering
Introduction
Collaborative filtering using explicit feedback
Collaborative filtering using implicit feedback
Chapter 11: Graph Processing Using GraphX and GraphFrames
Introduction
Fundamental operations on graphs
Using PageRank
Finding connected components
Performing neighborhood aggregation
Understanding GraphFrames
Chapter 12: Optimizations and Performance Tuning
Optimizing memory
Leveraging speculation
Optimizing joins
Using compression to improve performance
Using serialization to improve performance
Optimizing the level of parallelism
Understanding project Tungsten

What You Will Learn

  • Install and configure Apache Spark with various cluster managers and on AWS
  • Set up a development environment for Apache Spark, including the Databricks Cloud notebook
  • Find out how to operate on data in Spark with schemas
  • Get to grips with real-time streaming analytics using Spark Streaming and Structured Streaming
  • Master supervised learning and unsupervised learning using MLlib
  • Build a recommendation engine using MLlib
  • Process graphs using the GraphX and GraphFrames libraries
  • Develop a set of common applications or project types, and solutions that solve complex big data problems
