Frank Kane's Taming Big Data with Apache Spark and Python

Frank Kane’s hands-on Spark training course, based on his bestselling Taming Big Data with Apache Spark and Python video, now available in a book. Understand and analyze large data sets using Spark on a single system or on a cluster.

Frank Kane


Book Details

ISBN 13: 9781787287945
Paperback: 296 pages

Book Description

Frank Kane’s Taming Big Data with Apache Spark and Python is your companion to learning Apache Spark in a hands-on manner. Frank starts by teaching you how to set up Spark on a single system or on a cluster, and you’ll soon move on to analyzing large data sets using Spark RDDs and to developing and running effective Spark jobs quickly using Python.

Apache Spark has emerged as the next big thing in the Big Data domain, rising from an up-and-coming technology to an established superstar in just a matter of years. Spark lets you quickly extract actionable insights from large amounts of data in real time, making it an essential tool in many modern businesses.

Frank has packed this book with over 15 interactive, fun-filled examples relevant to the real world, and he will empower you to understand the Spark ecosystem and implement production-grade real-time Spark projects with ease.

Table of Contents

Chapter 1: Getting Started with Spark
  • Getting set up - installing Python, a JDK, and Spark and its dependencies
  • Installing the MovieLens movie rating dataset
  • Run your first Spark program - the ratings histogram example
  • Summary
Chapter 2: Spark Basics and Spark Examples
  • What is Spark?
  • The Resilient Distributed Dataset (RDD)
  • Ratings histogram walk-through
  • Key/value RDDs and the average friends by age example
  • Running the average friends by age example
  • Filtering RDDs and the minimum temperature by location example
  • Running the minimum temperature example and modifying it for maximums
  • Running the maximum temperature by location example
  • Counting word occurrences using flatmap()
  • Improving the word-count script with regular expressions
  • Sorting the word count results
  • Find the total amount spent by customer
  • Check your results and sort them by the total amount spent
  • Check your sorted implementation and results against mine
  • Summary
Chapter 3: Advanced Examples of Spark Programs
  • Finding the most popular movie
  • Using broadcast variables to display movie names instead of ID numbers
  • Finding the most popular superhero in a social graph
  • Running the script - discover who the most popular superhero is
  • Superhero degrees of separation - introducing the breadth-first search algorithm
  • Accumulators and implementing BFS in Spark
  • Superhero degrees of separation - review the code and run it
  • Item-based collaborative filtering in Spark, cache(), and persist()
  • Running the similar-movies script using Spark's cluster manager
  • Improving the quality of the similar movies example
  • Summary
Chapter 4: Running Spark on a Cluster
  • Introducing Elastic MapReduce
  • Setting up our Amazon Web Services / Elastic MapReduce account and PuTTY
  • Partitioning
  • Creating similar movies from one million ratings - part 1
  • Creating similar movies from one million ratings - part 2
  • Creating similar movies from one million ratings - part 3
  • Troubleshooting Spark on a cluster
  • More troubleshooting and managing dependencies
  • Summary
Chapter 5: SparkSQL, DataFrames, and DataSets
  • Introducing SparkSQL
  • Executing SQL commands and SQL-style functions on a DataFrame
  • Using DataFrames instead of RDDs
  • Summary
Chapter 6: Other Spark Technologies and Libraries
  • Introducing MLlib
  • Using MLlib to produce movie recommendations
  • Analyzing the ALS recommendations results
  • Using DataFrames with MLlib
  • Spark Streaming and GraphX
  • Summary
Chapter 7: Where to Go From Here? – Learning More About Spark and Data Science

What You Will Learn

  • Find out how you can identify Big Data problems as Spark problems
  • Install and run Apache Spark on your computer or on a cluster
  • Analyze large data sets across many CPUs using Spark’s Resilient Distributed Datasets
  • Implement machine learning on Spark using the MLlib library
  • Process continuous streams of data in real time using the Spark Streaming module
  • Perform complex network analysis using Spark’s GraphX library
  • Use Amazon's Elastic MapReduce service to run your Spark jobs on a cluster
