Taming Big Data with Apache Spark and Python - Hands On! [Video]
Getting Started with Spark (Free Chapter)
Spark Basics and Simple Examples
- What's new in Spark 3?
- Introduction to Spark
- The Resilient Distributed Dataset (RDD)
- Ratings Histogram Walkthrough
- Key/Value RDDs and the Average Friends by Age Example
- Running the Average Friends by Age Example
- Filtering RDDs and the Minimum Temperature by Location Example
- Running the Minimum Temperature Example and Modifying It for Maximums
- Running the Maximum Temperature by Location Example
- Counting Word Occurrences Using flatMap()
- Improving the Word Count Script with Regular Expressions
- Sorting the Word Count Results
- Find the Total Amount Spent by Customer
- Check Your Results, and Now Sort Them by Total Amount Spent
- Check Your Sorted Implementation and Results Against Mine
Advanced Examples of Spark Programs
- Find the Most Popular Movie
- Use Broadcast Variables to Display Movie Names Instead of ID Numbers
- Find the Most Popular Superhero in a Social Graph
- Run the Script - Discover Who the Most Popular Superhero is!
- Superhero Degrees of Separation - Introducing Breadth-First Search
- Superhero Degrees of Separation - Accumulators and Implementing BFS in Spark
- Superhero Degrees of Separation - Review the Code and Run it
- Item-Based Collaborative Filtering in Spark, cache(), and persist()
- Running the Similar Movies Script Using Spark's Cluster
- Improve the Quality of Similar Movies
Running Spark on a Cluster
- Introducing Elastic MapReduce
- Setting Up Your AWS / Elastic MapReduce Account and PuTTY
- Partitioning
- Create Similar Movies from One Million Ratings - Part 1
- Create Similar Movies from One Million Ratings - Part 2
- Create Similar Movies from One Million Ratings - Part 3
- Troubleshooting Spark on a Cluster
- More Troubleshooting and Managing Dependencies
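The troubleshooting and dependency-management lessons center on spark-submit and its flags. An illustrative invocation, where the script and archive names are placeholders rather than the course's actual files:

```shell
# Placeholder names; adjust memory and files for your own job.
# --executor-memory : per-executor memory, a common first knob when tasks fail
# --py-files        : ships extra Python modules to every executor
spark-submit --executor-memory 1g --py-files dependencies.zip MovieSimilarities1M.py
```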
SparkSQL, DataFrames, and DataSets
Other Spark Technologies and Libraries
You Made It! Where to Go from Here.
"Big data" analysis is a hot and highly valuable skill – and this course will teach you the hottest technology in big data: Apache Spark. The course has been updated for Spark 3. Employers including Amazon, eBay, NASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You'll learn those same techniques, using your own Windows system right at home. It's easier than you might think.
Learn and master the art of framing data analysis problems as Spark problems through over 15 hands-on examples, and then scale them up to run on cloud computing services in this course. You'll be learning from an ex-engineer and senior manager from Amazon and IMDb.
• Learn the concepts of Spark's Resilient Distributed Datasets (RDDs)
• Develop and run Spark jobs quickly using Python
• Translate complex analysis problems into iterative or multi-stage Spark scripts
• Scale up to larger data sets using Amazon's Elastic MapReduce service
• Understand how Hadoop YARN distributes Spark across computing clusters
• Learn about other Spark technologies, like Spark SQL, Spark Streaming, and GraphX
By the end of this course, you'll be running code that analyzes gigabytes worth of information – in the cloud – in a matter of minutes.
All the code and supporting files for this course are available at https://github.com/PacktPublishing/Taming-Big-Data-with-Apache-Spark-and-Python---Hands-On-..
- Publication date: September 2016
- Publisher: Packt
- Duration: 5 hours 29 minutes
- ISBN: 9781787129931