
Fast Data Processing with Spark

Holden Karau

Spark offers a streamlined way to write distributed programs, and this tutorial gives you, as a software developer, the know-how to make the most of Spark’s many great features, adding an extra string to your bow.
RRP $22.99 (eBook)
RRP $37.99 (Print + eBook)


Book Details

ISBN 13: 9781782167068
Paperback: 120 pages

About This Book

  • Use Spark's interactive shell to prototype distributed applications
  • Deploy Spark jobs to clusters such as Mesos, YARN, EC2, and EMR, or provision machines with Chef
  • Use Shark's SQL-like query syntax with Spark

Who This Book Is For

Fast Data Processing with Spark is for software developers who want to learn how to write distributed programs with Spark. It will help developers who have faced problems too large to handle on a single computer. No previous experience with distributed programming is necessary. This book assumes knowledge of either Java, Scala, or Python.

Table of Contents

Chapter 1: Installing Spark and Setting Up Your Cluster
Running Spark on a single machine
Running Spark on EC2
Deploying Spark on Elastic MapReduce
Deploying Spark with Chef (opscode)
Deploying Spark on Mesos
Deploying Spark on YARN
Deploying a set of machines over SSH
Links and references
Chapter 2: Using the Spark Shell
Loading a simple text file
Using the Spark shell to run logistic regression
Interactively loading data from S3
Chapter 3: Building and Running a Spark Application
Building your Spark project with sbt
Building your Spark job with Maven
Building your Spark job with something else
Chapter 4: Creating a SparkContext
Shared Java and Scala APIs
Links and references
Chapter 5: Loading and Saving Data in Spark
Loading data into an RDD
Saving your data
Links and references
Chapter 6: Manipulating Your RDD
Manipulating your RDD in Scala and Java
Manipulating your RDD in Python
Links and references
Chapter 7: Shark – Using Spark with Hive
Why Hive/Shark?
Installing Shark
Running Shark
Loading data
Using Hive queries in a Spark program
Links and references
Chapter 8: Testing
Testing in Java and Scala
Testing in Python
Links and references
Chapter 9: Tips and Tricks
Where to find logs?
Concurrency limitations
Memory usage and garbage collection
IDE integration
Using Spark with other languages
A quick note on security
Mailing lists
Links and references

What You Will Learn

  • Prototype distributed applications with Spark's interactive shell
  • Learn different ways to interact with Spark's distributed representation of data (RDDs)
  • Load data from various data sources
  • Query Spark with a SQL-like query syntax
  • Integrate Shark queries with Spark programs
  • Effectively test your distributed software
  • Tune a Spark installation
  • Install and set up Spark on your cluster
  • Work effectively with large data sets

In Detail

Spark is a framework for writing fast, distributed programs. Spark solves problems similar to those addressed by Hadoop MapReduce, but with a fast in-memory approach and a clean, functional-style API. With its ability to integrate with Hadoop, and its inbuilt tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be used interactively to quickly process and query big data sets.
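As a rough, local-only illustration of that functional style, here is a word count expressed as a chain of transformations. Plain Python stands in for Spark's RDD operations (flatMap, map, reduceByKey) purely to show the shape of the pipeline; no cluster or Spark installation is assumed.

```python
# A word count as a chain of functional transformations, mirroring
# the shape of a Spark flatMap -> map -> reduceByKey pipeline.
# Plain Python stands in for the RDD API; no cluster is required.
lines = ["to be or not to be", "to be is to do"]

# flatMap: split each line into individual words
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts["to"], counts["be"])  # 4 3
```

In actual Spark code the same steps run in parallel across a cluster, but the program reads almost identically, which is much of the API's appeal.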

Fast Data Processing with Spark covers how to write distributed MapReduce-style programs with Spark. The book will guide you through every step required to write effective distributed programs, from setting up your cluster and interactively exploring the API to deploying your job to the cluster and tuning it for your purposes.

Fast Data Processing with Spark covers everything from setting up your Spark cluster in a variety of situations (stand-alone, EC2, and so on) to using the interactive shell to write distributed code. From there, we move on to writing and deploying distributed jobs in Java, Scala, and Python.

We then examine how to use the interactive shell to quickly prototype distributed programs and explore the Spark API. We also look at how to use Hive with Spark through Shark's SQL-like query syntax, as well as how to manipulate resilient distributed datasets (RDDs).
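To give a flavor of that SQL-versus-RDD contrast without a Shark installation, the sketch below answers the same question both ways. Python's built-in sqlite3 stands in for Shark's query engine, and the log table and its columns are invented for the example.

```python
import sqlite3

# Hypothetical web-log rows: (HTTP status, bytes served)
rows = [(200, 1024), (404, 0), (200, 2048)]

# Declarative version: the kind of question Shark answers with a
# SQL-like query (sqlite3 stands in here, purely to show the shape).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (status INTEGER, bytes INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
sql_total = conn.execute(
    "SELECT SUM(bytes) FROM logs WHERE status = 200"
).fetchone()[0]

# RDD-style version: the same question as filter + sum transformations.
rdd_total = sum(b for status, b in rows if status == 200)

print(sql_total, rdd_total)  # 3072 3072
```

Both forms compute the bytes served on successful requests; which one you reach for is largely a matter of whether the question is more natural as a query or as a transformation pipeline.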

