Big Data Processing with Apache Spark

5 (1 review total)
By Manuel Ignacio Franco Galeano

About this book

Processing big data in real time is challenging because it demands scalability, information consistency, and fault tolerance. This book teaches you how to use Spark to make your overall analytical workflow faster and more efficient. You'll explore the core concepts and tools of the Spark ecosystem, such as Spark Streaming, the Spark Streaming API, the machine learning extension, and Structured Streaming.

You'll begin by learning data processing fundamentals using the Resilient Distributed Datasets (RDDs), SQL, Datasets, and DataFrames APIs. After grasping these fundamentals, you'll move on to using the Spark Streaming APIs to consume data in real time from TCP sockets, and to integrating Amazon Web Services (AWS) for stream consumption.

By the end of this book, you'll understand how to use the machine learning extensions and structured streams, and you'll be able to apply Spark in your own big data projects.

Publication date:
October 2018


Big Data Processing with Apache Spark


Copyright © 2018 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this course to ensure the accuracy of the information presented. However, the information contained in this course is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this course.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this course by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Author: Manuel Franco

Reviewer: Amit Nandi

Managing Editor: Edwin Moses

Acquisitions Editor: Aditya Date

Production Editor: Nitesh Thakur

Editorial Board: David Barnes, Ewan Buckingham, Simon Cox, Manasa Kumar, Alex Mazonowicz, Douglas Paterson, Dominic Pereira, Shiny Poojary, Saman Siddiqui, Erol Staveley, Ankita Thakur, and Mohita Vyas

First Published: October 2018

Production Reference: 1311018

ISBN: 978-1-78980-881-0

Table of Contents


Introduction to Spark Distributed Processing


Introduction to Spark and Resilient Distributed Datasets

Spark Components

Spark Deployment Modes

Spark Standalone

Apache Mesos

Other Deployment Options

Resilient Distributed Datasets

Python Shell and SparkContext

Parallelized Collections

RDD Creation from External Data Sources

Exercise 1: Basic Interactive Analysis with Python

Operations Supported by the RDD API

Map Transformations

Reduce Action

Working with Key-Value Pairs

Join Transformations

Set Operations

Exercise 2: Map Reduce Operations

Activity 1: Statistical Operations on Books

Self-Contained Python Spark Programs

Introduction to Functional Programming

Exercise 3: Standalone Python Programs

Introduction to SQL, Datasets, and DataFrames

Exercise 4: Downloading the Reduced Version of the movielens Dataset

Exercise 5: RDD Operations in DataFrame Objects


Introduction to Spark Streaming


Introduction to Streaming Architectures

Back-Pressure, Write-Ahead Logging, and Checkpointing

Introduction to Discretized Streams

Consuming Streams from a TCP Socket

TCP Input DStream

Map-Reduce Operations over DStreams

Exercise 6: Building an Event TCP Server

Activity 2: Building a Simple TCP Spark Stream Consumer

Parallel Recovery of State with Checkpointing

Keeping the State in Streaming Applications

Join Operations

Exercise 7: TCP Stream Consumer from Multiple Sources

Activity 3: Consuming Event Data from Three TCP Servers

Windowing Operations

Exercise 8: Distributed Log Server

Introduction to Structured Streaming

Result Table and Output Modes in Structured Streaming

Exercise 9: Writing Random Ratings

Exercise 10: Structured Streaming


Spark Streaming Integration with AWS


Spark Integration with AWS Services

Previous Requirements

AWS Kinesis Data Streams Basic Functionality

Integrating AWS Kinesis and Python

Exercise 11: Listing Existing Streams

Exercise 12: Creating a New Stream

Exercise 13: Deleting an Existing Stream

Exercise 14: Pushing Data to a Stream

AWS S3 Basic Functionality

Creating, Listing, and Deleting AWS S3 Buckets

Exercise 15: Listing Existing Buckets

Exercise 16: Creating a Bucket

Exercise 17: Deleting a Bucket

Kinesis Streams and Spark Streams

Activity 4: AWS and Spark Pipeline


Spark Streaming, ML, and Windowing Operations


Spark Integration with Machine Learning

The MovieLens Dataset

Introduction to Recommendation Systems and Collaborative Filtering

Exercise 18: Collaborative Filtering and Spark

Exercise 19: Creating a TCP Server that Publishes User Ratings

Exercise 20: Spark Streams Integration with Machine Learning

Activity 5: Experimenting with Windowing Operations


Appendix A

About the Author

  • Manuel Ignacio Franco Galeano

    Manuel Ignacio Franco Galeano is a computer scientist from Colombia. He works for Fender Musical Instruments as a lead engineer in Dublin, Ireland. He holds a master's degree in computer science from University College Dublin (UCD). His areas of interest and research are music information retrieval, data analytics, distributed systems, and blockchain technologies.


Latest Reviews

(1 review total)
Very useful examples and comprehensive explanations.
