Chapter 14: Data Processing with Apache Spark

In the previous chapter, you learned how to add streaming data to your data pipelines. Using Python or Apache NiFi, you can extract, transform, and load streaming data. However, to perform transformations on large amounts of streaming data, data engineers turn to tools such as Apache Spark. For non-trivial transformations, Apache Spark is faster than most other methods, such as MapReduce, and it allows distributed data processing.

In this chapter, we're going to cover the following main topics:

  • Installing and running Spark
  • Installing and configuring PySpark
  • Processing data with PySpark

Installing and running Spark

Apache Spark is a distributed data processing engine that can handle both streaming and batch data, and even graphs. It has a core engine and a set of libraries that add functionality. A common depiction of the Spark ecosystem is shown in the following diagram:

Figure 14.1 – The Apache Spark ecosystem

To run Spark as a cluster, you have several options. Spark can run in standalone mode, which uses the simple cluster manager that ships with Spark. It can also run on Amazon EC2 instances, or on clusters managed by YARN, Mesos, or Kubernetes. In a production environment with a significant workload, you would probably not run in standalone mode; however, that is how we will stand up our cluster in this chapter. The principles are the same, but the standalone cluster is the fastest way to get up and running without diving into more complicated infrastructure.
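To make the standalone option concrete, the following is a minimal sketch of how a PySpark application attaches to a standalone cluster once it is running; the localhost host name and the appName value are placeholders I have chosen for illustration, and 7077 is the default standalone master port:

from pyspark.sql import SparkSession

# Connect to a standalone Spark master. spark://<head-node>:7077 is the
# default standalone master URL; replace localhost with your head node.
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("StandaloneCheck")
    .getOrCreate()
)

print(spark.sparkContext.master)  # prints the master URL the session is attached to
spark.stop()

We will set up PySpark itself in the next section; this sketch simply shows where the head node's URL fits in.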

To install Apache Spark, take the following...

Installing and configuring PySpark

PySpark is installed with Spark. You can find it in the ~/spark3/bin directory, alongside other libraries and tools. To configure PySpark to run, you need to export the following environment variables:

export SPARK_HOME=/home/paulcrickard/spark3
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3 

The preceding commands set the SPARK_HOME variable to the directory where you installed Spark. I have pointed the variable at the head node's installation because, in a real cluster, the worker nodes would be on other machines. The second line adds $SPARK_HOME/bin to your path. This means that when you type a command, the operating system looks for it in the directories on your path, so it will now also search ~/spark3/bin, which is where pyspark lives.

Running the preceding commands in a terminal will allow Spark to run while the terminal is open, but you will have to rerun them every time you open a new one. To make them permanent, you can add the commands...
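Once the variables are set, whether per terminal or permanently, you can quickly confirm that Python can find PySpark. The following is a minimal sketch; findspark uses SPARK_HOME to locate the installation, and the local[*] master and version check are my additions for illustration:

import findspark
findspark.init()  # reads SPARK_HOME and makes PySpark importable

from pyspark.sql import SparkSession

# Start a throwaway local session just to prove the configuration works.
spark = SparkSession.builder.master("local[*]").appName("ConfigCheck").getOrCreate()
print(spark.version)
spark.stop()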

Processing data with PySpark

Before processing data with PySpark, let's run one of the samples to show how Spark works. Then, we will skip the boilerplate in later examples and focus on data processing. The Jupyter notebook for the Pi Estimation example from the Spark website at http://spark.apache.org/examples.html is shown in the following screenshot:

Figure 14.6 – The Pi Estimation example in a Jupyter notebook

The example from the website will not run without some modifications. In the following steps, I will walk through the cells:

  1. The first cell imports findspark and runs the init() method. This was explained in the preceding section as the preferred method to include PySpark in Jupyter notebooks. The code is as follows:
    import findspark
    findspark.init()
  2. The next cell imports the pyspark library and SparkSession. It then creates the session by passing the head node of the Spark cluster. You can get the URL from the Spark web UI...
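Putting the cells together, the following is a sketch of the complete Pi Estimation example as it might appear in the notebook; the master URL is a placeholder for your own head node, and the sample count is arbitrary:

import random

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create the session against the Spark master; replace the URL with your head node's.
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("Pi")
    .getOrCreate()
)

NUM_SAMPLES = 1000000

def inside(_):
    # A random point in the unit square falls inside the quarter circle
    # with probability pi/4.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

count = (
    spark.sparkContext
    .parallelize(range(NUM_SAMPLES))
    .filter(inside)
    .count()
)

print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
spark.stop()

The estimate converges slowly, so do not expect many correct digits; the point of the example is simply that the filter and count are distributed across the cluster's workers.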

Summary

In this chapter, you learned the basics of working with Apache Spark. First, you downloaded and installed Spark, then configured PySpark to run in Jupyter notebooks. You also learned how to scale Spark horizontally by adding nodes. Spark uses DataFrames that are similar to those in pandas, and the last section taught you the basics of manipulating data in Spark.

In the next chapter, you will use Spark with Apache MiNiFi to move data at the edge or on Internet-of-Things devices.
