Chapter 3. ETL with Spark

So far we have gone through the architecture of Spark and had some detailed discussions around RDDs. By the end of Chapter 2, Transformations and Actions with Spark RDDs, we had focused on PairRDDs and some of their transformations.

This chapter focuses on doing ETL with Apache Spark. We'll cover the following topics, which will hopefully help you take the next step with Apache Spark:

  • Understanding the ETL process
  • Commonly supported file formats
  • Commonly supported filesystems
  • Working with NoSQL databases

Let's get started!

What is ETL?


ETL stands for Extraction, Transformation, and Loading. The term has been around for decades and represents the industry-standard process of moving and transforming data to build pipelines that deliver BI and analytics. ETL processes are also widely used in data migration and master data management initiatives. Since the focus of this book is on Spark, we'll touch lightly upon the subject of ETL, but will not go into great detail.

Extraction

Extraction is the first part of the ETL process, representing the extraction of data from source systems. It is often one of the most important parts of the ETL process, as it sets the stage for further downstream processing. There are a few major things to consider during an extraction process (a short sketch follows the list):

  • The source system type (RDBMS, NoSQL, flat files, Twitter/Facebook streams)
  • The file formats (CSV, JSON, XML, Parquet, Sequence, Object files)
  • The frequency of the extract (daily, hourly, every second)
  • The size of the extract...
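
To make these considerations concrete, here is a minimal sketch of a daily flat-file extract; the landing path, run date, and file layout are placeholders, not a prescribed convention:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DailyExtract").getOrCreate()

    // Source system type: flat-file exports dropped by the source system
    // Frequency: a daily extract, parameterized by the run date
    val runDate = "2017-03-01"
    val extract = spark.read
      .option("header", "true")
      .csv(s"/landing/sales/$runDate/*.csv")

    // A rough handle on the size of the extract before further processing
    println(s"Extracted ${extract.count()} records for $runDate")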

How is Spark being used?


Matei Zaharia is the creator of the Apache Spark project and a co-founder of Databricks, the company formed by the creators of Apache Spark. In his keynote at Spark Summit Europe in the fall of 2015, Matei mentioned some key metrics on how Spark is being used in various runtime environments. The numbers were a bit surprising to me, as I had thought Spark on YARN would have higher numbers than what was presented. Here are the key figures:

  • Spark in Standalone mode - 48%
  • Spark on YARN - 40%
  • Spark on Mesos - 11%

As we can see from the numbers, almost 90% of Apache Spark installations run in standalone mode or on YARN. When Spark is configured on YARN, we can assume that the organization has chosen Hadoop as its data operating system and is planning to move its data onto Hadoop, which means our primary sources of data ingest might be Hive, HDFS, HBase, or other NoSQL systems.
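
As an aside, the runtime environment is selected through the master setting when an application is submitted. Here is a minimal sketch of setting it programmatically; the host names are placeholders, and in practice the master is usually passed via spark-submit --master rather than hardcoded:

    import org.apache.spark.sql.SparkSession

    // Standalone mode: point the master at the standalone cluster manager;
    // use "yarn" for YARN or "mesos://mesos-master:5050" for Mesos
    val spark = SparkSession.builder()
      .appName("DeploymentSketch")
      .master("spark://master-host:7077")
      .getOrCreate()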

When Apache Spark is installed in standalone mode, the possibility...

Commonly Supported File Formats


We've already seen the ease with which you can manipulate text files using Spark with the textFile() method on SparkContext. However, you'll be pleased to know that Apache Spark supports a large number of other formats, and the list grows with every release. With Apache Spark release 2.0, the following file formats are supported out of the box (a short sketch of reading JSON and CSV follows the list):

  • TextFiles (already covered)
  • JSON files
  • CSV Files
  • Sequence Files
  • Object Files
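
As a quick illustration of the newer entries in this list, here is a minimal sketch of reading JSON and CSV files with the Spark 2.0 DataFrame API; the file paths are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("FileFormats").getOrCreate()

    // JSON: each line of the file is expected to be a self-contained JSON record
    val jsonDF = spark.read.json("/data/people.json")

    // CSV: built into Spark 2.0; header handling and schema inference are opt-in
    val csvDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/people.csv")

    jsonDF.printSchema()
    csvDF.show(5)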

Text Files

We've already seen various examples in Chapter 1, Architecture and Installation and Chapter 2, Transformations and Actions with Spark RDDs of how to read text files using the textFile() function. Each line in the text file is assumed to be a new record. We've also seen examples of wholeTextFiles(), which returns a PairRDD, with the key being the identifier of the file. This is very useful in ETL jobs where you might want to process data differently based on the key, or even pass it on to downstream processing.
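
A minimal sketch contrasting the two APIs, assuming sc is the SparkContext available in the Spark shell and the paths are placeholders:

    // One record per line; lines is an RDD[String]
    val lines = sc.textFile("/data/logs/app.log")

    // One record per file; files is an RDD[(String, String)] of (path, contents)
    val files = sc.wholeTextFiles("/data/logs/")

    // The file-path key lets an ETL job route each file to different processing
    val routed = files.map { case (path, content) =>
      if (path.endsWith(".csv")) ("csv", content.length) else ("other", content.length)
    }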

An...

Commonly supported file systems


Until now we have mostly focused on the functional aspects of Spark and have therefore steered away from a discussion of the filesystems Spark supports. You might have seen a couple of examples around HDFS, but the primary focus has been local filesystems. However, in production environments it will be extremely rare for you to be working on a local filesystem; chances are that you will be working with distributed filesystems such as HDFS and Amazon S3.
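
As a quick illustration, switching filesystems is largely a matter of changing the URI scheme. A minimal sketch, assuming sc is the shell's SparkContext; the host names, bucket, and paths are placeholders, and S3 access additionally assumes the Hadoop AWS libraries and credentials are configured:

    // Local filesystem
    val localRDD = sc.textFile("file:///tmp/data.txt")

    // HDFS: the namenode host and port come from the cluster configuration
    val hdfsRDD = sc.textFile("hdfs://namenode:8020/user/etl/data.txt")

    // Amazon S3 through the s3a connector
    val s3RDD = sc.textFile("s3a://my-bucket/etl/data.txt")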

Working with HDFS

Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. HDFS provides the ability to store large amounts of data across commodity hardware, and companies are already storing massive amounts of data on HDFS by moving it off their traditional database systems and creating data lakes on Hadoop. Spark allows you to read data from HDFS in a very similar way to how you would read from a typical filesystem, with the only...

Structured Data sources and Databases


Spark works with a variety of structured data sources including, but not limited to, the following:

  1. Parquet files: Apache Parquet is a columnar storage format. More details about the structure of Parquet and how Spark makes use of it are available in the Spark SQL chapter.
  2. Hive tables.
  3. JDBC: Spark allows the use of JDBC to connect to a wide variety of databases. Of course, data access via JDBC is relatively slow compared to native database utilities (a short sketch follows this list).
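
A minimal sketch of reading Parquet and JDBC sources with the DataFrame API, assuming spark is a SparkSession; the URL, table, and credentials are placeholders, and a suitable JDBC driver is assumed to be on the classpath:

    // Parquet files carry their own schema, so no extra options are required
    val parquetDF = spark.read.parquet("/warehouse/events.parquet")

    // JDBC: Spark issues SQL against the database through the supplied driver
    val jdbcDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "etl_user")
      .option("password", "etl_password")
      .load()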

We'll cover most of the structured sources in Chapter 4, Spark SQL, later in this book.

Working with NoSQL Databases

A NoSQL (originally referring to non-SQL, non-relational, or not only SQL) database provides a mechanism for the storage (https://en.wikipedia.org/wiki/Computer_data_storage) and retrieval (https://en.wikipedia.org/wiki/Data_retrieval) of data that is modeled by means other than the tabular relations used in relational databases (https://en.wikipedia.org/wiki/Relational_database). NoSQL is a relatively...
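
As one example of such a connector, here is a rough sketch using the DataStax Spark Cassandra connector's data source; this assumes the connector package is on the classpath and spark.cassandra.connection.host is set in the Spark configuration, and the keyspace and table names are placeholders. HBase, MongoDB, and other stores have analogous connectors:

    // Reading a Cassandra table as a DataFrame through the connector's data source;
    // spark.cassandra.connection.host must point at the cluster in the Spark conf
    val usersDF = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "crm", "table" -> "users"))
      .load()

    usersDF.filter("country = 'US'").show(5)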

Summary


In this chapter, we have covered the basics of ETL, and Spark's ability to interact with a variety of sources including standard text, CSV, TSV, and JSON files. We moved on to look at accessing filesystems, including local filesystems, HDFS, and S3. Finally, we spent some time helping you understand access to a variety of NoSQL databases and the connectors available for them. As you can see, we have covered a few of the popular systems, but the massive open-source ecosystem around Spark means that new connectors appear almost monthly. It is highly recommended that you keep a close eye on each project's GitHub page for the latest developments.

We'll now move on to the next chapter, where we are going to focus on Spark SQL, DataFrames, and Datasets. The next chapter is important as it builds on what we have covered already and helps us understand how Spark 2.0 abstracts developers away from the relatively complex concept of RDDs by expanding on the already introduced concept of DataFrames...
