Essential PySpark for Scalable Data Analytics

By Sreeram Nudurupati
    What do you get with a Packt Subscription?

  • Instant access to this title and 7,500+ eBooks & Videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Free Chapter
    Chapter 1: Distributed Computing Primer
About this book

Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python language API, which offers Python developers an easy-to-use scalable data analytics framework.

Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. This book helps you build real-time analytics pipelines that help you gain insights faster. You'll then discover methods for building cloud-based data lakes, and explore Delta Lake, which brings reliability to data lakes. The book also covers Data Lakehouse, an emerging paradigm, which combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, such as data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries along with a new pandas API on top of PySpark called Koalas.

By the end of this PySpark book, you'll be able to harness the power of PySpark to solve business problems.

Publication date:
October 2021
Publisher
Packt
Pages
322
ISBN
9781800568877

 

Chapter 1: Distributed Computing Primer

This chapter introduces you to the Distributed Computing paradigm and shows you how Distributed Computing can help you to easily process very large amounts of data. You will learn about the concept of Data Parallel Processing using the MapReduce paradigm and, finally, learn how Data Parallel Processing can be made more efficient by using an in-memory, unified data processing engine such as Apache Spark.

Then, you will dive deeper into the architecture and components of Apache Spark along with code examples. Finally, you will get an overview of what's new with the latest 3.0 release of Apache Spark.

In this chapter, the key skills that you will acquire include an understanding of the basics of the Distributed Computing paradigm and a few different implementations of the Distributed Computing paradigm such as MapReduce and Apache Spark. You will learn about the fundamentals of Apache Spark along with its architecture and core components, such as the Driver, Executor, and Cluster Manager, and how they come together as a single unit to perform a Distributed Computing task. You will learn about Spark's Resilient Distributed Dataset (RDD) API along with higher-order functions and lambdas. You will also gain an understanding of the Spark SQL Engine and its DataFrame and SQL APIs. Additionally, you will implement working code examples. You will also learn about the various components of an Apache Spark data processing program, including transformations and actions, and you will learn about the concept of Lazy Evaluation.

In this chapter, we're going to cover the following main topics:

  • Introduction Distributed Computing
  • Distributed Computing with Apache Spark
  • Big data processing with Spark SQL and DataFrames
 

Technical requirements

In this chapter, we will be using the Databricks Community Edition to run our code. This can be found at https://community.cloud.databricks.com.

Sign-up instructions can be found at https://databricks.com/try-databricks.

The code used in this chapter can be downloaded from https://github.com/PacktPublishing/Essential-PySpark-for-Data-Analytics/tree/main/Chapter01.

The datasets used in this chapter can be found at https://github.com/PacktPublishing/Essential-PySpark-for-Data-Analytics/tree/main/data.

The original datasets can be taken from their sources, as follows:

 

Distributed Computing

In this section, you will learn about Distributed Computing, the need for it, and how you can use it to process very large amounts of data in a quick and efficient manner.

Introduction to Distributed Computing

Distributed Computing is a class of computing techniques where we use a group of computers as a single unit to solve a computational problem instead of just using a single machine.

In data analytics, when the amount of data becomes too large to fit in a single machine, we can either split the data into smaller chunks and process it on a single machine iteratively, or we can process the chunks of data on several machines in parallel. While the former gets the job done, it might take longer to process the entire dataset iteratively; the latter technique gets the job completed in a shorter period of time by using multiple machines at once.

There are different kinds of Distributed Computing techniques; however, for data analytics, one popular technique is Data Parallel Processing.

Data Parallel Processing

Data Parallel Processing involves two main parts:

  • The actual data that needs to be processed
  • The piece of code or business logic that needs to be applied to the data in order to process it

We can process large amounts of data by splitting it into smaller chunks and processing them in parallel on several machines. This can be done in two ways:

  • First, bring the data to the machine where our code is running.
  • Second, take our code to where our data is actually stored.

One drawback of the first technique is that as our data sizes become larger, the amount of time it takes to move data also increases proportionally. Therefore, we end up spending more time moving data from one system to another and, in turn, negating any efficiency gained by our parallel processing system. We also find ourselves creating multiple copies of data during data replication.

The second technique is far more efficient because instead of moving large amounts of data, we can easily move a few lines of code to where our data actually resides. This technique of moving code to where the data resides is referred to as Data Parallel Processing. This Data Parallel Processing technique is very fast and efficient, as we save the amount of time that was needed earlier to move and copy data across different systems. One such Data Parallel Processing technique is called the MapReduce paradigm.

Data Parallel Processing using the MapReduce paradigm

The MapReduce paradigm breaks down a Data Parallel Processing problem into three main stages:

  • The Map stage
  • The Shuffle stage
  • The Reduce stage

The Map stage takes the input dataset, splits it into (key, value) pairs, applies some processing on the pairs, and transforms them into another set of (key, value) pairs.

The Shuffle stage takes the (key, value) pairs from the Map stage and shuffles/sorts them so that pairs with the same key end up together.

The Reduce stage takes the resultant (key, value) pairs from the Shuffle stage and reduces or aggregates the pairs to produce the final result.

There can be multiple Map stages followed by multiple Reduce stages. However, a Reduce stage only starts after all of the Map stages have been completed.

Let's take a look at an example where we want to calculate the counts of all the different words in a text document and apply the MapReduce paradigm to it.

The following diagram shows how the MapReduce paradigm works in general:

Figure 1.1 – Calculating the word count using MapReduce

Figure 1.1 – Calculating the word count using MapReduce

The previous example works in the following manner:

  1. In Figure 1.1, we have a cluster of three nodes, labeled M1, M2, and M3. Each machine includes a few text files containing several sentences in plain text. Here, our goal is to use MapReduce to count all of the words in the text files.
  2. We load all the text documents onto the cluster; each machine loads the documents that are local to it.
  3. The Map Stage splits the text files into individual lines and further splits each line into individual words. Then, it assigns each word a count of 1 to create a (word, count) pair.
  4. The Shuffle Stage takes the (word, count) pairs from the Map stage and shuffles/sorts them so that word pairs with the same keyword end up together.
  5. The Reduce Stage groups all keywords together and sums their counts to produce the final count of each individual word.

The MapReduce paradigm was popularized by the Hadoop framework and was pretty popular for processing big data workloads. However, the MapReduce paradigm offers a very low-level API for transforming data and requires users to have proficient knowledge of programming languages such as Java. Expressing a data analytics problem using Map and Reduce is not very intuitive or flexible.

MapReduce was designed to run on commodity hardware, and since commodity hardware was prone to failures, resiliency to hardware failures was a necessity. MapReduce achieves resiliency to hardware failures by saving the results of every stage to disk. This round-trip to disk after every stage makes MapReduce relatively slow at processing data because of the slow I/O performance of physical disks in general. To overcome this limitation, the next generation of the MapReduce paradigm was created, which made use of much faster system memory, as opposed to disks, to process data and offered much more flexible APIs to express data transformations. This new framework is called Apache Spark, and you will learn about it in the next section and throughout the remainder of this book.

Important note

In Distributed Computing, you will often encounter the term cluster. A cluster is a group of computers all working together as a single unit to solve a computing problem. The primary machine of a cluster is, typically, termed the Master Node, which takes care of the orchestration and management of the cluster, and secondary machines that actually carry out task execution are called Worker Nodes. A cluster is a key component of any Distributed Computing system, and you will encounter these terms throughout this book.

 

Distributed Computing with Apache Spark

Over the last decade, Apache Spark has grown to be the de facto standard for big data processing. Indeed, it is an indispensable tool in the hands of anyone involved with data analytics.

Here, we will begin with the basics of Apache Spark, including its architecture and components. Then, we will get started with the PySpark programming API to actually implement the previously illustrated word count problem. Finally, we will take a look at what's new with the latest 3.0 release of Apache Spark.

Introduction to Apache Spark

Apache Spark is an in-memory, unified data analytics engine that is relatively fast compared to other distributed data processing frameworks.

It is a unified data analytics framework because it can process different types of big data workloads with a single engine. The different workloads include the following

  • Batch data processing
  • Real-time data processing
  • Machine learning and data science

Typically, data analytics involves all or a combination of the previously mentioned workloads to solve a single business problem. Before Apache Spark, there was no single framework that could accommodate all three workloads simultaneously. With Apache Spark, various teams involved in data analytics can all use a single framework to solve a single business problem, thus improving communication and collaboration among teams and drastically reducing their learning curve.

We will explore each of the preceding workloads, in depth, in Chapter 2, Data Ingestion through to Chapter 8, Unsupervised Machine Learning, of this book.

Further, Apache Spark is fast in two aspects:

  • It is fast in terms of data processing speed.
  • It is fast in terms of development speed.

Apache Spark has fast job/query execution speeds because it does all of its data processing in memory, and it has other optimizations techniques built-in such as Lazy Evaluation, Predicate Pushdown, and Partition Pruning to name a few. We will go over Spark's optimization techniques in the coming chapters.

Secondly, Apache Spark provides developers with very high-level APIs to perform basic data processing operations such as filtering, grouping, sorting, joining, and aggregating. By using these high-level programming constructs, developers can very easily express their data processing logic, making their development many times faster.

The core abstraction of Apache Spark, which makes it fast and very expressive for data analytics, is called an RDD. We will cover this in the next section.

Data Parallel Processing with RDDs

An RDD is the core abstraction of the Apache Spark framework. Think of an RDD as any kind of immutable data structure that is typically found in a programming language but one that resides in the memory of several machines instead of just one. An RDD consists of partitions, which are logical divisions of an RDD, with a few of them residing on each machine.

The following diagram helps explain the concepts of an RDD and its partitions:

Figure 1.2 – An RDD

Figure 1.2 – An RDD

In the previous diagram, we have a cluster of three machines or nodes. There are three RDDs on the cluster, and each RDD is divided into partitions. Each node of the cluster contains a few partitions of an individual RDD, and each RDD is distributed among several nodes of the cluster by means of partitions.

The RDD abstractions are accompanied by a set of high-level functions that can operate on the RRDs in order to manipulate the data stored within the partitions. These functions are called higher-order functions, and you will learn about them in the following section.

Higher-order functions

Higher-order functions manipulate RDDs and help us write business logic to transform data stored within the partitions. Higher-order functions accept other functions as parameters, and these inner functions help us define the actual business logic that transforms data and is applied to each partition of the RDD in parallel. These inner functions passed as parameters to the higher-order functions are called lambda functions or lambdas.

Apache Spark comes with several higher-order functions such as map, flatMap, reduce, fold, filter, reduceByKey, join, and union to name a few. These functions are high-level functions and help us express our data manipulation logic very easily.

For example, consider our previously illustrated word count example. Let's say you wanted to read a text file as an RDD and split each word based on a delimiter such as a whitespace. This is what code expressed in terms of an RDD and higher-order function would look like:

lines = sc.textFile("/databricks-datasets/README.md")
words = lines.flatMap(lambda s: s.split(" "))
word_tuples = words.map(lambda s: (s, 1))

In the previous code snippet, the following occurs:

  1. We are loading a text file using the built-in sc.textFile() method, which loads all text files at the specified location into the cluster memory, splits them into individual lines, and returns an RDD of lines or strings.
  2. We then apply the flatMap() higher-order function to the new RDD of lines and supply it with a function that instructs it to take each line and split it based on a white space. The lambda function that we pass to flatMap() is simply an anonymous function that takes one parameter, an individual line of StringType, and returns a list of words. Through the flatMap() and lambda() functions, we are able to transform an RDD of lines into an RDD of words.
  3. Finally, we use the map() function to assign a count of 1 to every individual word. This is pretty easy and definitely more intuitive compared to developing a MapReduce application using the Java programming language.

To summarize what you have learned, the primary construct of the Apache Spark framework is an RDD. An RDD consists of partitions distributed across individual nodes of a cluster. We use special functions called higher-order functions to operate on the RDDs and transform the RDDs according to our business logic. This business logic is passed along to the Worker Nodes via higher-order functions in the form of lambdas or anonymous functions.

Before we dig deeper into the inner workings of higher-order functions and lambda functions, we need to understand the architecture of the Apache Spark framework and the components of a typical Spark Cluster. We will do this in the following section.

Note

The Resilient part of an RDD comes from the fact that every RDD knows its lineage. At any given point in time, an RDD has information of all the individual operations performed on it, going back all the way to the data source itself. Thus, if any Executors are lost due to any failures and one or more of its partitions are lost, it can easily recreate those partitions from the source data making use of the lineage information, thus making it Resilient to failures.

Apache Spark cluster architecture

A typical Apache Spark cluster consists of three major components, namely, the Driver, a few Executors, and the Cluster Manager:

Figure 1.3 – Apache Spark Cluster components

Figure 1.3 – Apache Spark Cluster components

Let's examine each of these components a little closer.

Driver – the heart of a Spark application

The Spark Driver is a Java Virtual Machine process and is the core part of a Spark application. It is responsible for user application code declarations, along with the creation of RDDs, DataFrames, and datasets. It is also responsible for coordinating with and running code on the Executors and creating and scheduling tasks on the Executors. It is even responsible for relaunching Executors after a failure and finally returning any data requested back to the client or the user. Think of a Spark Driver as the main() program of any Spark application.

Important note

The Driver is the single point of failure for a Spark cluster, and the entire Spark application fails if the driver fails; therefore, different Cluster Managers implement different strategies to make the Driver highly available.

Executors – the actual workers

Spark Executors are also Java Virtual Machine processes, and they are responsible for running operations on RDDs that actually transform data. They can cache data partitions locally and return the processed data back to the Driver or write to persistent storage. Each Executor runs operations on a set of partitions of an RDD in parallel.

Cluster Manager – coordinates and manages cluster resources

The Cluster Manager is a process that runs centrally on the cluster and is responsible for providing resources requested by the Driver. It also monitors the Executors regarding task progress and their status. Apache Spark comes with its own Cluster Manager, which is referred to as the Standalone Cluster Manager, but it also supports other popular Cluster Managers such as YARN or Mesos. Throughout this book, we will be using Spark's built-in Standalone Cluster Manager.

Getting started with Spark

So far, we have learnt about Apache Spark's core data structure, called RDD, the functions used to manipulate RDDs, called higher-order functions, and the components of an Apache Spark cluster. You have also seen a few code snippets on how to use higher-order functions.

In this section, you will put your knowledge to practical use and write your very first Apache Spark program, where you will use Spark's Python API called PySpark to create a word count application. However, first, we need a few things to get started:

  • An Apache Spark cluster
  • Datasets
  • Actual code for the word count application

We will use the free Community Edition of Databricks to create our Spark cluster. The code used can be found via the GitHub link that was mentioned at the beginning of this chapter. The links for the required resources can be found in the Technical requirements section toward the beginning of the chapter.

Note

Although we are using Databricks Spark Clusters in this book, the provided code can be executed on any Spark cluster running Spark 3.0, or higher, as long as data is provided at a location accessible by your Spark cluster.

Now that you have gained an understanding of Spark's core concepts such as RDDs, higher-order functions, lambdas, and Spark's architecture, let's implement your very first Spark application using the following code:

lines = sc.textFile("/databricks-datasets/README.md")
words = lines.flatMap(lambda s: s.split(" "))
word_tuples = words.map(lambda s: (s, 1))
word_count = word_tuples.reduceByKey(lambda x, y:  x + y)
word_count.take(10)
word_count.saveAsTextFile("/tmp/wordcount.txt")

In the previous code snippet, the following takes place:

  1. We load a text file using the built-in sc.textFile() method, which reads all of the text files at the specified location, splits them into individual lines, and returns an RDD of lines or strings.
  2. Then, we apply the flatMap() higher-order function to the RDD of lines and supply it with a function that instructs it to take each line and split it based on a white space. The lambda function that we pass to flatMap() is simply an anonymous function that takes one parameter, a line, and returns individual words as a list. By means of the flatMap() and lambda() functions, we are able to transform an RDD of lines into an RDD of words.
  3. Then, we use the map() function to assign a count of 1 to every individual word.
  4. Finally, we use the reduceByKey() higher-order function to sum up the count of similar words occurring multiple times.
  5. Once the counts have been calculated, we make use of the take() function to display a sample of the final word counts.
  6. Although displaying a sample result set is usually helpful in determining the correctness of our code, in a big data setting, it is not practical to display all the results on to the console. So, we make use of the saveAsTextFile() function to persist our finals results in persistent storage.

    Important note

    It is not recommended that you display the entire result set onto the console using commands such as take() or collect(). It could even be outright dangerous to try and display all the data in a big data setting, as it could try to bring way too much data back to the driver and cause the driver to fail with an OutOfMemoryError, which, in turn, causes the entire application to fail.

    Therefore, it is recommended that you use take() with a very small result set, and use collect() only when you are confident that the amount of data returned is, indeed, very small.

Let's dive deeper into the following line of code in order to understand the inner workings of lambdas and how they implement Data Parallel Processing along with higher-order functions:

words = lines.flatMap(lambda s: s.split(" "))

In the previous code snippet, the flatMmap() higher-order function bundles the code present in the lambda and sends it over a network to the Worker Nodes, using a process called serialization. This serialized lambda is then sent out to every executor, and each executor, in turn, applies this lambda to individual RDD partitions in parallel.

Important note

Since higher-order functions need to be able to serialize the lambdas in order to send your code to the Executors. The lambda functions need to be serializable, and failing this, you might encounter a Task not serializable error.

In summary, higher-order functions are, essentially, transferring your data transformation code in the form of serialized lambdas to your data in RDD partitions. Therefore, instead of moving data to where the code is, we are actually moving our code to where data is situated, which is the exact definition of Data Parallel Processing, as we learned earlier in this chapter.

Thus, Apache Spark along with its RDDs and higher-order functions implements an in-memory version of the Data Parallel Processing paradigm. This makes Apache Spark fast and efficient at big data processing in a Distributed Computing setting.

The RDD abstraction of Apache Spark definitely offers a higher level of programming API compared to MapReduce, but it still requires some level of comprehension of the functional programming style to be able to express even the most common types of data transformations. To overcome this challenge, Spark's already existing SQL engine was expanded, and another abstraction, called the DataFrame, was added on top of RDDs. This makes data processing much easier and more familiar for data scientists and data analysts. The following section will explore the DataFrame and SQL API of the Spark SQL engine.

 

Big data processing with Spark SQL and DataFrames

The Spark SQL engine supports two types of APIs, namely, DataFrame and Spark SQL. Being higher-level abstractions than RDDs, these are far more intuitive and even more expressive. They come with many more data transformation functions and utilities that you might already be familiar with as a data engineer, data analyst, or a data scientist.

Spark SQL and DataFrame APIs offer a low barrier to entry into big data processing. They allow you to use your existing knowledge and skills of data analytics and allow you to easily get started with Distributed Computing. They help you get started with processing data at scale, without having to deal with any of the complexities that typically come along with Distributed Computing frameworks.

In this section, you will learn how to use both DataFrame and Spark SQL APIs to get started with your scalable data processing journey. Notably, the concepts learned here will be useful and are required throughout this book.

Transforming data with Spark DataFrames

Starting with Apache Spark 1.3, the Spark SQL engine was added as a layer on top of the RDD API and expanded to every component of Spark, to offer an even easier to use and familiar API for developers. Over the years, the Spark SQL engine and its DataFrame and SQL APIs have grown to be even more robust and have become the de facto and recommended standard for using Spark in general. Throughout this book, you will be exclusively using either DataFrame operations or Spark SQL statements for all your data processing needs, and you will rarely ever use the RDD API.

Think of a Spark DataFrame as a Pandas DataFrame or a relational database table with rows and named columns. The only difference is that a Spark DataFrame resides in the memory of several machines instead of a single machine. The following diagram shows a Spark DataFrame with three columns distributed across three worker machines:

Figure 1.4 – A distributed DataFrame

Figure 1.4 – A distributed DataFrame

A Spark DataFrame is also an immutable data structure such as an RDD, consisting of rows and named columns, where each individual column can be of any type. Additionally, DataFrames come with operations that allow you to manipulate data, and we generally refer to these set of operations as Domain Specific Language (DSL). Spark DataFrame operations can be grouped into two main categories, namely, transformations and actions, which we will explore in the following sections.

One advantage of using DataFrames or Spark SQL over the RDD API is that the Spark SQL engine comes with a built-in query optimizer called Catalyst. This Catalyst optimizer analyzes user code, along with any available statistics on the data, to generate the best possible execution plan for the query. This query plan is further converted into Java bytecode, which runs natively inside the Executor's Java JVM. This happens irrespective of the programming language used, thus making any code processed via the Spark SQL engine equally performant in most cases, whether it be written using Scala, Java, Python, R, or SQL.

Transformations

Transformations are operations performed on DataFrames that manipulate the data in the DataFrame and result in another DataFrame. Some examples of transformations are read, select, where, filter, join, and groupBy.

Actions

Actions are operations that actually cause a result to be calculated and either printed onto the console or, more practically, written back to a storage location. Some examples of actions include write, count, and show.

Lazy evaluation

Spark transformations are lazily evaluated, which means that transformations are not evaluated immediately as they are declared, and data is not manifested in memory until an action is called. This has a few advantages, as it gives the Spark optimizer an opportunity to evaluate all of your transformations until an action is called and generate the most optimal plan of execution to get the best performance and efficiency out of your code.

The advantage of Lazy Evaluation coupled with Spark's Catalyst optimizer is that you can solely focus on expressing your data transformation logic and not worry too much about arranging your transformations in a specific order to get the best performance and efficiency out of your code. This helps you be more productive at your tasks and not become perplexed by the complexities of a new framework.

Important note

Compared to Pandas DataFrames, Spark DataFrames are not manifested in memory as soon as they are declared. They are only manifested in memory when an action is called. Similarly, DataFrame operations don't necessarily run in the order you specified them to, as Spark's Catalyst optimizer generates the best possible execution plan for you, sometimes even combining a few operations into a single unit.

Let's take the word count example that we previously implemented using the RDD API and try to implement it using the DataFrame DSL:

from pyspark.sql.functions import split, explode
linesDf = spark.read.text("/databricks-datasets/README.md")
wordListDf = linesDf.select(split("value", " ").alias("words"))
wordsDf = wordListDf.select(explode("words").alias("word"))
wordCountDf = wordsDf.groupBy("word").count()
wordCountDf.show()
wordCountDf.write.csv("/tmp/wordcounts.csv")

In the previous code snippet, the following occurs:

  1. First, we import a few functions from the PySpark SQL function library, namely, split and explode.
  2. Then, we read text using the SparkSession read.text() method, which creates a DataFrame of lines of StringType.
  3. We then use the split() function to separate out every line into its individual words; the result is a DataFrame with a single column, named value, which is actually a list of words.
  4. Then, we use the explode() function to separate the list of words in each row out to every word on a separate row; the result is a DataFrame with a column labeled word.
  5. Now we are finally ready to count our words, so we group our words by the word column and count individual occurrences of each word. The final result is a DataFrame of two columns, that is, the actual word and its count.
  6. We can view a sample of the result using the show() function, and, finally, save our results in persistent storage using the write() function.

Can you guess which operations are actions? If you guessed show() or write(), then you are correct. Every other function, including select() and groupBy(), are transformations and will not induce the Spark job into action.

Note

Although the read() function is a transformation, sometimes, you will notice that it will actually execute a Spark job. The reason for this is that with certain structured and semi-structured data formats, Spark will try and infer the schema information from the underlying files and will process a small subset of the actual files to do this.

Using SQL on Spark

SQL is an expressive language for ad hoc data exploration and business intelligence types of queries. Because it is a very high-level declarative programming language, the user can simply focus on the input and output and what needs to be done to the data and not worry too much about the programming complexities of how to actually implement the logic. Apache Spark's SQL engine also has a SQL language API along with the DataFrame and Dataset APIs.

With Spark 3.0, Spark SQL is now compliant with ANSI standards, so if you are a data analyst who is familiar with another SQL-based platform, you should be able to get started with Spark SQL with minimal effort.

Since DataFrames and Spark SQL utilize the same underlying Spark SQL engine, they are completely interchangeable, and it is often the case that users intermix DataFrame DSL with Spark SQL statements for parts of the code that are expressed easily with SQL.

Now, let's rewrite our word count program using Spark SQL. First, we create a table specifying our text file to be a CSV file with a white space as the delimiter, a neat trick to read each line of the text file, and also split each file into individual words all at once:

CREATE TABLE word_counts (word STRING)
USING csv
OPTIONS("delimiter"=" ")
LOCATION "/databricks-datasets/README.md"

Now that we have a table of a single column of words, we just need to GROUP BY the word column and do a COUNT() operation to get our word counts:

SELECT word, COUNT(word) AS count
FROM word_counts
GROUP BY word

Here, you can observe that solving the same business problem became progressively easier from using MapReduce to RRDs, to DataFrames and Spark SQL. With each new release, Apache Spark has been adding many higher-level programming abstractions, data transformation and utility functions, and other optimizations. The goal has been to enable data engineers, data scientists, and data analysts to focus their time and energy on solving the actual business problem at hand and not worry about complex programming abstractions or system architectures.

Apache Spark's latest major release of version 3 has many such enhancements that make the life of a data analytics professional much easier. We will discuss the most prominent of these enhancements in the following section.

What's new in Apache Spark 3.0?

There are many new and notable features in Apache Spark 3.0; however, only a few are mentioned here, which you will find very useful during the beginning phases of your data analytics journey:

  • Speed: Apache Spark 3.0 is orders of magnitude faster than its predecessors. Third-party benchmarks have put Spark 3.0 to be anywhere from 2 to 17 times faster for certain types of workloads.
  • Adaptive Query Execution: The Spark SQL engine generates a few logical and physical query execution plans based on user code and any previously collected statistics on the source data. Then, it tries to choose the most optimal execution plan. However, sometimes, Spark is not able to generate the best possible execution plan either because the statistics are either stale or non-existent, leading to suboptimal performance. With adaptive query execution, Spark is able to dynamically adjust the execution plan during runtime and give the best possible query performance.
  • Dynamic Partition Pruning: Business intelligence systems and data warehouses follow a data modeling technique called Dimensional Modeling, where data is stored in a central fact table surrounded by a few dimensional tables. Business intelligence types of queries utilizing these dimensional models involve queries with multiple joins between the dimension and fact tables, along with various filter conditions on the dimension tables. With dynamic partition pruning, Spark is able to filter out any fact table partitions based on the filters applied on these dimensions, resulting in less data being read into the memory, which, in turn, results in better query performance.
  • Kubernetes Support: Earlier, we learned that Spark comes with its own Standalone Cluster Manager and can also work with other popular resource managers such as YARN and Mesos. Now Spark 3.0 natively supports Kubernetes, which is a popular open source framework for running and managing parallel container services.
 

Summary

In this chapter, you learned the concept of Distributed Computing. We discovered why Distributed Computing has become very important, as the amount of data being generated is growing rapidly, and it is not practical or feasible to process all your data using a single specialist system.

You then learned about the concept of Data Parallel Processing and reviewed a practical example of its implementation by means of the MapReduce paradigm.

Then, you were introduced to an in-memory, unified analytics engine called Apache Spark, and learned how fast and efficient it is for data processing. Additionally, you learned it is very intuitive and easy to get started for developing data processing applications. You also got to understand the architecture and components of Apache Spark and how they come together as a framework.

Next, you came to understand RDDs, which are the core abstraction of Apache Spark, how they store data on a cluster of machines in a distributed manner, and how you can leverage higher-order functions along with lambda functions to implement Data Parallel Processing via RDDs.

You also learned about the Spark SQL engine component of Apache Spark, how it provides a higher level of abstraction than RRDs, and that it has several built-in functions that you might already be familiar with. You learned to leverage the DataFrame DSL to implement your data processing business logic in an easier and more familiar way. You also learned about Spark's SQL API, how it is ANSI SQL standards-compliant, and how it allows you to seamlessly perform SQL analytics on large amounts of data efficiently.

You also came to know some of the prominent improvements in Apache Spark 3.0, such as adaptive query execution and dynamic partition pruning, which help make Spark 3.0 much faster in performance than its predecessors.

Now that you have learned the basics of big data processing with Apache Spark, you are ready to embark on a data analytics journey using Spark. A typical data analytics journey starts with acquiring raw data from various source systems, ingesting it into a historical storage component such as a data warehouse or a data lake, then transforming the raw data by cleansing, integrating, and transforming it to get a single source of truth. Finally, you can gain actionable business insights through clean and integrated data, leveraging descriptive and predictive analytics. We will cover each of these aspects in the subsequent chapters of this book, starting with the process of data cleansing and ingestion in the following chapter.

About the Author
  • Sreeram Nudurupati

    Sreeram Nudurupati is a data analytics professional with years of experience in designing and optimizing data analytics pipelines at scale. He has a history of helping enterprises, as well as digital natives, build optimized analytics pipelines by using the knowledge of the organization, infrastructure environment, and current technologies.

    Browse publications by this author
Latest Reviews (1 reviews total)
excellent content and information
Essential PySpark for Scalable Data Analytics
Unlock this book and the full library FREE for 7 days
Start now