Chapter 3. Using Spark SQL for Data Exploration

In this chapter, we will introduce you to using Spark SQL for exploratory data analysis. We will introduce preliminary techniques to compute some basic statistics, identify outliers, and visualize, sample, and pivot data. A series of hands-on exercises in this chapter will enable you to use Spark SQL along with tools such as Apache Zeppelin for developing an intuition about your data.

In this chapter, we shall look at the following topics:

  • What is Exploratory Data Analysis (EDA)?
  • Why is EDA important?
  • Using Spark SQL for basic data analysis
  • Visualizing data with Apache Zeppelin
  • Sampling data with Spark SQL APIs
  • Using Spark SQL for creating pivot tables

Introducing Exploratory Data Analysis (EDA)


Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This involves assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other, more sophisticated analyses. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify the possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that a more sophisticated model-based analysis is not justified.

Typically, the graphical techniques used in EDA are simple, consisting...

Using Spark SQL for basic data analysis


Interactively processing and visualizing large data is challenging, as queries can take a long time to execute and the visual interface cannot accommodate as many pixels as there are data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java, and R APIs, as well as libraries for distributed statistics and machine learning.

For data that fits into a single computer, there are many good tools available, such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data, then this section will offer some good tools and techniques for data exploration.
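As a concrete starting point, the following is a minimal sketch of loading a dataset into a DataFrame and computing basic descriptive statistics with Spark SQL. The file path, CSV options, and column names are assumptions modeled on the bank marketing data used in this chapter; adjust them to your environment.

import org.apache.spark.sql.SparkSession

// Create (or reuse) a SparkSession -- the entry point for Spark SQL.
val spark = SparkSession.builder().appName("DataExploration").getOrCreate()

// Load a semicolon-delimited CSV file with a header row, letting Spark infer column types.
// The path below is hypothetical; point it at your copy of the data.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ";")
  .csv("file:///data/bank/bank-full.csv")

// Inspect the inferred schema and compute count, mean, stddev, min, and max
// for a few numeric columns.
df.printSchema()
df.describe("age", "balance", "duration").show()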

In this section, we will go through some basic data exploration...

Visualizing data with Apache Zeppelin


Typically, we will generate many graphs to verify our hunches about the data. A lot of these quick and dirty graphs used during EDA are, ultimately, discarded. Exploratory data visualization is critical for data analysis and modeling. However, we often skip exploratory visualization with large data because it is hard. For instance, browsers typically cannot handle millions of data points. Hence, we have to summarize, sample, or model our data before we can effectively visualize it.

Traditionally, BI tools provided extensive features for aggregating and visualizing data. However, these tools typically relied on nightly jobs to summarize large volumes of data. The summarized data was subsequently downloaded and visualized on the practitioner's workstation. Spark can eliminate many of these batch jobs to support interactive data visualization.
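As an illustrative sketch (not the book's original listing), the following shows how a summary computed with Spark can be exposed to Zeppelin's SQL interpreter for charting; the marital and deposit column names are assumptions based on the dataset used in this chapter.

// In a Zeppelin notebook paragraph: aggregate first so that only a small summary
// needs to be rendered, then register it as a temporary view.
val depositSummary = df.groupBy("marital", "deposit").count()
depositSummary.createOrReplaceTempView("deposit_summary")

// A subsequent %sql paragraph, for example:
//   SELECT marital, deposit, count FROM deposit_summary
// can then be rendered as a table, bar, or pie chart using Zeppelin's chart toolbar.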

In this section, we will explore some basic data visualization techniques using Apache Zeppelin. Apache Zeppelin is...

Sampling data with Spark SQL APIs


Often, we need to visualize data points to understand the nature of our data. Statisticians use sampling techniques extensively for data analysis. Spark supports both approximate and exact sample generation. Approximate sampling is faster and is often good enough in most cases.

In this section, we will explore Spark SQL APIs used for generating samples. We will work through some examples of generating approximate and exact stratified samples, with and without replacement, using the DataFrame/Dataset API and RDD-based methods.

Sampling with the DataFrame/Dataset API

We can use sampleBy to create a stratified sample without replacement. We can specify the fractions for the percentages of each value to be selected in the sample.
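The following is a minimal sketch of sampleBy; the marital column and the per-value fractions are illustrative assumptions, not the original listing's values.

// Stratified sample without replacement: select roughly the given fraction of rows
// for each distinct value of the "marital" column, using a fixed seed.
val fractions = Map("married" -> 0.1, "single" -> 0.2, "divorced" -> 0.3)
val stratifiedSample = df.stat.sampleBy("marital", fractions, seed = 36L)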

We can then check the size of the sample and the number of records of each type.
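A minimal sketch of that check, reusing the stratifiedSample DataFrame from the previous sketch:

// Total number of sampled records, and the per-stratum counts.
println(stratifiedSample.count())
stratifiedSample.groupBy("marital").count().show()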

Next, we create a sample with replacement that selects a fraction of the rows (10% of the total records) using a random seed. Using sample is not guaranteed to provide the exact fraction of the total...
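A minimal sketch of such an approximate sample (the fraction and seed values are illustrative):

// Sample with replacement: roughly 10% of the rows, selected with a fixed random seed.
// The number of returned rows is only approximately 10% of the total.
val sampleWithReplacement = df.sample(withReplacement = true, fraction = 0.1, seed = 123L)
println(sampleWithReplacement.count())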

Using Spark SQL for creating pivot tables


Pivot tables provide alternate views of your data and are commonly used during data exploration. In the following examples, we demonstrate pivoting using Spark DataFrames.

The following example pivots on whether a housing loan was taken and computes the counts by marital status.
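A minimal sketch of this pivot, assuming the marital and housing column names from the bank marketing dataset used in this chapter:

// Rows are marital-status values, columns are the distinct values of the
// housing-loan field, and each cell holds the record count.
df.groupBy("marital").pivot("housing").count().show()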

In the next example, we create a DataFrame with appropriate column names for the total and average number of calls.
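A minimal sketch, under the assumption that a campaign column records the number of calls made to each client and that we group by marital status:

import org.apache.spark.sql.functions.{avg, sum}

// Aggregate the assumed "campaign" column and give the results readable column names.
val callCounts = df.groupBy("marital")
  .agg(sum("campaign").alias("total_calls"), avg("campaign").alias("avg_calls"))
callCounts.show()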

In the following example, we create a DataFrame with appropriate column names for the total and average duration of calls for each job category.
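A minimal sketch, assuming job and duration column names:

import org.apache.spark.sql.functions.{avg, sum}

// Total and average call duration per job category, with descriptive column names.
val callDurations = df.groupBy("job")
  .agg(sum("duration").alias("total_duration"), avg("duration").alias("avg_duration"))
callDurations.show()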

In the following example, we use pivoting to compute the average call duration for each job category, while also specifying a subset of marital status values.
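A minimal sketch; the subset of marital status values listed in the pivot is an assumption:

import org.apache.spark.sql.functions.avg

// Pivot on marital status, restricted to an explicit subset of its values,
// and compute the average call duration for each job category.
df.groupBy("job")
  .pivot("marital", Seq("married", "single"))
  .agg(avg("duration"))
  .show()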

The following example is the same as the preceding one, except that we also split the average call duration values by the housing loan field in this case.
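A minimal sketch of that variation, adding the housing field to the grouping columns:

import org.apache.spark.sql.functions.avg

// As before, but the average call duration is additionally split by the
// housing-loan field via the extra grouping column.
df.groupBy("job", "housing")
  .pivot("marital", Seq("married", "single"))
  .agg(avg("duration"))
  .show()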

Next, we show how you can create a DataFrame of a pivot table of deposits subscribed by month, save it to disk, and read it back into an RDD.
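A minimal sketch of this round trip; the deposit and month column names and the output path are assumptions:

// Pivot deposits subscribed by month and persist the result as CSV files.
val depositsByMonth = df.groupBy("deposit").pivot("month").count()
depositsByMonth.write.mode("overwrite").option("header", "true").csv("file:///tmp/deposits_by_month")

// Read the saved files back as an RDD of text lines.
val depositsRDD = spark.sparkContext.textFile("file:///tmp/deposits_by_month")
depositsRDD.take(5).foreach(println)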

Further, we use the RDD from the preceding step to...

Summary


In this chapter, we demonstrated using Spark SQL for exploring Datasets, performing basic data quality checks, generating samples and pivot tables, and visualizing data with Apache Zeppelin.

In the next chapter, we will shift our focus to data munging/wrangling. We will introduce techniques to handle missing data, bad data, duplicate records, and so on. We will also use extensive hands-on sessions for demonstrating the use of Spark SQL for common data munging tasks.
