Chapter 4. Using Spark SQL for Data Munging

In this code-intensive chapter, we will present key data munging techniques used to transform raw data into a usable format for analysis. We start with some general data munging steps that are applicable in a wide variety of scenarios. Then, we shift our focus to specific types of data, including time-series data, text, and data preprocessing steps for Spark MLlib-based machine learning pipelines. We will use several Datasets to illustrate these techniques.

In this chapter, we shall learn:

  • Introducing data munging
  • Exploring data munging techniques
  • Combining data using joins
  • Munging textual data
  • Munging time-series data
  • Dealing with variable length records
  • Preparing data for machine learning pipelines

Introducing data munging


Raw data is typically messy and requires a series of transformations before it becomes useful for modeling and analysis work. Such Datasets can have missing data, duplicate records, corrupted data, incomplete records, and so on. In its simplest form, data munging, or data wrangling, is the transformation of raw data into a usable format. In most projects, this is the most challenging and time-consuming step.

However, without data munging, your project can devolve into a garbage-in, garbage-out scenario.

Typically, you will apply a series of operations such as subsetting, filtering, aggregating, sorting, merging, and reshaping the data. You will also perform type conversions, add new fields/columns, rename fields/columns, and so on.
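For example, a minimal sketch of such a sequence of operations on a DataFrame (the input file, column names, and logic here are purely illustrative assumptions) might look like this:

import org.apache.spark.sql.functions._
import spark.implicits._

// Illustrative munging pipeline: filter, derive, rename, aggregate, and sort.
val df = spark.read.json("raw_records.json")
val munged = df.filter($"amount" > 0)                  // subset/filter out bad rows
  .withColumn("amount_usd", $"amount" * $"fx_rate")    // add a new column
  .withColumnRenamed("ts", "timestamp")                // rename a column
  .groupBy($"timestamp")                               // aggregate
  .agg(sum($"amount_usd").alias("total_usd"))
  .orderBy($"timestamp")                               // sort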

A large project can comprise several different kinds of data with varying degrees of data quality. There can be a mix of numerical, textual, time-series, structured, and unstructured data, including audio and video data, used together...

Exploring data munging techniques


In this section, we will introduce several munging techniques using household electric consumption and weather Datasets. The best way to learn these techniques is to practice manipulating the data contained in various publicly available Datasets (in addition to the ones used here). The more you practice, the better you will get at it. In the process, you will probably evolve your own style, and develop several toolsets and techniques to achieve your munging objectives. At a minimum, you should get very comfortable working with, and moving between, RDDs, DataFrames, and Datasets, and computing counts, distinct counts, and various aggregations to cross-check your results and match your intuitive understanding of the Datasets. It is also important to develop the ability to make decisions based on the pros and cons of executing any given munging step.
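For instance, quick counts and distinct counts after each munging step help cross-check your results. A minimal sketch, assuming a hypothetical file name, separator, and column names for the household electric consumption Dataset:

// Illustrative cross-checks; file name, separator, and columns are assumptions.
val hhDF = spark.read
  .option("header", "true")
  .option("sep", ";")
  .option("inferSchema", "true")
  .csv("household_power_consumption.txt")

hhDF.count()                                 // total number of records
hhDF.select("Date").distinct().count()       // number of distinct dates
hhDF.describe("Global_active_power").show()  // basic summary statistics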

We will attempt to accomplish the following objectives in this section:

  1. Pre-process...

Munging textual data


In this section, we explore data munging techniques for typical text analysis situations. Many text-based analysis tasks require computing word counts, removing stop words, stemming, and so on. We will also explore how you can process multiple files, one at a time, from HDFS directories.

First, we import all the classes that will be used in this section:
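A plausible, minimal set of imports for the steps in this section (an assumption; adjust to your environment) is:

scala> import scala.collection.mutable.ArrayBuffer
scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> import org.apache.spark.rdd.RDD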

Processing multiple input data files

In the next few steps, we initialize a set of variables defining the directory that contains the input files, and an empty RDD. We also create a list of the filenames in the input HDFS directory. In the following example, we will work with files contained in a single directory; however, the techniques can easily be extended across all 20 Newsgroups sub-directories.
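A minimal sketch of these steps, assuming a hypothetical HDFS path for one of the 20 Newsgroups sub-directories and using the classes imported above:

// Hypothetical input directory; adjust to your HDFS layout.
val inputDir = "hdfs://localhost:9000/data/20_newsgroups/comp.graphics"

// An empty RDD to accumulate per-word counts into later.
var wordCountsRDD = sc.emptyRDD[(String, Int)]

// List the filenames in the input HDFS directory.
val fs = FileSystem.get(sc.hadoopConfiguration)
val fileNames = fs.listStatus(new Path(inputDir)).map(_.getPath.toString)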

Next, we write a function to compute the word counts for each file and collect the results in an ArrayBuffer:
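A sketch of such a function, under the same assumptions as above (note the print statement, discussed next):

// Compute the word count for each file and collect (fileName, count) pairs.
val results = new ArrayBuffer[(String, Long)]()
for (file <- fileNames) {
  println("Processing file: " + file)   // display each file name as it is picked up
  val wordCount = sc.textFile(file)
    .flatMap(_.split("""\W+"""))        // split on non-word characters
    .filter(_.nonEmpty)
    .count()
  results += ((file, wordCount))
}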

We have included a print statement to display the file names as they are picked up for processing.

We...

Munging time series data


Time series data is a sequence of values, each linked to a timestamp. In this section, we use Cloudera's spark-ts package for analyzing time-series data.

Note

Refer to the Cloudera Engineering Blog post, A New Library for Analyzing Time-Series Data with Apache Spark, for more details on time-series data and its processing using spark-ts. The project is available at: https://github.com/sryza/spark-timeseries.

The spark-ts package can be downloaded and built using the instructions available at:

https://github.com/sryza/spark-timeseries.

We will attempt to accomplish the following objectives in the sub-sections that follow:

  • Pre-processing of the time-series Dataset
  • Processing date fields
  • Persisting and loading data
  • Defining a date-time index
  • Using the TimeSeriesRDD object
  • Handling missing time-series data
  • Computing basic statistics

For this section, include the spark-ts JAR file when starting the Spark shell, as shown:
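For example (the JAR file name and version here are assumptions; use the JAR you downloaded or built):

./bin/spark-shell --jars sparkts-0.4.0-jar-with-dependencies.jar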

We download Datasets containing pricing and volume data for six stocks over a one year...
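To give a flavor of the API ahead of the detailed steps, here is a sketch based on the spark-ts documentation; the DataFrame obsDF, with timestamp, symbol, and price columns, and the date range are assumptions:

import java.time.{ZoneId, ZonedDateTime}
import com.cloudera.sparkts.{BusinessDayFrequency, DateTimeIndex, TimeSeriesRDD}

// An illustrative one-year business-day index.
val zone = ZoneId.systemDefault()
val dtIndex = DateTimeIndex.uniformFromInterval(
  ZonedDateTime.of(2016, 1, 4, 0, 0, 0, 0, zone),
  ZonedDateTime.of(2016, 12, 30, 0, 0, 0, 0, zone),
  new BusinessDayFrequency(1))

// obsDF is an assumed DataFrame of (timestamp, symbol, price) observations.
val tsRdd = TimeSeriesRDD.timeSeriesRDDFromObservations(
  dtIndex, obsDF, "timestamp", "symbol", "price")

// Linearly interpolate any missing values.
val filled = tsRdd.fill("linear")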

Dealing with variable length records


In this section, we will explore a way of dealing with variable length records. Our approach converts each row into a fixed length record equal in length to the maximum length record. As each row in our example represents a portfolio and there is no unique identifier, this method is useful for manipulating the data into the familiar fixed length records case. We will generate the requisite number of fields to equal the number of stocks in the largest portfolio. This leads to empty fields where the number of stocks is less than the maximum. Another way to deal with variable length records is to use the explode() function to create new rows for each stock in a given portfolio (for an example of using the explode() function, refer to Chapter 9, Developing Applications with Spark SQL).

To avoid repeating all the steps from previous examples to read in all the files, we have combined the data into a single input file...
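A minimal sketch of the padding approach, assuming the combined input file contains one comma-separated portfolio per line:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical combined input file: one comma-separated portfolio per line.
val lines = sc.textFile("combined_portfolios.txt")
val rows  = lines.map(_.split(",").map(_.trim))

// Pad every row to the length of the largest portfolio.
val maxLen = rows.map(_.length).max()
val padded = rows.map(r => Row.fromSeq(r.padTo(maxLen, "")))  // empty fields where shorter

// Build fixed length records with generated field names.
val schema = StructType((1 to maxLen).map(i => StructField(s"stock$i", StringType)))
val portfolioDF = spark.createDataFrame(padded, schema)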

Preparing data for machine learning


In this section, we introduce the process of preparing data prior to applying Spark MLlib algorithms. Typically, we need two columns called label and features to use Spark MLlib classification algorithms. We will illustrate this with the following example:

We import the required classes for this section:

scala> import org.apache.spark.ml.Pipeline
scala> import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
scala> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
scala> import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer} 
scala> import org.apache.spark.ml.linalg.Vectors 

Pre-processing data for machine learning

We define a set of UDFs used in this section. These include, for example, a UDF that checks whether a string contains a specific substring, returning a 0.0 or 1.0 value for the label column. Another UDF is used to create...
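For example, a sketch of the substring-based labeling UDF (the DataFrame, column name, and substring are purely illustrative):

import org.apache.spark.sql.functions.{col, udf}

// Return 1.0 if the text contains the given substring, otherwise 0.0.
val containsAsLabel = udf { (text: String) =>
  if (text != null && text.contains("hockey")) 1.0 else 0.0   // "hockey" is illustrative
}

// inputDF is an assumed DataFrame with a text column.
val labeledDF = inputDF.withColumn("label", containsAsLabel(col("text")))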

Summary


In this chapter, we explored using Spark SQL for performing basic data munging/wrangling tasks. We covered munging textual data, working with variable length records, extracting data from "messy" columns, combining data using joins, and preparing data for machine learning applications. In addition, we used the spark-ts library to work with time-series data.

In the next chapter, we will shift our focus to Spark Streaming applications. We will introduce you to using Spark SQL in such applications. We will also include extensive hands-on sessions for demonstrating the use of Spark SQL in implementing the common use cases in Spark Streaming applications.

 
