Chapter 4. Using Spark SQL for Data Munging

In this code-intensive chapter, we will present key data munging techniques used to transform raw data into a usable format for analysis. We start with some general data munging steps that are applicable in a wide variety of scenarios. Then, we shift our focus to specific types of data, including time-series data, text, and data preprocessing steps for Spark MLlib-based machine learning pipelines. We will use several Datasets to illustrate these techniques.

In this chapter, we shall learn:

  • Introducing data munging
  • Exploring data munging techniques
  • Combining data using joins
  • Munging textual data
  • Munging time-series data
  • Dealing with variable length records
  • Preparing data for machine learning pipelines

Introducing data munging


Raw data is typically messy and requires a series of transformations before it becomes useful for modeling and analysis work. Such Datasets can have missing data, duplicate records, corrupted data, incomplete records, and so on. In its simplest form, data munging, or data wrangling, is the transformation of raw data into a usable format. In most projects, this is the most challenging and time-consuming step.

However, without data munging, your project can devolve into a garbage-in, garbage-out scenario.

Typically, you will apply a series of operations such as subsetting, filtering, aggregating, sorting, merging, and reshaping the data. You will also perform type conversions, add new fields/columns, rename fields/columns, and so on.
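For example, a minimal sketch of such a sequence of operations on a DataFrame (the input file, column names, and logic here are purely illustrative assumptions) might look like this:

import org.apache.spark.sql.functions._
import spark.implicits._

// Illustrative munging pipeline: filter, derive, rename, aggregate, and sort.
val df = spark.read.json("raw_records.json")
val munged = df.filter($"amount" > 0)                  // subset/filter out bad rows
  .withColumn("amount_usd", $"amount" * $"fx_rate")    // add a new column
  .withColumnRenamed("ts", "timestamp")                // rename a column
  .groupBy($"timestamp")                               // aggregate
  .agg(sum($"amount_usd").alias("total_usd"))
  .orderBy($"timestamp")                               // sort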

A large project can comprise several different kinds of data with varying degrees of data quality. There can be a mix of numerical, textual, time-series, structured, and unstructured data, including audio and video data, used together...

Exploring data munging techniques


In this section, we will introduce several munging techniques using household electric consumption and weather Datasets. The best way to learn these techniques is to practice manipulating the data contained in various publicly available Datasets (in addition to the ones used here). The more you practice, the better you will get at it. In the process, you will probably evolve your own style, and develop several toolsets and techniques to achieve your munging objectives. At a minimum, you should get very comfortable working with, and moving between, RDDs, DataFrames, and Datasets, and computing counts, distinct counts, and various aggregations to cross-check your results and match your intuitive understanding of the Datasets. It is also important to develop the ability to make decisions based on the pros and cons of executing any given munging step.
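For instance, quick counts and distinct counts after each munging step help cross-check your results. A minimal sketch, assuming a hypothetical file name, separator, and column names for the household electric consumption Dataset:

// Illustrative cross-checks; file name, separator, and columns are assumptions.
val hhDF = spark.read
  .option("header", "true")
  .option("sep", ";")
  .option("inferSchema", "true")
  .csv("household_power_consumption.txt")

hhDF.count()                                 // total number of records
hhDF.select("Date").distinct().count()       // number of distinct dates
hhDF.describe("Global_active_power").show()  // basic summary statistics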

We will attempt to accomplish the following objectives in this section:

  1. Pre-process...

Munging textual data


In this section, we explore data munging techniques for typical text analysis situations. Many text-based analysis tasks require computing word counts, removing stop words, stemming, and so on. We will also explore how you can process multiple files, one at a time, from HDFS directories.

First, we import all the classes that will be used in this section:
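A plausible, minimal set of imports for the steps in this section (an assumption; adjust to your environment) is:

scala> import scala.collection.mutable.ArrayBuffer
scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> import org.apache.spark.rdd.RDD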

Processing multiple input data files

In the next few steps, we initialize a set of variables defining the directory that contains the input files, and an empty RDD. We also create a list of the filenames in the input HDFS directory. In the following example, we will work with files contained in a single directory; however, the techniques can easily be extended across all 20 Newsgroups sub-directories.
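A minimal sketch of these steps, assuming a hypothetical HDFS path for one of the 20 Newsgroups sub-directories and using the classes imported above:

// Hypothetical input directory; adjust to your HDFS layout.
val inputDir = "hdfs://localhost:9000/data/20_newsgroups/comp.graphics"

// An empty RDD to accumulate per-word counts into later.
var wordCountsRDD = sc.emptyRDD[(String, Int)]

// List the filenames in the input HDFS directory.
val fs = FileSystem.get(sc.hadoopConfiguration)
val fileNames = fs.listStatus(new Path(inputDir)).map(_.getPath.toString)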

Next, we write a function to compute the word counts for each file and collect the results in an ArrayBuffer:
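A sketch of such a function, under the same assumptions as above (note the print statement, discussed next):

// Compute the word count for each file and collect (fileName, count) pairs.
val results = new ArrayBuffer[(String, Long)]()
for (file <- fileNames) {
  println("Processing file: " + file)   // display each file name as it is picked up
  val wordCount = sc.textFile(file)
    .flatMap(_.split("""\W+"""))        // split on non-word characters
    .filter(_.nonEmpty)
    .count()
  results += ((file, wordCount))
}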

We have included a print statement to display the file names as they are picked up for processing.

We...

Munging time series data


Time series data is a sequence of values, each linked to a timestamp. In this section, we use Cloudera's spark-ts package for analyzing time-series data.

Note

Refer to the Cloudera Engineering Blog post, A New Library for Analyzing Time-Series Data with Apache Spark, for more details on time-series data and its processing using spark-ts. The project is available at: https://github.com/sryza/spark-timeseries.

The spark-ts package can be downloaded and built using the instructions available at:

https://github.com/sryza/spark-timeseries.

We will attempt to accomplish the following objectives in the sub-sections that follow:

  • Pre-processing of the time-series Dataset
  • Processing date fields
  • Persisting and loading data
  • Defining a date-time index
  • Using the TimeSeriesRDD object
  • Handling missing time-series data
  • Computing basic statistics

For this section, include the spark-ts JAR file when starting the Spark shell, as shown:
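For example (the JAR file name and version here are assumptions; use the JAR you downloaded or built):

./bin/spark-shell --jars sparkts-0.4.0-jar-with-dependencies.jar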

We download Datasets containing pricing and volume data for six stocks over a one year...
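To give a flavor of the API ahead of the detailed steps, here is a sketch based on the spark-ts documentation; the DataFrame obsDF, with timestamp, symbol, and price columns, and the date range are assumptions:

import java.time.{ZoneId, ZonedDateTime}
import com.cloudera.sparkts.{BusinessDayFrequency, DateTimeIndex, TimeSeriesRDD}

// An illustrative one-year business-day index.
val zone = ZoneId.systemDefault()
val dtIndex = DateTimeIndex.uniformFromInterval(
  ZonedDateTime.of(2016, 1, 4, 0, 0, 0, 0, zone),
  ZonedDateTime.of(2016, 12, 30, 0, 0, 0, 0, zone),
  new BusinessDayFrequency(1))

// obsDF is an assumed DataFrame of (timestamp, symbol, price) observations.
val tsRdd = TimeSeriesRDD.timeSeriesRDDFromObservations(
  dtIndex, obsDF, "timestamp", "symbol", "price")

// Linearly interpolate any missing values.
val filled = tsRdd.fill("linear")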

Dealing with variable length records


In this section, we will explore a way of dealing with variable length records. Our approach converts each row into a fixed length record equal in length to the maximum length record. As each row in our example represents a portfolio and there is no unique identifier, this method is useful for manipulating the data into the familiar fixed length records case. We will generate the requisite number of fields to equal the number of stocks in the largest portfolio. This leads to empty fields where the number of stocks is less than the maximum. Another way to deal with variable length records is to use the explode() function to create new rows for each stock in a given portfolio (for an example of using the explode() function, refer to Chapter 9, Developing Applications with Spark SQL).

To avoid repeating all the steps from previous examples to read in all the files, we have combined the data into a single input file...
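A minimal sketch of the padding approach, assuming the combined input file contains one comma-separated portfolio per line:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical combined input file: one comma-separated portfolio per line.
val lines = sc.textFile("combined_portfolios.txt")
val rows  = lines.map(_.split(",").map(_.trim))

// Pad every row to the length of the largest portfolio.
val maxLen = rows.map(_.length).max()
val padded = rows.map(r => Row.fromSeq(r.padTo(maxLen, "")))  // empty fields where shorter

// Build fixed length records with generated field names.
val schema = StructType((1 to maxLen).map(i => StructField(s"stock$i", StringType)))
val portfolioDF = spark.createDataFrame(padded, schema)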

Preparing data for machine learning


In this section, we introduce the process of preparing data prior to applying Spark MLlib algorithms. Typically, we need two columns called label and features to use Spark MLlib classification algorithms. We will illustrate this with the following example:

We import the required classes for this section:

scala> import org.apache.spark.ml.Pipeline
scala> import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
scala> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
scala> import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer} 
scala> import org.apache.spark.ml.linalg.Vectors 

Pre-processing data for machine learning

We define a set of UDFs used in this section. These include, for example, a UDF that checks whether a string contains a specific substring, returning a 0.0 or 1.0 value for the label column. Another UDF is used to create...
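For example, a sketch of the substring-based labeling UDF (the DataFrame, column name, and substring are purely illustrative):

import org.apache.spark.sql.functions.{col, udf}

// Return 1.0 if the text contains the given substring, otherwise 0.0.
val containsAsLabel = udf { (text: String) =>
  if (text != null && text.contains("hockey")) 1.0 else 0.0   // "hockey" is illustrative
}

// inputDF is an assumed DataFrame with a text column.
val labeledDF = inputDF.withColumn("label", containsAsLabel(col("text")))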

Summary


In this chapter, we explored using Spark SQL for performing basic data munging/wrangling tasks. We covered munging textual data, working with variable length records, extracting data from "messy" columns, combining data using joins, and preparing data for machine learning applications. In addition, we used the spark-ts library to work with time-series data.

In the next chapter, we will shift our focus to Spark Streaming applications. We will introduce you to using Spark SQL in such applications. We will also include extensive hands-on sessions for demonstrating the use of Spark SQL in implementing the common use cases in Spark Streaming applications.

 
