
Chapter 6. Batch Analytics with Apache Spark

In this chapter, you will learn about Apache Spark and how to use it for big data analytics based on a batch processing model. Spark SQL is a component on top of Spark Core that can be used to query structured data. It is becoming the de facto tool, replacing Hive as the choice for batch analytics on Hadoop.

Moreover, you will learn how to use Spark for the analysis of structured data (unstructured data, such as a document containing arbitrary text or some other format, has to be transformed into a structured form first). We will see how DataFrames/Datasets are the cornerstone here, and how Spark SQL's APIs make querying structured data simple yet robust.

We will also introduce datasets and see the difference between datasets, DataFrames, and RDDs. In a nutshell, the following topics will be covered in this chapter:

  • Spark SQL and DataFrames
  • DataFrames and the SQL API
  • DataFrame schema
  • Datasets and encoders
  • Loading and saving data
  • Aggregations
  • Joins

Spark SQL and DataFrames


Before Apache Spark, Apache Hive was the go-to technology whenever anyone wanted to run an SQL-like query on large amounts of data. Apache Hive essentially translated SQL queries into MapReduce-like logic automatically, making it very easy to perform many kinds of analytics on big data without actually having to learn to write complex code in Java or Scala.

With the advent of Apache Spark, there was a paradigm shift in how we could perform analysis at big data scale. Spark SQL provides an SQL-like layer on top of Apache Spark's distributed computation abilities that is rather simple to use. In fact, Spark SQL can be used as an online analytical processing database. Spark SQL works by parsing an SQL-like statement into an abstract syntax tree (AST), subsequently converting the AST into a logical plan, and then optimizing the logical plan into a physical plan that can be executed, as shown in the following diagram:

The final execution uses the underlying DataFrame API, making...
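
A quick way to see these plans for yourself is to call explain(true) on a DataFrame, which prints the parsed and analyzed logical plans, the optimized logical plan, and the physical plan. The following is a minimal sketch using a small in-memory DataFrame (the data and column names here are illustrative, not from the book):

scala> import spark.implicits._
scala> val df = Seq(("Alabama", 2010, 4785492), ("Alaska", 2010, 714031)).toDF("State", "Year", "Population")
scala> // prints the parsed/analyzed/optimized logical plans and the generated physical plan
scala> df.filter($"Year" === 2010).groupBy($"State").count().explain(true)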

DataFrame APIs and the SQL API


A DataFrame can be created in several ways; some of them are as follows:

  • Executing SQL queries, or loading external data such as Parquet, JSON, CSV, text, Hive, JDBC, and so on
  • Converting RDDs to DataFrames
  • Loading a CSV file

We will take a look at statesPopulation.csv here, which we will then load as a DataFrame.

The CSV contains the population of US states for the years 2010 to 2016, in the following format:

State        Year   Population
Alabama      2010   4,785,492
Alaska       2010   714,031
Arizona      2010   6,408,312
Arkansas     2010   2,921,995
California   2010   37,332,685

Since this CSV has a header, we can use it to quickly load into a DataFrame with an implicit schema detection:

scala> val statesDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesPopulation.csv")
statesDF: org.apache.spark.sql.DataFrame = [State: string, Year: int ... 1 more field]

Once the DataFrame is loaded, we can examine its schema:

scala> statesDF.printSchema
root
 |-- State: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Population: integer (nullable = true)
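
Since this section covers the DataFrame API as well as the SQL API, the same DataFrame can also be queried with SQL once it is registered as a temporary view. The following is a minimal sketch; the view name statesView and the query are illustrative:

scala> statesDF.createOrReplaceTempView("statesView")   // expose the DataFrame to the SQL engine under a view name
scala> val top2010DF = spark.sql("SELECT State, Population FROM statesView WHERE Year = 2010 ORDER BY Population DESC")
scala> top2010DF.show(5)   // the equivalent DataFrame API call would be statesDF.filter($"Year" === 2010).orderBy($"Population".desc)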

Schema – structure of data


A schema is the description of the structure of your data and can be either implicit or explicit. Since DataFrames are internally based on RDDs, there are two main ways to convert an existing RDD into a Dataset (both approaches are sketched right after this list):

  • Using reflection to infer the schema of the RDD
  • Using a programmatic interface that lets you take an existing RDD and apply a schema to it, converting the RDD into a Dataset with a schema
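
The following is a minimal sketch of both approaches; the case class State, the sample rows, and the column names are illustrative, not taken from the book:

scala> import spark.implicits._
scala> import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.Row

scala> // Approach 1 - reflection: the schema is inferred from the fields of a case class
scala> case class State(name: String, year: Int, population: Long)
scala> val statesRDD = spark.sparkContext.parallelize(Seq(State("Alabama", 2010, 4785492L), State("Alaska", 2010, 714031L)))
scala> val dfByReflection = statesRDD.toDF()

scala> // Approach 2 - programmatic: build a StructType explicitly and apply it to an RDD of Rows
scala> val schema = StructType(Seq(
     |   StructField("name", StringType, nullable = true),
     |   StructField("year", IntegerType, nullable = true),
     |   StructField("population", LongType, nullable = true)))
scala> val rowRDD = spark.sparkContext.parallelize(Seq(Row("Alabama", 2010, 4785492L), Row("Alaska", 2010, 714031L)))
scala> val dfBySchema = spark.createDataFrame(rowRDD, schema)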

Implicit schema

Let's look at an example of loading a comma-separated values (CSV) file into a DataFrame. Whenever a text file contains a header, the read API can infer the schema by reading the header line. We also have the option to specify the separator to be used to split the text file lines.

We read the CSV, inferring the schema from the header line, and use the comma (,) as the separator. We also show the use of the schema and printSchema commands to verify the schema of the input file:

scala> val statesDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesPopulation.csv")

Loading datasets


Spark SQL can read data from external storage systems such as files, Hive tables, and JDBC databases through the DataFrameReader interface.

The format of the API call is spark.read.inputtype:

  • Parquet
  • CSV
  • Hive table
  • JDBC
  • ORC
  • Text
  • JSON

Let's look at a couple of simple examples of reading CSV files into DataFrames:

scala> val statesPopulationDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesPopulation.csv")
statesPopulationDF: org.apache.spark.sql.DataFrame = [State: string, Year: int ... 1 more field]

scala> val statesTaxRatesDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesTaxRates.csv")
statesTaxRatesDF: org.apache.spark.sql.DataFrame = [State: string, TaxRate: double]
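
The other input types follow the same pattern through spark.read. The following is a minimal sketch; the Parquet/JSON file names and the JDBC connection details are illustrative placeholders, not files provided with the book:

scala> val statesParquetDF = spark.read.parquet("statesPopulation.parquet")   // Parquet files carry their own schema, so no inference options are needed
scala> val statesJsonDF = spark.read.json("statesPopulation.json")            // the JSON schema is inferred by sampling the records
scala> val statesJdbcDF = spark.read.format("jdbc").option("url", "jdbc:postgresql://dbhost:5432/mydb").option("dbtable", "states").option("user", "spark").option("password", "secret").load()   // requires the JDBC driver on the classpath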

Saving datasets


Spark SQL can save data to external storage systems such as files, Hive tables, and JDBC databases through the DataFrameWriter interface.

The format of the API call is dataframe.write.outputtype:

  • Parquet
  • ORC
  • Text
  • Hive table
  • JSON
  • CSV
  • JDBC

Let's look at a couple of examples of writing or saving a DataFrame to a CSV file:

scala> statesPopulationDF.write.option("header", "true").csv("statesPopulation_dup.csv")
scala> statesTaxRatesDF.write.option("header", "true").csv("statesTaxRates_dup.csv")

Aggregations


Aggregation is the process of grouping data together based on a condition and performing analytics on the groups. Aggregation is very important for making sense of data of all sizes, as raw records alone are not that useful for most use cases.

Note

Imagine a table containing one temperature measurement per day for every city in the world for five years.

For example, if you look at the following table and then at the aggregated view of the same data, it becomes obvious that raw records alone do not help you understand the data. Shown below is the raw data in the form of a table:

City           Date         Temperature
Boston         12/23/2016   32
New York       12/24/2016   36
Boston         12/24/2016   30
Philadelphia   12/25/2016   34
Boston         12/25/2016   28

 

Shown below is the average temperature per city:

City           AverageTemperature
Boston         30 = (32 + 30 + 28)/3
New York       36
Philadelphia   34

Aggregate functions

Aggregations can be performed with the help of the functions found in the org.apache.spark.sql.functions package. In addition to this, custom aggregate functions can also be created.
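
As a minimal sketch of the temperature example above (the temperaturesDF DataFrame is built in-line here purely for illustration):

scala> import spark.implicits._
scala> import org.apache.spark.sql.functions._
scala> val temperaturesDF = Seq(
     |   ("Boston", "12/23/2016", 32), ("New York", "12/24/2016", 36), ("Boston", "12/24/2016", 30),
     |   ("Philadelphia", "12/25/2016", 34), ("Boston", "12/25/2016", 28)).toDF("City", "Date", "Temperature")
scala> temperaturesDF.groupBy("City").agg(avg("Temperature").alias("AverageTemperature")).show()   // one averaged row per city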

Joins


In traditional databases, joins are used to join one transaction table with another lookup table to generate a more complete view. For example, if you have a table of online transactions sorted by customer ID and another table containing the customer city and customer ID, you can use join to generate reports on the transactions sorted by city.

Transactions table: This table has three columns: the CustomerID, the item purchased, and the price the customer paid for the item:

CustomerID   Purchased Item   Price Paid
1            Headphones       25.00
2            Watch            20.00
3            Keyboard         20.00
1            Mouse            10.00
4            Cable            10.00
3            Headphones       30.00

Customer Info table: This table has two columns: the CustomerID and the City the customer lives in:

CustomerID   City
1            Boston
2            New York
3            Philadelphia
4            Boston

Joining the transactions table with the customer info table will generate a view as follows:

CustomerID   Purchased Item   Price Paid   City
1            Headphones       25.00        Boston
2            Watch            20.00        New York
3            Keyboard         20.00        Philadelphia
1            Mouse            10.00        Boston
4            Cable            10.00        Boston
3            Headphones       30.00        Philadelphia
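
The following is a minimal sketch of how this join could be expressed with the DataFrame API; the two DataFrames are built in-line here purely for illustration:

scala> import spark.implicits._
scala> val transactionsDF = Seq(
     |   (1, "Headphones", 25.00), (2, "Watch", 20.00), (3, "Keyboard", 20.00),
     |   (1, "Mouse", 10.00), (4, "Cable", 10.00), (3, "Headphones", 30.00)).toDF("CustomerID", "PurchasedItem", "PricePaid")
scala> val customersDF = Seq((1, "Boston"), (2, "New York"), (3, "Philadelphia"), (4, "Boston")).toDF("CustomerID", "City")
scala> val joinedDF = transactionsDF.join(customersDF, "CustomerID")   // inner join on the common CustomerID column
scala> joinedDF.show()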

Summary


In this chapter, we discussed the origin of DataFrames and how Spark SQL provides an SQL interface on top of the DataFrame API. The power of DataFrames is such that execution times have decreased compared with the original RDD-based computations, and having such a powerful layer behind a simple SQL-like interface makes it all the more useful. We also looked at the various APIs for creating and manipulating DataFrames, and dug deeper into the sophisticated features of aggregations, including groupBy, Window, rollup, and cube. Finally, we looked at the concept of joining datasets and the various types of joins possible, such as inner, outer, cross, and so on.

We will explore the exciting world of real-time data processing and analytics in Chapter 7, Real-Time Analytics with Apache Spark.

