Chapter 3. Data Analysis with Spark

In this chapter, we will cover the following recipes on performing data analysis with Spark:

  • Univariate analysis

  • Bivariate analysis

  • Missing value treatment

  • Outlier detection

  • Use case - analyzing the MovieLens dataset

  • Use case - analyzing the Uber dataset

Introduction


Data exploration and preparation techniques are typically applied before fitting models to the data, and they also support the development of more complex statistical models. They are equally important for eliminating or sharpening potential hypotheses that the data can address. The time invested in preprocessing and exploration determines the quality of the input, which in turn decides the quality of the output. Once the business hypothesis is framed, a careful sequence of exploration and preparation steps determines the accuracy of the model and the reliability of its results.

In this chapter, we will look at the following common data analysis techniques: univariate analysis, bivariate analysis, missing value treatment, outlier detection, and variable transformation.

Univariate analysis


Once the data is available, we have to spend a lot of time and effort on data exploration, cleaning, and preparation, because the quality of the input data decides the quality of the output. Hence, once we identify the business questions, the first step of data exploration/analysis is univariate analysis, which explores the variables one by one. The method of univariate analysis depends on whether the variable is categorical or continuous.

Getting ready

To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also, include the Spark MLlib package in the build.sbt file so that it downloads the related libraries and the API can be used. Install Hadoop (optionally), Scala, and Java.

How to do it…

  1. Let's take the example of the Titanic dataset. On April 15, 1912, the...
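Before going further, the following minimal sketch (runnable in spark-shell on Spark 2.x, where spark is the predefined SparkSession) shows the shape of univariate analysis. The titanic.csv path is illustrative, and the Age (continuous) and Sex (categorical) columns are assumed from the standard Titanic schema:

    // Load the data; the path and header option are assumptions.
    val titanic = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("titanic.csv")

    // Continuous variable: count, mean, stddev, min, and max in one call.
    titanic.describe("Age").show()

    // Categorical variable: a frequency table.
    titanic.groupBy("Sex").count().show()

describe() gives a quick numerical summary for a continuous variable, while groupBy/count produces the frequency distribution that univariate analysis calls for on a categorical one.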

Bivariate analysis


Bivariate analysis examines the relationship between two variables. Here, we look for association or disassociation between the variables at a predefined significance level. The analysis can be performed for any combination of categorical and continuous variables: both variables categorical, one categorical and one continuous, or both continuous.

Getting ready

To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also, include the Spark MLlib package in the build.sbt file so that it downloads the related libraries and the API can be used. Install Hadoop (optionally), Scala, and Java.

How to do it…

  1. After univariate analysis, let's try to perform bivariate analysis on various combinations of continuous and categorical variables...
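As a minimal sketch in spark-shell, again assuming the illustrative titanic.csv and standard Titanic column names (Age, Fare, Sex, Survived, Pclass), each combination can be explored as follows:

    val titanic = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("titanic.csv")

    // Continuous-continuous: Pearson correlation between two numeric columns.
    println(titanic.stat.corr("Age", "Fare"))

    // Categorical-categorical: a contingency table.
    titanic.stat.crosstab("Sex", "Survived").show()

    // Categorical-continuous: compare group means.
    titanic.groupBy("Pclass").avg("Fare").show()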

Missing value treatment


Missing data in the training dataset can reduce the fit of a model or lead to a biased one, because the behavior of, and relationships with, other variables have not been analyzed correctly. This can in turn produce wrong predictions or classifications. Missing values often arise while extracting data from multiple sources; verifying the extraction, for example with a hashing procedure, ensures the data was extracted correctly. Errors that occur at data-collection time are harder to correct, as values may be missing at random, or the missingness may depend on unobserved predictors.

Getting ready

To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also, include the Spark MLlib package...
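As a minimal sketch of the common treatments with the DataFrame API in spark-shell (assuming an illustrative titanic.csv whose Age column contains nulls):

    import org.apache.spark.sql.functions.avg

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("titanic.csv")

    // Option 1: drop rows containing a null in any column.
    val dropped = df.na.drop()

    // Option 2: drop rows only when Age is null.
    val droppedAge = df.na.drop(Seq("Age"))

    // Option 3: impute missing Age values with the column mean.
    val meanAge = df.agg(avg("Age")).first().getDouble(0)
    val imputed = df.na.fill(Map("Age" -> meanAge))

Dropping rows is the simplest treatment but loses information; mean imputation keeps every row at the cost of shrinking the variance of the imputed variable.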

Outlier detection


Outliers are infrequent observations, that is, data points that do not appear to follow the characteristic distribution of the rest of the data. They lie far from, and diverge from, the overall pattern of the data. They may occur due to measurement errors or other anomalies, and they can distort estimates. Outliers can be univariate or multivariate. Univariate outliers can be detected by looking at the distribution of a single variable, whereas multivariate outliers live in an n-dimensional space and can be found by examining the distributions across multiple dimensions.

Getting ready

To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also, include the Spark MLlib package in the build.sbt file so that it downloads the related libraries and the API can be used...
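As a minimal sketch, univariate outliers can be flagged with the interquartile range (IQR) rule in spark-shell, assuming the illustrative titanic.csv and its numeric Fare column:

    import org.apache.spark.sql.functions.col

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("titanic.csv")

    // Approximate the first and third quartiles (relative error 0.01).
    val Array(q1, q3) = df.stat.approxQuantile("Fare", Array(0.25, 0.75), 0.01)
    val iqr = q3 - q1

    // Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] as potential outliers.
    val outliers = df.filter(col("Fare") < q1 - 1.5 * iqr || col("Fare") > q3 + 1.5 * iqr)
    println(s"Potential outliers in Fare: ${outliers.count()}")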

Use case - analyzing the MovieLens dataset


In the previous recipes, we saw the various steps of performing data analysis. In this recipe, let's download a commonly used dataset for movie recommendations, known as the MovieLens dataset. It is well suited to recommender systems, and potentially to other machine learning tasks as well.

Getting ready

To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also, include the Spark MLlib package in the build.sbt file so that it downloads the related libraries and the API can be used. Install Hadoop (optionally), Scala, and Java.

How to do it…

Let's see how to analyse the MovieLens dataset.

  1. Let's download the MovieLens dataset from the following location: https://drive.google.com/file/d/0Bxr27gVaXO5sRUZnMjBQR0lqNDA/view?usp=sharing...
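A minimal first pass over the data might look as follows in spark-shell, assuming the download follows the common MovieLens ratings layout of userId::movieId::rating::timestamp (adjust the path and delimiter to whatever the archive actually contains):

    import org.apache.spark.sql.functions.{avg, count, desc}
    import spark.implicits._

    // Parse the "::"-delimited ratings file into a DataFrame.
    val ratings = spark.sparkContext.textFile("ratings.dat")
      .map(_.split("::"))
      .map(f => (f(0).toInt, f(1).toInt, f(2).toDouble))
      .toDF("userId", "movieId", "rating")

    // Number of ratings and average rating per movie, most-rated first.
    ratings.groupBy("movieId")
      .agg(count("rating").as("numRatings"), avg("rating").as("avgRating"))
      .orderBy(desc("numRatings"))
      .show(10)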

Use case - analyzing the Uber dataset


In the previous recipes, we saw the various steps of performing data analysis. In this recipe, let's download the Uber dataset and try to answer some of the analytical questions that arise from such data.

Getting ready

To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also, include the Spark MLlib package in the build.sbt file so that it downloads the related libraries and the API can be used. Install Hadoop (optionally), Scala, and Java.

How to do it…

In this section, let's see how to analyse the Uber dataset.

  1. Let's download the Uber dataset from the following location: https://github.com/ChitturiPadma/datasets/blob/master/uber.csv.

  2. The dataset contains four columns: dispatching_base_number, date, active_vehicles, and trips. Let's load the data and see what the...
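A minimal sketch of this loading step in spark-shell, assuming uber.csv has a header row with the four columns just listed:

    import org.apache.spark.sql.functions.{desc, sum}

    val uber = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("uber.csv")

    // Inspect the inferred schema before analysis.
    uber.printSchema()

    // Total trips per dispatching base, busiest base first.
    uber.groupBy("dispatching_base_number")
      .agg(sum("trips").as("totalTrips"))
      .orderBy(desc("totalTrips"))
      .show()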
