MLlib is the machine learning (ML) library provided with Apache Spark, the in-memory, cluster-based, open source data processing system. In this chapter, we will examine the functionality of the algorithms provided in the MLlib library across areas of machine learning such as classification, recommendation, and neural processing. For each algorithm, we'll provide working examples that tackle real problems, taking a step-by-step approach to describe how these algorithms can be used and what they are capable of doing.
Big data machine learning typically takes place in three steps: collect, analyze, and predict. To support this, the Spark ecosystem handles a wide range of workloads, including batch applications, iterative algorithms, interactive queries, and stream processing. The Spark MLlib component offers a variety of scalable ML algorithms.
Spark MLlib's goal is to make practical ML scalable and easy. Similar to Spark Core, MLlib provides APIs in three languages, that is, Python, Scala, and Java, with example code that eases the learning curve for users coming from different backgrounds. The pipeline API in MLlib provides a uniform set of high-level APIs built on top of DataFrames that helps users create and tune practical ML pipelines. This API lives under a new package named spark.ml.
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline or workflow. Let's look at the key terms introduced by the pipeline API:
DataFrame: The ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
Transformer: A transformer is an algorithm which can transform one DataFrame into another DataFrame...
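To make the transformer idea concrete, here is a minimal plain-Python sketch (not Spark code; a dataset is modeled as a list of row dictionaries, and the class name is hypothetical). As in Spark, the transform step produces a new dataset with an appended column and leaves the input unchanged:

```python
# Minimal sketch of the Transformer concept: map one dataset to another
# by appending a derived column, without mutating the input.
class UpperCaseTransformer:
    """Hypothetical transformer: adds an 'upper' column derived from 'text'."""
    def transform(self, rows):
        # Build new row dicts; the original rows are left untouched,
        # mirroring the immutability of Spark DataFrames.
        return [dict(row, upper=row["text"].upper()) for row in rows]

rows = [{"text": "spark"}, {"text": "mllib"}]
out = UpperCaseTransformer().transform(rows)
print(out[0]["upper"])  # → SPARK
```

In Spark ML, an ML model is itself a transformer of this kind: it transforms a DataFrame with features into a DataFrame with predictions appended.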
Naive Bayes is a simple probabilistic classifier based on Bayes' theorem. This classifier calculates the most probable output class for a given input, and new training data can be added at runtime to refine it. The Naive Bayes model is typically used for classification. Given a set of features X1, X2, ..., Xn observed for an instance, the goal is to infer to which class, among a finite set of classes, that instance belongs. The model makes the assumption that every pair of features Xi and Xj is conditionally independent given the class. This classifier is a special case of a Bayesian network. For more information about the classifier, please refer to http://www.statsoft.com/textbook/naive-bayes-classifier.
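The conditional-independence assumption is what makes the classifier cheap: the joint likelihood factors into per-feature counts. The following is a plain-Python sketch of a categorical Naive Bayes with Laplace smoothing on a tiny, made-up weather dataset (the data and function names are illustrative, not from the recipe; Spark MLlib's NaiveBayes trains the same kind of model at scale):

```python
from collections import Counter
import math

def train_nb(X, y):
    """Train a categorical Naive Bayes model: just count class and
    per-class feature-value frequencies."""
    classes = Counter(y)
    # counts[c][j][v] = rows of class c whose feature j has value v
    counts = {c: [Counter() for _ in range(len(X[0]))] for c in classes}
    for xs, c in zip(X, y):
        for j, v in enumerate(xs):
            counts[c][j][v] += 1
    return classes, counts, len(y)

def predict_nb(model, xs):
    """Return the class maximizing log P(c) + sum_j log P(Xj = xj | c),
    with add-one (Laplace) smoothing for unseen feature values."""
    classes, counts, n = model
    best, best_lp = None, -math.inf
    for c, nc in classes.items():
        lp = math.log(nc / n)  # log prior
        for j, v in enumerate(xs):
            smooth = len(counts[c][j]) + 1  # crude smoothing denominator
            lp += math.log((counts[c][j][v] + 1) / (nc + smooth))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy weather data: (outlook, humidity) -> play?
X = [("sunny", "high"), ("sunny", "high"), ("overcast", "high"),
     ("rain", "normal"), ("rain", "normal"), ("overcast", "normal")]
y = ["no", "no", "yes", "yes", "yes", "yes"]
model = train_nb(X, y)
print(predict_nb(model, ("sunny", "high")))  # → no
```

Note how each feature contributes one independent log term; that additive structure is exactly the naive independence assumption at work.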
This recipe shows how to run the Naive Bayes classifier on the weather dataset using the implementation available in the Spark MLlib package. The code is written in...
Decision trees are among the most widely used machine learning algorithms in practice for classification and regression. They are easy to interpret, handle categorical features, and extend naturally to multiclass classification. The decision tree model is a powerful, non-probabilistic technique that captures complex nonlinear patterns and feature interactions. Their output is easy to understand, and they are not hard to use, since little parameter tuning is required.
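The core of tree learning is repeatedly choosing the split that best separates the classes. Below is an illustrative plain-Python sketch of that single step, selecting a (feature, threshold) pair by minimizing weighted Gini impurity on made-up data (the labels echo the recipe's ephemeral/evergreen task, but the data itself is invented; Spark MLlib's DecisionTree applies this idea recursively and in a distributed fashion):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Pick the (feature, threshold) minimizing the weighted Gini
    impurity of the two resulting partitions."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

# Toy data: the first feature perfectly separates the classes.
X = [(1.0, 5.0), (2.0, 1.0), (8.0, 4.0), (9.0, 2.0)]
y = ["ephemeral", "ephemeral", "evergreen", "evergreen"]
score, feature, threshold = best_split(X, y)
print(feature, threshold)  # → 0 2.0
```

A full tree simply recurses on the left and right partitions until a depth limit or purity criterion is met, which is why depth is the main parameter to tune.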
This recipe shows how to run the decision tree on web content, evaluating a large set of URLs and classifying them as ephemeral (that is, short-lived and soon to cease being popular) or evergreen (that is, remaining popular for a long time). The decision tree implementation is available in the Spark MLlib package. The code is written in Scala.
Recommendation engines are a popular application of machine learning algorithms. People have likely experienced them on popular websites such as Amazon, Netflix, YouTube, Twitter, LinkedIn, and Facebook. The idea behind recommendation engines is to predict what people might like and to uncover relationships between items to aid the discovery process.
Recommender systems are widely studied, and there are many approaches, such as content-based filtering and collaborative filtering. Other approaches, such as ranking models, have also gained popularity. Since Spark's recommendation models include only an implementation of matrix factorization, this recipe shows how to run matrix factorization on rating datasets from the MovieLens website.
The algorithm is available in the Spark MLlib package. The code is written in Scala.
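Matrix factorization represents each user and each item as a small latent-factor vector, so that a rating is approximated by their dot product. As a self-contained illustration (not the recipe's Scala code), here is a plain-Python sketch that fits such factors by stochastic gradient descent on a few made-up MovieLens-style triples; Spark's ALS solves the same model, but by alternating least squares over a cluster:

```python
import random

def factorize(ratings, n_users, n_items, k=2, steps=2000, lr=0.01, reg=0.05):
    """Fit rank-k user/item factor matrices U, V so that U[u] . V[i]
    approximates each observed rating, via regularized SGD."""
    random.seed(42)
    U = [[random.uniform(0, 0.5) for _ in range(k)] for _ in range(n_users)]
    V = [[random.uniform(0, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)  # gradient step on U
                V[i][f] += lr * (err * uf - reg * vf)  # gradient step on V
    return U, V

# Invented (user, item, rating) triples in MovieLens style.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 4.0), (2, 2, 5.0)]
U, V = factorize(ratings, n_users=3, n_items=3)
pred = sum(U[0][f] * V[0][f] for f in range(2))
print(pred)  # should land close to the observed rating of 5.0
```

Once the factors are learned, recommending is just scoring unrated items for a user with the same dot product and returning the top-scoring ones.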
In this recipe, let's see how to run the logistic regression algorithm using Spark ML pipelines.
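Before turning to the pipeline itself, it may help to see what the model does. This is an illustrative pure-Python sketch of binary logistic regression trained by gradient descent on invented, linearly separable data (Spark ML's LogisticRegression fits the same model, just distributed and with more robust optimizers):

```python
import math

def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by per-sample gradient descent on the
    logistic (log-loss) objective."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xs, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xs))
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid: predicted P(y=1)
            g = p - yi                      # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xs)]
            b -= lr * g
    return w, b

def predict(w, b, xs):
    """Classify by the sign of the linear score."""
    z = b + sum(wj * xj for wj, xj in zip(w, xs))
    return 1 if z > 0 else 0

# Invented linearly separable toy data.
X = [(0.0, 1.0), (1.0, 1.5), (3.0, 4.0), (4.0, 5.0)]
y = [0, 0, 1, 1]
w, b = train_logreg(X, y)
print([predict(w, b, xs) for xs in X])  # → [0, 0, 1, 1]
```

In the Spark ML pipeline version, feature extraction stages feed a DataFrame of feature vectors into the LogisticRegression estimator, and fitting produces a model transformer that appends a prediction column.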
To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also, include the Spark MLlib package in the build.sbt file so that it downloads the related libraries and the API can be used. Install Scala, Java, and, optionally, Hadoop.
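A minimal build.sbt fragment for pulling in MLlib might look like the following; the version numbers here are examples only, and should be matched to the Spark and Scala versions actually installed on your cluster:

```scala
// build.sbt (sketch): versions shown are examples, not prescriptions.
name := "spark-mllib-recipes"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.0.0" % "provided"
)
```

The "provided" scope keeps the Spark jars out of your assembled application jar, since the cluster supplies them at runtime.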
Please download the dataset from the following location: https://github.com/ChitturiPadma/datasets/blob/master/Community_Dataset.csv.