Apache Spark for Data Science Cookbook

Chapter 5. Working with Spark MLlib

In this chapter, you will learn about the MLlib component of Spark. We will cover the following recipes:

  • Implementing Naive Bayes classification

  • Implementing decision trees

  • Building a recommendation system

  • Implementing logistic regression using Spark ML pipelines

Introduction


MLlib is the machine learning (ML) library provided with Apache Spark, the in-memory, cluster-based, open source data processing system. In this chapter, we will examine the functionality of the algorithms provided within the MLlib library across machine learning tasks such as classification, recommendation, and regression. For each algorithm, we'll provide working examples that tackle real problems, taking a step-by-step approach to describe how each algorithm can be used and what it is capable of doing.

Big data analytics and machine learning proceed in three steps: collect, analyze, and predict. To support this, the Spark ecosystem handles a wide range of workloads, including batch applications, iterative algorithms, interactive queries, and stream processing. The Spark MLlib component offers a variety of scalable ML algorithms.

Working with Spark ML pipelines


Spark MLlib's goal is to make practical ML scalable and easy. Like Spark Core, MLlib provides APIs in three languages (Python, Scala, and Java), along with example code that eases the learning curve for users coming from different backgrounds. The pipeline API in MLlib provides a uniform set of high-level APIs, built on top of DataFrames, that helps users create and tune practical ML pipelines. This API lives in the spark.ml package.

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline or workflow. Let's look at the key terms introduced by the pipeline API (a minimal pipeline sketch follows the list):

  • DataFrame: The ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

  • Transformer: A transformer is an algorithm which can transform one DataFrame into another DataFrame...
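
To make these terms concrete, here is a minimal sketch of a spark.ml pipeline that chains two Transformers (Tokenizer and HashingTF) with an Estimator (LogisticRegression). The toy data, column names, and parameters are illustrative assumptions, not taken from the book:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.sql.SparkSession

    object PipelineSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("PipelineSketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // Toy training data: (label, text) -- purely illustrative.
        val training = Seq(
          (1.0, "spark is fast"),
          (0.0, "hadoop map reduce"),
          (1.0, "spark mllib pipelines")
        ).toDF("label", "text")

        // Transformers: tokenize the text, then hash tokens into feature vectors.
        val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
        val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

        // Estimator: logistic regression fitted on the hashed features.
        val lr = new LogisticRegression().setMaxIter(10)

        // A Pipeline chains the stages; fit() returns a PipelineModel, itself a Transformer.
        val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
        model.transform(training).select("text", "prediction").show()

        spark.stop()
      }
    }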

Implementing Naive Bayes classification


Naive Bayes is a simple probabilistic classifier based on Bayes' theorem. Given an input, the classifier calculates the most probable output class, and new raw data can be added at runtime to obtain a better probabilistic classifier. The Naive Bayes model is typically used for classification. For each instance, a set of features X1, X2, ..., Xn is observed, and the goal is to infer to which class, among a limited set of classes, that instance belongs. The model makes the assumption that every pair of features Xi and Xj is conditionally independent given the class. This classifier is a subclass of Bayesian networks. For more information about the classifier, please refer to http://www.statsoft.com/textbook/naive-bayes-classifier.

This recipe shows how to run Naive Bayes classification on a weather dataset, using the Naive Bayes algorithm available in the Spark MLlib package. The code is written in Scala.
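
The recipe's full code is not shown in this excerpt. As a stand-in, here is a minimal hedged sketch of training MLlib's RDD-based NaiveBayes on LIBSVM-formatted data; the file path, split ratios, and smoothing parameter are illustrative assumptions:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.{SparkConf, SparkContext}

    object NaiveBayesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("NaiveBayesSketch").setMaster("local[*]"))

        // Load labeled weather records in LIBSVM format; the path is a placeholder.
        val data = MLUtils.loadLibSVMFile(sc, "data/weather_libsvm.txt")
        val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

        // Train multinomial Naive Bayes with additive (Laplace) smoothing.
        val model = NaiveBayes.train(training, lambda = 1.0)

        // Measure accuracy on the held-out split.
        val accuracy = test.map(p => (model.predict(p.features), p.label))
          .filter { case (pred, label) => pred == label }
          .count().toDouble / test.count()
        println(s"Test accuracy: $accuracy")

        sc.stop()
      }
    }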

Implementing decision trees


Decision trees are among the most widely used machine learning algorithms in practice for classification and regression. They are easy to interpret, handle categorical features, and extend to multiclass classification. The decision tree model is a powerful, non-probabilistic technique that captures complex nonlinear patterns and feature interactions. Its output is quite understandable, and it is easy to use since few parameters need to be tweaked.

This recipe shows how to run a decision tree on web content, evaluating a large set of URLs and classifying them as ephemeral (that is, short-lived and soon to stop being popular) or evergreen (popular for a long time). The algorithm is available in the Spark MLlib package. The code is written in Scala.

Getting ready

To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html.
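
The recipe's full code is not shown in this excerpt. Below is a minimal hedged sketch of training an MLlib decision tree classifier for a binary (evergreen versus ephemeral) problem; the file path, LIBSVM input format, and hyperparameters are illustrative assumptions:

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.{SparkConf, SparkContext}

    object DecisionTreeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("DecisionTreeSketch").setMaster("local[*]"))

        // Labeled URL features in LIBSVM format; label 1.0 = evergreen, 0.0 = ephemeral.
        val data = MLUtils.loadLibSVMFile(sc, "data/urls_libsvm.txt") // placeholder path
        val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

        // Binary classification with Gini impurity; depth and bins are illustrative.
        val model = DecisionTree.trainClassifier(
          training,
          numClasses = 2,
          categoricalFeaturesInfo = Map.empty[Int, Int],
          impurity = "gini",
          maxDepth = 5,
          maxBins = 32)

        // Misclassification rate on the held-out split.
        val testError = test.map(p => (model.predict(p.features), p.label))
          .filter { case (pred, label) => pred != label }
          .count().toDouble / test.count()
        println(s"Test error: $testError")
        println(model.toDebugString) // human-readable tree structure

        sc.stop()
      }
    }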

Building a recommendation system


Recommendation engines are one type of machine learning application. Most people have encountered them on popular websites such as Amazon, Netflix, YouTube, Twitter, LinkedIn, and Facebook. The idea behind recommendation engines is to predict what people might like and to uncover relationships between items to aid the discovery process.

Recommender systems are widely studied, and there are many approaches, such as content-based filtering and collaborative filtering. Other approaches, such as ranking models, have also gained popularity. Since Spark's recommendation models include only an implementation of matrix factorization, this recipe shows how to run matrix factorization on rating datasets from the MovieLens website.

The algorithm is available in the Spark MLlib package. The code is written in Scala.

Getting ready

To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos.
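
The recipe's full code is not shown in this excerpt. As a stand-in, here is a minimal hedged sketch of matrix factorization with MLlib's ALS on MovieLens-style ratings; the file path, rating format, and hyperparameters are illustrative assumptions:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.{SparkConf, SparkContext}

    object ALSSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("ALSSketch").setMaster("local[*]"))

        // MovieLens ratings: userId::movieId::rating::timestamp (path and format assumed).
        val ratings = sc.textFile("data/ml-1m/ratings.dat").map { line =>
          val fields = line.split("::")
          Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
        }

        // Factorize the user-item rating matrix; rank, iterations, and lambda are illustrative.
        val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)

        // Recommend five movies for a sample user.
        model.recommendProducts(user = 1, num = 5).foreach(println)

        sc.stop()
      }
    }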

Implementing logistic regression using Spark ML pipelines


In this recipe, let's see how to run the logistic regression algorithm using Spark ML pipelines.

Getting ready

To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also, include the Spark MLlib package in the build.sbt file so that it downloads the related libraries and the API can be used (see the example below). Optionally, install Hadoop; you will also need Scala and Java.
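
For example, a build.sbt for these recipes might declare the following dependencies; the Spark and Scala versions are illustrative assumptions and should match your installed cluster:

    // build.sbt -- versions are assumptions; align them with your cluster.
    name := "spark-mllib-recipes"
    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "2.0.2",
      "org.apache.spark" %% "spark-sql"   % "2.0.2",
      "org.apache.spark" %% "spark-mllib" % "2.0.2"
    )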

Note

Please download the dataset from the following location: https://github.com/ChitturiPadma/datasets/blob/master/Community_Dataset.csv.

How to do it…

  1. Let's have a look at some of the records in the dataset. The first column corresponds to the label and the other three columns are the features:

  2. Here is the code, which creates a DataFrame out of the previous data and creates a LogisticRegression instance:

     ...
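
Since the code is elided in this excerpt, here is a minimal hedged sketch of what such a step could look like with spark.ml: a VectorAssembler combines the three feature columns into a single vector, and a LogisticRegression estimator is fit inside a Pipeline. The column names (label, f1, f2, f3) and hyperparameters are illustrative assumptions based on the dataset description above:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    object LogisticRegressionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("LogisticRegressionSketch").master("local[*]").getOrCreate()

        // Read the CSV into a DataFrame; column names are assumed, not from the book.
        val df = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("Community_Dataset.csv")
          .toDF("label", "f1", "f2", "f3")

        // Assemble the three feature columns into a single vector column.
        val assembler = new VectorAssembler()
          .setInputCols(Array("f1", "f2", "f3"))
          .setOutputCol("features")

        // Logistic regression estimator; hyperparameters are illustrative.
        val lr = new LogisticRegression()
          .setMaxIter(10)
          .setRegParam(0.01)

        // Fit the two-stage pipeline and inspect a few predictions.
        val model = new Pipeline().setStages(Array(assembler, lr)).fit(df)
        model.transform(df).select("label", "prediction", "probability").show(5)

        spark.stop()
      }
    }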