MLlib is the machine learning (ML) library provided with Apache Spark, the in-memory, cluster-based, open source data processing system. In this chapter, we will examine the functionality of the algorithms provided in the MLlib library across areas of machine learning such as classification, recommendation, and neural processing. For each algorithm, we'll provide working examples that tackle real problems, taking a step-by-step approach to describe how these algorithms can be used and what they are capable of doing.
Big data machine learning typically takes place in three steps: collect, analyze, and predict. To support this, the Spark ecosystem handles a wide range of workloads, including batch applications, iterative algorithms, interactive queries, and stream processing. The Spark MLlib component offers a variety of scalable ML algorithms.
Spark MLlib's goal is to make practical ML scalable and easy. Similar to Spark Core, MLlib provides APIs in three languages, that is, Python, Scala, and Java, with example code that eases the learning curve for users coming from different backgrounds. The pipeline API in MLlib provides a uniform set of high-level APIs built on top of DataFrames that helps users create and tune practical ML pipelines. This API lives under a new package named spark.ml.
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline or workflow. Let's look at the key terms introduced by the pipeline API:
DataFrame: The ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. For example, a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
Transformer: A transformer is an algorithm which can transform one DataFrame into another DataFrame...
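To make the transformer idea concrete, here is a minimal plain-Python sketch (not Spark code; a dataset is modeled as a list of row dictionaries, and the class name is hypothetical). As in Spark, the transform step produces a new dataset with an appended column and leaves the input unchanged:

```python
# Minimal sketch of the Transformer concept: map one dataset to another
# by appending a derived column, without mutating the input.
class UpperCaseTransformer:
    """Hypothetical transformer: adds an 'upper' column derived from 'text'."""
    def transform(self, rows):
        # Build new row dicts; the original rows are left untouched,
        # mirroring the immutability of Spark DataFrames.
        return [dict(row, upper=row["text"].upper()) for row in rows]

rows = [{"text": "spark"}, {"text": "mllib"}]
out = UpperCaseTransformer().transform(rows)
print(out[0]["upper"])  # → SPARK
```

In Spark ML, an ML model is itself a transformer of this kind: it transforms a DataFrame with features into a DataFrame with predictions appended.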
Naive Bayes is a simple probabilistic classifier based on Bayes' theorem. This classifier calculates the most probable output class for a given input, and new training data can be added at runtime to refine it. The Naive Bayes model is typically used for classification. Given a set of features X1, X2, ..., Xn observed for an instance, the goal is to infer to which class, among a finite set of classes, that instance belongs. The model makes the assumption that every pair of features Xi and Xj is conditionally independent given the class. This classifier is a special case of a Bayesian network. For more information about the classifier, please refer to http://www.statsoft.com/textbook/naive-bayes-classifier.
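The conditional-independence assumption is what makes the classifier cheap: the joint likelihood factors into per-feature counts. The following is a plain-Python sketch of a categorical Naive Bayes with Laplace smoothing on a tiny, made-up weather dataset (the data and function names are illustrative, not from the recipe; Spark MLlib's NaiveBayes trains the same kind of model at scale):

```python
from collections import Counter
import math

def train_nb(X, y):
    """Train a categorical Naive Bayes model: just count class and
    per-class feature-value frequencies."""
    classes = Counter(y)
    # counts[c][j][v] = rows of class c whose feature j has value v
    counts = {c: [Counter() for _ in range(len(X[0]))] for c in classes}
    for xs, c in zip(X, y):
        for j, v in enumerate(xs):
            counts[c][j][v] += 1
    return classes, counts, len(y)

def predict_nb(model, xs):
    """Return the class maximizing log P(c) + sum_j log P(Xj = xj | c),
    with add-one (Laplace) smoothing for unseen feature values."""
    classes, counts, n = model
    best, best_lp = None, -math.inf
    for c, nc in classes.items():
        lp = math.log(nc / n)  # log prior
        for j, v in enumerate(xs):
            smooth = len(counts[c][j]) + 1  # crude smoothing denominator
            lp += math.log((counts[c][j][v] + 1) / (nc + smooth))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy weather data: (outlook, humidity) -> play?
X = [("sunny", "high"), ("sunny", "high"), ("overcast", "high"),
     ("rain", "normal"), ("rain", "normal"), ("overcast", "normal")]
y = ["no", "no", "yes", "yes", "yes", "yes"]
model = train_nb(X, y)
print(predict_nb(model, ("sunny", "high")))  # → no
```

Note how each feature contributes one independent log term; that additive structure is exactly the naive independence assumption at work.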
This recipe shows how to run the Naive Bayes classifier on the weather dataset using the implementation available in the Spark MLlib package. The code is written in...
Decision trees are among the most widely used machine learning algorithms in practice for classification and regression. They are easy to interpret, handle categorical features, and extend naturally to multiclass classification. The decision tree model is a powerful, non-probabilistic technique that captures complex nonlinear patterns and feature interactions. Their output is easy to understand, and they are not hard to use, since little parameter tuning is required.
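The core of tree learning is repeatedly choosing the split that best separates the classes. Below is an illustrative plain-Python sketch of that single step, selecting a (feature, threshold) pair by minimizing weighted Gini impurity on made-up data (the labels echo the recipe's ephemeral/evergreen task, but the data itself is invented; Spark MLlib's DecisionTree applies this idea recursively and in a distributed fashion):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Pick the (feature, threshold) minimizing the weighted Gini
    impurity of the two resulting partitions."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

# Toy data: the first feature perfectly separates the classes.
X = [(1.0, 5.0), (2.0, 1.0), (8.0, 4.0), (9.0, 2.0)]
y = ["ephemeral", "ephemeral", "evergreen", "evergreen"]
score, feature, threshold = best_split(X, y)
print(feature, threshold)  # → 0 2.0
```

A full tree simply recurses on the left and right partitions until a depth limit or purity criterion is met, which is why depth is the main parameter to tune.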
This recipe shows how to run the decision tree on web content, evaluating a large set of URLs and classifying them as ephemeral (that is, short-lived and soon to cease being popular) or evergreen (that is, remaining popular for a long time). The decision tree implementation is available in the Spark MLlib package. The code is written in Scala.
Recommendation engines are a popular application of machine learning algorithms. People have likely experienced them on popular websites such as Amazon, Netflix, YouTube, Twitter, LinkedIn, and Facebook. The idea behind recommendation engines is to predict what people might like and to uncover relationships between items to aid the discovery process.
Recommender systems are widely studied, and there are many approaches, such as content-based filtering and collaborative filtering. Other approaches, such as ranking models, have also gained popularity. Since Spark's recommendation models include only an implementation of matrix factorization, this recipe shows how to run matrix factorization on rating datasets from the MovieLens website.
The algorithm is available in the Spark MLlib package. The code is written in Scala.
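Matrix factorization represents each user and each item as a small latent-factor vector, so that a rating is approximated by their dot product. As a self-contained illustration (not the recipe's Scala code), here is a plain-Python sketch that fits such factors by stochastic gradient descent on a few made-up MovieLens-style triples; Spark's ALS solves the same model, but by alternating least squares over a cluster:

```python
import random

def factorize(ratings, n_users, n_items, k=2, steps=2000, lr=0.01, reg=0.05):
    """Fit rank-k user/item factor matrices U, V so that U[u] . V[i]
    approximates each observed rating, via regularized SGD."""
    random.seed(42)
    U = [[random.uniform(0, 0.5) for _ in range(k)] for _ in range(n_users)]
    V = [[random.uniform(0, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)  # gradient step on U
                V[i][f] += lr * (err * uf - reg * vf)  # gradient step on V
    return U, V

# Invented (user, item, rating) triples in MovieLens style.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 4.0), (2, 2, 5.0)]
U, V = factorize(ratings, n_users=3, n_items=3)
pred = sum(U[0][f] * V[0][f] for f in range(2))
print(pred)  # should land close to the observed rating of 5.0
```

Once the factors are learned, recommending is just scoring unrated items for a user with the same dot product and returning the top-scoring ones.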
In this recipe, let's see how to run the logistic regression algorithm using Spark ML pipelines.
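Before turning to the pipeline itself, it may help to see what the model does. This is an illustrative pure-Python sketch of binary logistic regression trained by gradient descent on invented, linearly separable data (Spark ML's LogisticRegression fits the same model, just distributed and with more robust optimizers):

```python
import math

def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by per-sample gradient descent on the
    logistic (log-loss) objective."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xs, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xs))
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid: predicted P(y=1)
            g = p - yi                      # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xs)]
            b -= lr * g
    return w, b

def predict(w, b, xs):
    """Classify by the sign of the linear score."""
    z = b + sum(wj * xj for wj, xj in zip(w, xs))
    return 1 if z > 0 else 0

# Invented linearly separable toy data.
X = [(0.0, 1.0), (1.0, 1.5), (3.0, 4.0), (4.0, 5.0)]
y = [0, 0, 1, 1]
w, b = train_logreg(X, y)
print([predict(w, b, xs) for xs in X])  # → [0, 0, 1, 1]
```

In the Spark ML pipeline version, feature extraction stages feed a DataFrame of feature vectors into the LogisticRegression estimator, and fitting produces a model transformer that appends a prediction column.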
To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also, include the Spark MLlib package in the build.sbt file so that it downloads the related libraries and the API can be used. Install Scala, Java, and, optionally, Hadoop.
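A minimal build.sbt fragment for pulling in MLlib might look like the following; the version numbers here are examples only, and should be matched to the Spark and Scala versions actually installed on your cluster:

```scala
// build.sbt (sketch): versions shown are examples, not prescriptions.
name := "spark-mllib-recipes"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.0.0" % "provided"
)
```

The "provided" scope keeps the Spark jars out of your assembled application jar, since the cluster supplies them at runtime.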
Please download the dataset from the following location: https://github.com/ChitturiPadma/datasets/blob/master/Community_Dataset.csv.