Chapter 5. Learning from Data

In this chapter, we will cover the following recipes:

  • Predicting continuous values using linear regression

  • Binary classification using LogisticRegression and SVM

  • Binary classification using LogisticRegression with the Pipeline API

  • Clustering using K-means

  • Feature reduction using principal component analysis

Introduction


In previous chapters, we saw how to load, prepare, and visualize data. Now, let's start doing some interesting stuff with it. In this chapter, we'll apply various machine learning techniques to that data. We'll look at a few examples from the two broad categories of machine learning techniques: supervised and unsupervised learning. Before that, however, let's briefly see what these terms mean.

Supervised and unsupervised learning


If you are reading this book, you probably already know what supervised and unsupervised learning are, but for the sake of completeness, let's briefly summarize what they mean. In supervised learning, we train the algorithms with labeled data. Labeled data is nothing but input data along with the outcome variable. For example, if our intention is to predict whether a website is about news, we would prepare a sample dataset of website content with "news" and "not news" as labels. This dataset is called the training dataset.

With supervised learning, our end goal is to use the training dataset to come up with a function that maps our input variables to an output variable with the least margin of error. We call the input variables (or x variables) features or explanatory variables, and the output variable (also known as the y variable or label) the target or dependent variable. In the news website example, the text content of the website would be the input variable...
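To make this concrete, here is a minimal sketch (not from the book's recipes) of how a labeled example could be represented with Spark MLlib's LabeledPoint, assuming the website's text has already been turned into numeric feature values; the feature numbers below are just placeholders:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// One labeled training example: the label (1.0 = "news", 0.0 = "not news")
// paired with a feature vector extracted from the website's content.
val newsExample    = LabeledPoint(1.0, Vectors.dense(0.8, 0.1, 0.0))
val notNewsExample = LabeledPoint(0.0, Vectors.dense(0.05, 0.7, 0.2))

// A training set is simply a collection (typically an RDD) of such points.
val trainingSet = Seq(newsExample, notNewsExample)
```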

Gradient descent


With supervised learning, in order for the algorithm to learn the relationship between the input and the output features, we provide a set of manually curated values for the target variable (y) against a set of input variables (x). We call it the training set. The learning algorithm then has to go over our training set, perform some optimization, and come up with a model that has the least cost, that is, the smallest deviation from the true values. So, technically, we have two algorithms for every learning problem: an algorithm that comes up with the function and (an initial set of) weights for each of the x features, and a supporting algorithm (also called a cost minimization or optimization algorithm) that looks at our function parameters (the feature weights) and tries to minimize the cost as much as possible.

There are a variety of cost minimization algorithms, but one of the most popular is gradient descent. Imagine gradient descent as climbing down a mountain. The height of the mountain represents...
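To make the idea concrete, here is a minimal, illustrative sketch of batch gradient descent in plain Scala for a single-feature linear model; the function name, learning rate, and iteration count are arbitrary choices for the sketch, not values from the book:

```scala
// Batch gradient descent for a single-feature linear model y ≈ w * x + b,
// minimizing the mean squared error over the training set.
def gradientDescent(xs: Array[Double], ys: Array[Double],
                    alpha: Double = 0.01, iterations: Int = 1000): (Double, Double) = {
  val n = xs.length
  var (w, b) = (0.0, 0.0)
  for (_ <- 1 to iterations) {
    // Gradient of (1/n) * sum((w*x + b - y)^2) with respect to w and b
    val errors = xs.zip(ys).map { case (x, y) => (w * x + b) - y }
    val gradW  = 2.0 / n * errors.zip(xs).map { case (e, x) => e * x }.sum
    val gradB  = 2.0 / n * errors.sum
    // Take a step downhill, scaled by the learning rate alpha
    w -= alpha * gradW
    b -= alpha * gradB
  }
  (w, b)
}
```

Each iteration moves the weights a small step in the direction that reduces the squared error the most, which is exactly the "climbing down" that the mountain analogy describes.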

Predicting continuous values using linear regression


At the risk of stating the obvious, linear regression aims to model the relationship between an output (y) and an input (x) using a mathematical model that is linear in the input variables. The output variable, y, is a continuous numerical value. If we have more than one input/explanatory variable (x), as in the example that we are going to see, we call it multiple linear regression. The dataset that we'll use for this recipe, for lack of creativity, is lifted from the UCI website at http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/. This dataset has 1599 instances of various red wines, their chemical composition, and their quality. We'll use it to predict the quality of a red wine.

How to do it...

Let's summarize the steps (a short code sketch covering them follows the list):

  1. Importing the data.

  2. Converting each instance into a LabeledPoint.

  3. Preparing the training and test data.

  4. Scaling the features.

  5. Training the model.

  6. Predicting against the test data.

  7. Evaluating the model...
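The following is a rough sketch of how these steps could look with Spark's RDD-based MLlib API (as it existed around Spark 1.x, the version the book targets). The file name, header handling, split ratio, and optimizer settings are assumptions for the sketch, and `sc` is the SparkContext available in the Spark shell:

```scala
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// 1. Import the data (the UCI file is semicolon-separated with a quoted header line).
val raw  = sc.textFile("winequality-red.csv")
val rows = raw.filter(!_.startsWith("\"")).map(_.split(";").map(_.toDouble))

// 2. Convert each instance into a LabeledPoint (the last column is the quality score).
val points = rows.map(r => LabeledPoint(r.last, Vectors.dense(r.init)))

// 3. Prepare the training and test data.
val Array(training, test) = points.randomSplit(Array(0.8, 0.2), seed = 42L)

// 4. Scale the features to zero mean and unit variance.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(training.map(_.features))
val scaledTraining = training.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()
val scaledTest     = test.map(p => LabeledPoint(p.label, scaler.transform(p.features)))

// 5. Train the model.
val algorithm = new LinearRegressionWithSGD().setIntercept(true)
algorithm.optimizer.setNumIterations(100).setStepSize(0.1)
val model = algorithm.run(scaledTraining)

// 6. Predict against the test data.
val predictionsAndLabels = scaledTest.map(p => (model.predict(p.features), p.label))

// 7. Evaluate the model.
val metrics = new RegressionMetrics(predictionsAndLabels)
println(s"Mean squared error: ${metrics.meanSquaredError}")
```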

Binary classification using LogisticRegression and SVM


Unlike linear regression, wherein we predicted continuous values for the outcome (the y variable), logistic regression and the Support Vector Machine (SVM) are used to predict one out of n possible values for the outcome. If the outcome is one of two possibilities, the task is called binary classification.

Logistic regression, when used for binary classification, looks at each data point and estimates the probability of that data point falling under the positive case. If the probability is less than a threshold, then the outcome is negative (or 0); otherwise, the outcome is positive (or 1).
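In the standard formulation, this probability is modeled by the logistic (sigmoid) function of a weighted sum of the features; with a weight vector $w$, a bias term $b$, and a feature vector $x$:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(w \cdot x + b)}}$$

A threshold of 0.5 is the usual default.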

As with any other supervised learning technique, we will be providing training examples for logistic regression. We then add a bit of code for feature extraction and let the algorithm create a model that captures how the features relate to the probability of each of the two outcomes.
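As an illustration (not the book's exact code), here is a minimal sketch using MLlib's RDD-based classifiers. It assumes the feature extraction step has already produced RDD[LabeledPoint] training and test sets with labels 1.0 and 0.0, and the helper name trainAndEvaluate is just for the sketch:

```scala
import org.apache.spark.mllib.classification.{ClassificationModel, LogisticRegressionWithLBFGS, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// `training` and `test` are RDD[LabeledPoint]s with labels 1.0 and 0.0,
// produced by whatever feature extraction precedes this step.
def trainAndEvaluate(training: RDD[LabeledPoint], test: RDD[LabeledPoint]): Unit = {
  val lrModel  = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
  val svmModel = SVMWithSGD.train(training, 100) // 100 iterations of SGD

  // Clearing the threshold makes predict() return raw scores instead of 0/1,
  // which is what the ROC-based metric below expects.
  lrModel.clearThreshold()
  svmModel.clearThreshold()

  val models: Seq[(String, ClassificationModel)] =
    Seq("LogisticRegression" -> lrModel, "SVM" -> svmModel)

  for ((name, model) <- models) {
    val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println(s"$name area under ROC: ${metrics.areaUnderROC()}")
  }
}
```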

What SVM tries...

Binary classification using LogisticRegression with the Pipeline API


Earlier, with the spam example on binary classification, we saw how we prepared the data, separated it into training and test data, trained the model, and evaluated it against the test data before we finally arrived at the metrics. This series of steps can be abstracted and simplified using Spark's Pipeline API.

In this recipe, we'll take a look at how to use the Pipeline API to solve the same classification problem. Imagine the pipeline to be a factory assembly line where things happen one after another. In our case, we'll pass our raw unprocessed data through various processors before we finally feed the data into the classifier.
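Here is a minimal sketch of what such a pipeline could look like with the spark.ml API. The column names, and the assumption that `training` and `test` are DataFrames with "text" and "label" columns, are illustrative choices rather than the book's code:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// `training` and `test` are assumed to be DataFrames with a "text" column
// (the raw SMS message) and a numeric "label" column (1.0 = spam, 0.0 = ham).
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// The stages run one after another, like stations on an assembly line.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// The fitted PipelineModel replays the same steps on unseen data.
val predictions = model.transform(test).select("text", "probability", "prediction")
```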

How to do it...

In this recipe, we'll classify the same spam/ham dataset (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) first using the plain Pipeline, and then using a cross-validator to select the best model for us given a grid of parameters.
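For the cross-validation part, a sketch along these lines is typical. It reuses the `pipeline`, `hashingTF`, and `lr` stages from the previous sketch, and the parameter values in the grid are arbitrary examples:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Candidate parameter combinations to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val crossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator()) // defaults to area under ROC
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

// fit() trains one model per parameter combination per fold and keeps the best one.
val bestModel = crossValidator.fit(training)
```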

Let's summarize the steps:

  1. Importing and...

Clustering using K-means


Clustering is a class of unsupervised learning algorithms wherein the dataset is partitioned into a finite number of clusters in such a way that the points within a cluster are similar to each other in some way. This, intuitively, also means that the points of two different clusters should be dissimilar.

K-means is one of the popular clustering algorithms, and in this recipe, we'll be looking at how Spark implements K-means and how to use the algorithm to cluster a sample dataset. Since the number of clusters is a crucial input for the K-means algorithm, we'll also see the most common method of arriving at the optimal number of clusters for the data.
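As a rough sketch of what this looks like with MLlib (where the input is assumed to be an RDD[Vector] of, ideally, scaled feature vectors), training a model and computing its cost for a range of k values is the basis of the commonly used elbow method:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def clusterAndReportCosts(data: RDD[Vector], maxIterations: Int = 50): Unit = {
  // Train a single model and report its Within Set Sum of Squared Errors (WSSSE),
  // the cost that K-means minimizes.
  val model = KMeans.train(data, 3, maxIterations)
  println(s"k = 3, WSSSE = ${model.computeCost(data)}")

  // Elbow method: compute the cost for a range of k values and look for the
  // point where the improvement flattens out.
  (2 to 10).foreach { k =>
    val cost = KMeans.train(data, k, maxIterations).computeCost(data)
    println(s"k = $k, cost = $cost")
  }
}
```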

How to do it...

Spark provides two modes for cluster center (centroid) initialization: the random initialization used in the original Lloyd's method (https://en.wikipedia.org/wiki/K-means_clustering), and a parallelizable and scalable variant of K-means++ (https://en.wikipedia.org/wiki/K-means%2B%2B). K-means++ itself is a variant of the...

Feature reduction using principal component analysis


As the curse of dimensionality (https://en.wikipedia.org/wiki/Curse_of_dimensionality) suggests, a large number of features is computationally expensive. One way of reducing the number of features is to manually choose and ignore certain features. However, identifying features that carry the same information (represented differently) or are highly correlated is laborious when we have a huge number of features. Dimensionality reduction aims to reduce the number of features in the data while still retaining its variability.

Say we have a dataset of housing prices with two features that represent the area of the house in square feet and in square meters; we can always drop one of the two. Dimensionality reduction is very useful when dealing with text, where the number of features easily runs into the thousands.
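As a sketch of how this can be done with MLlib's RowMatrix (assuming the input is an RDD[Vector] of feature vectors, preferably centered and scaled so that no single feature dominates the variance):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Reduce the rows of `vectors` to their top `k` principal components.
def reduceDimensions(vectors: RDD[Vector], k: Int): RowMatrix = {
  val matrix = new RowMatrix(vectors)
  // A (numFeatures x k) matrix whose columns are the directions of greatest variance.
  val principalComponents = matrix.computePrincipalComponents(k)
  // Project the original rows onto those directions, yielding k features per row.
  matrix.multiply(principalComponents)
}
```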

In this recipe, we'll be looking into Principal Component Analysis (PCA) as a means to reduce the dimensions of data that is meant for both supervised...
