You're reading from  Machine Learning with Scala Quick Start Guide

Product type: Book
Published in: Apr 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789345070
Edition: 1st Edition
Authors (2):
Md. Rezaul Karim

Md. Rezaul Karim is a researcher, author, and data science enthusiast with a strong computer science background, coupled with 10 years of research and development experience in machine learning, deep learning, and data mining algorithms to solve emerging bioinformatics research problems by making them explainable. He is passionate about applied machine learning, knowledge graphs, and explainable artificial intelligence (XAI). Currently, he is working as a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead software engineer at Samsung Electronics, Korea.

Ajay Kumar N

Ajay Kumar N has experience in big data, and specializes in cloud computing and various big data frameworks, including Apache Spark and Apache Hadoop. His primary language of choice is Python, but he also has a special interest in functional programming languages such as Scala. He has worked extensively with NumPy, pandas, and scikit-learn, and often contributes to open source projects related to data science and machine learning.

Scala for Regression Analysis

In this chapter, we will cover regression analysis in detail. We start with the regression analysis workflow, followed by the linear regression (LR) and generalized linear regression (GLR) algorithms. Then we develop a regression model for predicting slowness in traffic using the LR and GLR algorithms and their Spark ML-based implementations in Scala. Finally, we look at hyperparameter tuning with cross-validation and grid search. In short, this end-to-end project covers the following topics:

  • An overview of regression analysis
  • Regression analysis algorithms
  • Learning regression analysis through examples
  • Linear regression
  • Generalized linear regression
  • Hyperparameter tuning and cross-validation

Technical requirements

An overview of regression analysis

In the previous chapter, we gained a basic understanding of the machine learning (ML) process and saw the basic distinction between regression and classification. Regression analysis is a set of statistical processes for estimating the relationship between a dependent variable and one or more independent variables. The value of the dependent variable depends on the values of the independent variables.

A regression analysis technique helps us to understand this dependency, that is, how the value of the dependent variable changes when any one of the independent variables is changed while the other independent variables are held fixed. For example, let's assume that the savings in someone's bank account grow as they get older. Here, the amount of Savings (say, in millions of $) depends on age...

Regression analysis algorithms

Numerous algorithms have been proposed for regression analysis. For example, LR tries to find relationships and dependencies between variables: it models the relationship between a continuous dependent variable, y (that is, a label or target), and one or more independent variables, x, using a linear function. Examples of regression algorithms include the following:

  • Linear regression (LR)
  • Generalized linear regression (GLR)
  • Survival regression (SR)
  • Isotonic regression (IR)
  • Decision tree regressor (DTR)
  • Random forest regression (RFR)
  • Gradient boosted trees regression (GBTR)

We start by explaining regression with the simplest algorithm, LR, which models a dependent variable, y, as a linear combination of independent variables, x:
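A common way to write this model, assuming m independent variables x_1, ..., x_m and an error term ε (the notation is chosen here for illustration), is:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m + \epsilon
```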

In the preceding equation letters, β0...
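To make the roles of the intercept and the slope concrete, here is a minimal plain-Scala sketch of ordinary least squares for a single independent variable. It is not the chapter's Spark ML code: the object name `SimpleLinearRegression`, its methods, and the age/savings numbers are hypothetical and exist only for illustration.

```scala
object SimpleLinearRegression {
  /** Fits y = beta0 + beta1 * x by ordinary least squares; returns (beta0, beta1). */
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.length == ys.length && xs.length >= 2, "need at least two paired points")
    val n = xs.length.toDouble
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Unnormalised covariance of x and y, and unnormalised variance of x
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx > 0.0, "x values must not all be identical")
    val beta1 = sxy / sxx              // slope
    val beta0 = meanY - beta1 * meanX  // intercept
    (beta0, beta1)
  }

  /** Predicts y for a new x using the fitted intercept and slope. */
  def predict(beta0: Double, beta1: Double, x: Double): Double = beta0 + beta1 * x

  def main(args: Array[String]): Unit = {
    // Hypothetical toy data: age in years versus savings in arbitrary units
    val ages = Seq(20.0, 30.0, 40.0, 50.0)
    val savings = Seq(1.0, 2.0, 3.0, 4.0)
    val (b0, b1) = fit(ages, savings)
    println(f"intercept = $b0%.3f, slope = $b1%.3f")
  }
}
```

The slope is the estimated change in the dependent variable per unit change in the independent variable, which is the dependency the Age versus Savings example describes.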

Learning regression analysis through examples

In the previous section, we discussed a simple real-life problem (that is, Age versus Savings). However, in practice, there are several real-life problems where more factors and parameters (that is, data properties) are involved, where regression can be applied too. Let's first introduce a real-life problem. Imagine that you live in Sao Paulo, a city in Brazil, where every day several hours of your valuable time are wasted because of unavoidable reasons such as an immobilized bus, broken truck, vehicle excess, accident victim, overtaking, fire vehicles, incident involving dangerous freight, lack of electricity, fire, and flooding.

Now, to measure how many man hours get wasted, we can develop an automated technique that predicts the slowness of traffic so that you can avoid certain routes or at least get some rough estimation...

Linear regression

In this section, we will develop a predictive analytics model for predicting slowness in traffic for each row of the data using an LR algorithm. First, we create an LR estimator as follows:

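import org.apache.spark.ml.regression.LinearRegression

// LR estimator that reads the feature vector and the target from these columns
val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")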

Then we invoke the fit() method to perform the training on the training set as follows:

println("Building ML regression model")
val lrModel = lr.fit(trainingData)

Now that the model is fitted, it can make predictions. So, let's evaluate it on the training and validation sets and calculate regression metrics such as RMSE, MSE, MAE, and R squared:

println("Evaluating the model on the test set and calculating the regression metrics")
// **********************************************************************
val trainPredictionsAndLabels...
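For reference, the metrics named above can be written out directly from their definitions. The following plain-Scala sketch works on in-memory sequences rather than Spark DataFrames; the object name `PlainRegressionMetrics` and the `Metrics` case class are hypothetical and are not part of Spark or of the chapter's code.

```scala
object PlainRegressionMetrics {
  final case class Metrics(mse: Double, rmse: Double, mae: Double, r2: Double)

  /** Computes MSE, RMSE, MAE, and R squared for paired predictions and labels. */
  def compute(predictions: Seq[Double], labels: Seq[Double]): Metrics = {
    require(predictions.length == labels.length && labels.nonEmpty,
      "predictions and labels must be non-empty and of equal length")
    val n = labels.length.toDouble
    // Residuals: label minus prediction
    val errors = predictions.zip(labels).map { case (p, y) => y - p }
    val ssRes = errors.map(e => e * e).sum            // residual sum of squares
    val mse = ssRes / n                               // mean squared error
    val mae = errors.map(e => math.abs(e)).sum / n    // mean absolute error
    val meanLabel = labels.sum / n
    val ssTot = labels.map(y => (y - meanLabel) * (y - meanLabel)).sum
    // R squared is undefined when all labels are equal; NaN signals that case
    val r2 = if (ssTot == 0.0) Double.NaN else 1.0 - ssRes / ssTot
    Metrics(mse, math.sqrt(mse), mae, r2)
  }
}
```

Lower RMSE, MSE, and MAE indicate a closer fit, while an R squared closer to 1 means the model explains more of the variance in the labels.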

Generalized linear regression (GLR)

In LR, the output is assumed to follow a Gaussian distribution. In contrast, in generalized linear models (GLMs), the response variable Yi follows a distribution from a parametric family of probability distributions of a certain form (the exponential family). Following the previous example, creating a GLR estimator is straightforward:

import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian") // continuous value prediction (or "gamma")
  .setLink("identity") // continuous value prediction (or "inverse")
  .setFeaturesCol("features")
  .setLabelCol("label")
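The link setting above determines how the linear predictor is mapped to the predicted mean. As a rough plain-Scala illustration (not Spark's implementation; the object `GlmLinks` and its members are hypothetical names), a link function g and its inverse can be sketched like this:

```scala
object GlmLinks {
  /** A link function g maps the mean mu to the linear predictor eta; unlink is its inverse. */
  sealed trait Link {
    def link(mu: Double): Double
    def unlink(eta: Double): Double
  }

  case object Identity extends Link {
    def link(mu: Double): Double = mu
    def unlink(eta: Double): Double = eta
  }

  case object Log extends Link {
    def link(mu: Double): Double = math.log(mu)
    def unlink(eta: Double): Double = math.exp(eta)
  }

  case object Inverse extends Link {
    def link(mu: Double): Double = 1.0 / mu
    def unlink(eta: Double): Double = 1.0 / eta
  }

  /** Predicted mean for one row: mu = g^{-1}(intercept + sum_j coefficient_j * feature_j). */
  def predictMean(g: Link, intercept: Double,
                  coefficients: Seq[Double], features: Seq[Double]): Double = {
    require(coefficients.length == features.length, "one coefficient per feature")
    val eta = intercept + coefficients.zip(features).map { case (b, x) => b * x }.sum
    g.unlink(eta)
  }
}
```

With the identity link the predicted mean equals the linear predictor, which is why the "gaussian" family with the "identity" link behaves like ordinary LR; the inverse link is the usual pairing for the gamma family.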

For the GLR-based prediction, the following response and identity link functions are supported based on data types (source: https://spark.apache.org/docs/latest/ml-classification-regression.html#generalized-linear-regression...

Hyperparameter tuning and cross-validation

In machine learning, the term hyperparameter refers to those parameters that cannot be learned from the regular training process directly. These are the various knobs that you can tweak on your machine learning algorithms. Hyperparameters are usually decided by training the model with different combinations of the parameters and deciding which ones work best by testing them. Ultimately, the combination that provides the best model would be our final hyperparameters. Setting hyperparameters can have a significant influence on the performance of the trained models.
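The idea of trying several hyperparameter values and keeping the best one can be sketched in a few lines of plain Scala. The example below is an illustration under stated assumptions, not the chapter's Spark code: it uses a one-feature ridge regression without an intercept, where the regularization strength lambda is the hyperparameter, and all names (`GridSearchSketch`, `fitRidge`, `gridSearch`) are hypothetical.

```scala
object GridSearchSketch {
  /** One-feature ridge regression without intercept: beta = sum(x*y) / (sum(x*x) + lambda). */
  def fitRidge(train: Seq[(Double, Double)], lambda: Double): Double = {
    require(lambda >= 0.0, "lambda must be non-negative")
    val sxy = train.map { case (x, y) => x * y }.sum
    val sxx = train.map { case (x, _) => x * x }.sum
    sxy / (sxx + lambda)
  }

  /** Root mean squared error of the model y = beta * x on the given data. */
  def rmse(beta: Double, data: Seq[(Double, Double)]): Double = {
    val squaredErrors = data.map { case (x, y) =>
      val e = y - beta * x
      e * e
    }
    math.sqrt(squaredErrors.sum / data.length)
  }

  /** Fits one model per candidate lambda, scores each on the validation set,
    * and returns the (lambda, RMSE) pair with the lowest validation RMSE. */
  def gridSearch(train: Seq[(Double, Double)],
                 validation: Seq[(Double, Double)],
                 lambdas: Seq[Double]): (Double, Double) = {
    require(train.nonEmpty && validation.nonEmpty && lambdas.nonEmpty, "inputs must be non-empty")
    lambdas
      .map(lambda => (lambda, rmse(fitRidge(train, lambda), validation)))
      .minBy(_._2)
  }
}
```

Note that lambda is never learned by `fitRidge` itself; it is chosen from outside by comparing validation scores, which is what distinguishes a hyperparameter from a model parameter such as beta.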

Cross-validation is often used in conjunction with hyperparameter tuning. Cross-validation (also known as rotation estimation) is a model validation technique for assessing the quality of a statistical analysis and its results. Cross-validation helps to describe...
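As a minimal sketch of the mechanics, k-fold cross-validation splits the data into k parts, holds out each part once for validation, and averages the scores. The plain-Scala code below assumes the data has already been shuffled and uses hypothetical names (`KFoldSketch`, `kFoldIndices`, `crossValidate`); it is not Spark's implementation.

```scala
object KFoldSketch {
  /** Splits the indices 0 until n into k folds.
    * Returns one (trainIndices, validationIndices) pair per fold. */
  def kFoldIndices(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    require(k >= 2 && k <= n, "k must be between 2 and n")
    val indices = 0 until n
    (0 until k).map { fold =>
      // Index i belongs to the validation set of fold (i % k)
      val (validation, train) = indices.partition(_ % k == fold)
      (train, validation)
    }
  }

  /** Averages a score over the k folds. The `score` function is expected to train
    * on its first argument and return an evaluation metric on its second. */
  def crossValidate[A](data: IndexedSeq[A], k: Int)(score: (Seq[A], Seq[A]) => Double): Double = {
    val scores = kFoldIndices(data.length, k).map { case (trainIdx, validationIdx) =>
      score(trainIdx.map(data), validationIdx.map(data))
    }
    scores.sum / scores.length
  }
}
```

Combined with a grid of candidate hyperparameter values, the averaged score from `crossValidate` can replace a single validation score, so that each candidate is judged on every part of the data rather than on one fixed split.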

Summary

In this chapter, we have seen how to develop a regression model for predicting slowness in traffic using LR and GLR algorithms. We have also seen how to boost the performance of the GLR model using cross-validation and grid search techniques, which give the best combination of hyperparameters. Finally, we have seen some frequently asked questions so that similar regression techniques can be applied to other real-life problems.

In the next chapter, we will see another supervised learning technique, classification, through a real-life problem: predicting customer churn. Several classification algorithms will be used for making the prediction in Scala. Churn prediction is essential for businesses as it helps you detect customers who are likely to cancel a subscription, product, or service, which also minimizes customer...
