You're reading from Machine Learning with Scala Quick Start Guide

Product type: Book
Published in: Apr 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789345070
Edition: 1st Edition
Languages: Scala
Concepts: Machine Learning
Authors (2): Md. Rezaul Karim, Ajay Kumar N


Scala for Regression Analysis

In this chapter, we will learn regression analysis in detail. We will start with the regression analysis workflow, followed by the linear regression (LR) and generalized linear regression (GLR) algorithms. Then we will develop a regression model for predicting slowness in traffic using the LR and GLR algorithms and their Spark ML-based implementations in Scala. Finally, we will learn hyperparameter tuning with cross-validation and grid search. In short, we will cover the following topics throughout this end-to-end project:

An overview of regression analysis
Regression analysis algorithms
Learning regression analysis through examples
Linear regression
Generalized linear regression
Hyperparameter tuning and cross-validation

Technical requirements

Make sure Scala 2.11.x and Java 1.8.x are installed and configured on your machine.

The code files for this chapter can be found on GitHub:

https://github.com/PacktPublishing/Machine-Learning-with-Scala-Quick-Start-Guide/tree/master/Chapter02

Check out the following video to see the Code in Action:

http://bit.ly/2GLlQTl

An overview of regression analysis

In the previous chapter, we gained a basic understanding of the machine learning (ML) process and saw the basic distinction between regression and classification. Regression analysis is a set of statistical processes for estimating the relationship between one variable, called the dependent variable, and one or more independent variables. The value of the dependent variable depends on the values of the independent variables.

A regression analysis technique helps us to understand this dependency, that is, how the value of the dependent variable changes when any one of the independent variables is changed while the other independent variables are held fixed. For example, let's assume that people accumulate more savings in their bank accounts as they grow older. Here, the amount of Savings (say in million $) depends on age...
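The Age versus Savings dependency can be sketched in a few lines of plain Scala. This is a minimal illustration only: the data points are invented, and the object and method names (`AgeSavingsSketch`, `fit`) are hypothetical. It uses the closed-form least-squares solution for a single independent variable, so the fitted slope reads as the expected change in savings for each additional year of age.

```scala
object AgeSavingsSketch {
  // Hypothetical (age in years, savings in million $) pairs, invented for illustration
  val data: Seq[(Double, Double)] = Seq(
    (25.0, 0.5), (35.0, 1.0), (45.0, 1.5), (55.0, 2.0)
  )

  // Closed-form least squares for one independent variable:
  // slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
  def fit(points: Seq[(Double, Double)]): (Double, Double) = {
    val n = points.size.toDouble
    val meanX = points.map(_._1).sum / n
    val meanY = points.map(_._2).sum / n
    val sxy = points.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = points.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    val slope = sxy / sxx
    (meanY - slope * meanX, slope) // (intercept, slope)
  }

  def main(args: Array[String]): Unit = {
    val (intercept, slope) = fit(data)
    println(f"savings = $intercept%.3f + $slope%.3f * age")
  }
}
```

With more than one independent variable the same idea applies, but each coefficient then describes the effect of its variable while the others are held fixed.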

Regression analysis algorithms

Numerous algorithms have been proposed for regression analysis. For example, LR tries to find relationships and dependencies between variables. It models the relationship between a continuous dependent variable, y (that is, a label or target), and one or more independent variables, x, using a linear function. Examples of regression algorithms include the following:

Linear regression (LR)
Generalized linear regression (GLR)
Survival regression (SR)
Isotonic regression (IR)
Decision tree regressor (DTR)
Random forest regression (RFR)
Gradient boosted trees regression (GBTR)

We start by explaining regression with the simplest algorithm, LR, which models the dependent variable, y, as a linear combination of the independent variables, x:
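A linear model of this kind is commonly written in the following standard textbook form. This form is an assumption here, so the symbols may differ from the authors' own notation; n denotes the number of independent variables and ε the error term:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon
$$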

In the preceding equation letters, β₀...

Learning regression analysis through examples

In the previous section, we discussed a simple real-life problem (that is, Age versus Savings). However, in practice, there are several real-life problems where more factors and parameters (that is, data properties) are involved, where regression can be applied too. Let's first introduce a real-life problem. Imagine that you live in Sao Paulo, a city in Brazil, where every day several hours of your valuable time are wasted because of unavoidable reasons such as an immobilized bus, broken truck, vehicle excess, accident victim, overtaking, fire vehicles, incident involving dangerous freight, lack of electricity, fire, and flooding.

Now, to measure how many man-hours get wasted, we can develop an automated technique that predicts the slowness of traffic so that you can avoid certain routes or at least get some rough estimation...

Linear regression

In this section, we will develop a predictive analytics model for predicting slowness in traffic for each row of the data using an LR algorithm. First, we create an LR estimator as follows:

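import org.apache.spark.ml.regression.LinearRegression

// Create an LR estimator that reads the feature vector and the target column
val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")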

Then we invoke the fit() method to perform the training on the training set as follows:

println("Building ML regression model")
val lrModel = lr.fit(trainingData)

Now we have the fitted model, which means it is now capable of making predictions. So, let's start evaluating the model on the training and validation sets and calculating the RMSE, MSE, MAE, R squared, and so on:

println("Evaluating the model on the test set and calculating the regression metrics")
// **********************************************************************
val trainPredictionsAndLabels...
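To make the four metrics concrete, here is a minimal plain-Scala sketch of how they can be computed from (prediction, label) pairs. The object name `RegressionMetricsSketch` and the sample pairs are hypothetical; in Spark ML itself, `RegressionEvaluator` reports the same quantities under the metric names `rmse`, `mse`, `mae`, and `r2`.

```scala
object RegressionMetricsSketch {
  // Each pair is (prediction, label)
  type Pairs = Seq[(Double, Double)]

  // Mean squared error: average of squared residuals
  def mse(pairs: Pairs): Double =
    pairs.map { case (p, y) => (p - y) * (p - y) }.sum / pairs.size

  // Root mean squared error: same unit as the label
  def rmse(pairs: Pairs): Double = math.sqrt(mse(pairs))

  // Mean absolute error: average of absolute residuals
  def mae(pairs: Pairs): Double =
    pairs.map { case (p, y) => math.abs(p - y) }.sum / pairs.size

  // R squared: 1 - (residual sum of squares / total sum of squares)
  def r2(pairs: Pairs): Double = {
    val meanY = pairs.map(_._2).sum / pairs.size
    val ssRes = pairs.map { case (p, y) => (y - p) * (y - p) }.sum
    val ssTot = pairs.map { case (_, y) => (y - meanY) * (y - meanY) }.sum
    1.0 - ssRes / ssTot
  }

  def main(args: Array[String]): Unit = {
    // Invented sample values, for illustration only
    val sample: Pairs = Seq((2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0))
    println(s"MSE=${mse(sample)} RMSE=${rmse(sample)} MAE=${mae(sample)} R2=${r2(sample)}")
  }
}
```

Lower RMSE, MSE, and MAE indicate a better fit, whereas R squared is better the closer it gets to 1.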

Generalized linear regression (GLR)

In LR, the output is assumed to follow a Gaussian distribution. In contrast, in generalized linear models (GLMs), the response variable Y_i follows a distribution from the exponential family, a parametric set of probability distributions of a certain form. As in the previous example, creating a GLR estimator is straightforward:

import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")   // continuous value prediction (or gamma)
  .setLink("identity")     // continuous value prediction (or inverse)
  .setFeaturesCol("features")
  .setLabelCol("label")

For GLR-based prediction, the following response families and link functions are supported, depending on the data type (source: https://spark.apache.org/docs/latest/ml-classification-regression.html#generalized-linear-regression...
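The role of a link function can be sketched in plain Scala. A link g relates the mean of the response to the linear predictor, and predictions are obtained by applying the inverse link. The names below (`LinkFunctionSketch`, `Link`, `predictMean`) are hypothetical and only three common links are shown; this illustrates the idea and does not reproduce Spark's implementation.

```scala
object LinkFunctionSketch {
  // A link g maps the mean of the response to the linear predictor: eta = g(mu).
  // The inverse link maps the linear predictor back to the mean: mu = g^-1(eta).
  final case class Link(name: String, link: Double => Double, inverse: Double => Double)

  val identityLink = Link("identity", mu => mu, eta => eta)
  val logLink      = Link("log", mu => math.log(mu), eta => math.exp(eta))
  val inverseLink  = Link("inverse", mu => 1.0 / mu, eta => 1.0 / eta)

  // Linear predictor: eta = intercept + sum of coefficient_i * x_i
  def linearPredictor(intercept: Double, coefficients: Seq[Double], x: Seq[Double]): Double =
    intercept + coefficients.zip(x).map { case (b, v) => b * v }.sum

  // Predicted mean of the response under the chosen link
  def predictMean(l: Link, intercept: Double, coefficients: Seq[Double], x: Seq[Double]): Double =
    l.inverse(linearPredictor(intercept, coefficients, x))
}
```

With the identity link the predicted mean equals the linear predictor, which is why the Gaussian family with the identity link behaves like ordinary linear regression.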

Hyperparameter tuning and cross-validation

In machine learning, the term hyperparameter refers to those parameters that cannot be learned from the regular training process directly. These are the various knobs that you can tweak on your machine learning algorithms. Hyperparameters are usually decided by training the model with different combinations of the parameters and deciding which ones work best by testing them. Ultimately, the combination that provides the best model would be our final hyperparameters. Setting hyperparameters can have a significant influence on the performance of the trained models.
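The distinction between a learned parameter and a hyperparameter can be illustrated with a small plain-Scala sketch. It assumes a one-variable ridge regression without an intercept, where the coefficient is learned from the data and the regularization strength lambda is the hyperparameter. The data, the candidate values, and all names (`GridSearchSketch`, `fitRidge`) are invented for illustration.

```scala
object GridSearchSketch {
  // Hypothetical (x, y) pairs, invented for illustration
  val train: Seq[(Double, Double)] = Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8))
  val validation: Seq[(Double, Double)] = Seq((5.0, 10.1), (6.0, 11.8))

  // One-variable ridge regression without intercept:
  // the coefficient is learned from the data, lambda is fixed before training
  def fitRidge(data: Seq[(Double, Double)], lambda: Double): Double = {
    val sxy = data.map { case (x, y) => x * y }.sum
    val sxx = data.map { case (x, _) => x * x }.sum
    sxy / (sxx + lambda)
  }

  // Root mean squared error of the model y = beta * x on the given data
  def rmse(data: Seq[(Double, Double)], beta: Double): Double =
    math.sqrt(data.map { case (x, y) => val e = y - beta * x; e * e }.sum / data.size)

  // Candidate hyperparameter values (a one-dimensional grid)
  val candidates: Seq[Double] = Seq(0.0, 0.1, 1.0, 10.0)

  // Train once per candidate and score each model on the held-out validation set
  val scored: Seq[(Double, Double)] =
    candidates.map(lambda => (lambda, rmse(validation, fitRidge(train, lambda))))

  // Keep the candidate with the lowest validation error
  val (bestLambda, bestRmse) = scored.minBy(_._2)
}
```

With several hyperparameters, the grid becomes the Cartesian product of the candidate values, and the same train-then-score loop runs once per combination.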

Cross-validation is often used in conjunction with hyperparameter tuning. Cross-validation (also known as rotation estimation) is a model validation technique for assessing the quality of the statistical analysis and results. Cross-validation helps to describe...
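In Spark ML, grid search and cross-validation are typically wired together with `ParamGridBuilder` and `CrossValidator`. The following is a minimal sketch, assuming the Spark MLlib library is on the classpath; the candidate values and the number of folds are illustrative assumptions rather than tuned recommendations, and the object name `CrossValidationSketch` is hypothetical.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object CrossValidationSketch {
  val glr = new GeneralizedLinearRegression()
    .setFamily("gaussian")
    .setLink("identity")
    .setFeaturesCol("features")
    .setLabelCol("label")

  // 3 values x 2 values = 6 hyperparameter combinations (illustrative values)
  val paramGrid = new ParamGridBuilder()
    .addGrid(glr.regParam, Array(0.001, 0.01, 0.1))
    .addGrid(glr.maxIter, Array(10, 50))
    .build()

  // Each combination is scored by RMSE on the held-out fold
  val evaluator = new RegressionEvaluator()
    .setLabelCol("label")
    .setPredictionCol("prediction")
    .setMetricName("rmse")

  // Each combination is trained and evaluated once per fold
  val cv = new CrossValidator()
    .setEstimator(glr)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(5)

  // With a DataFrame such as trainingData from the earlier sections:
  // val cvModel = cv.fit(trainingData)
}
```

Note that the cost grows with the number of combinations multiplied by the number of folds, so a large grid on a large dataset can take a long time to fit.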

Summary

In this chapter, we have seen how to develop a regression model for predicting slowness in traffic using the LR and GLR algorithms. We have also seen how to boost the performance of the GLR model using cross-validation and grid search techniques, which give the best combination of hyperparameters. Finally, we have seen some frequently asked questions so that similar regression techniques can be applied to other real-life problems.

In the next chapter, we will look at another supervised learning technique, classification, through a real-life problem: analyzing outgoing customers through churn prediction. Several classification algorithms will be used for making the prediction in Scala. Churn prediction is essential for businesses as it helps you detect customers who are likely to cancel a subscription, product, or service, which also minimizes customer...


Authors (2)

Md. Rezaul Karim

Md. Rezaul Karim is a researcher, author, and data science enthusiast with a strong computer science background, coupled with 10 years of research and development experience in machine learning, deep learning, and data mining algorithms to solve emerging bioinformatics research problems by making them explainable. He is passionate about applied machine learning, knowledge graphs, and explainable artificial intelligence (XAI). Currently, he is working as a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead software engineer at Samsung Electronics, Korea.

Ajay Kumar N

Ajay Kumar N has experience in big data, and specializes in cloud computing and various big data frameworks, including Apache Spark and Apache Hadoop. His primary language of choice is Python, but he also has a special interest in functional programming languages such as Scala. He has worked extensively with NumPy, pandas, and scikit-learn, and often contributes to open source projects related to data science and machine learning.
