You're reading from Machine Learning with Scala Quick Start Guide

Product type: Book
Published in: Apr 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789345070
Edition: 1st Edition
Authors (2):
Md. Rezaul Karim

Md. Rezaul Karim is a researcher, author, and data science enthusiast with a strong computer science background, coupled with 10 years of research and development experience in machine learning, deep learning, and data mining algorithms to solve emerging bioinformatics research problems by making them explainable. He is passionate about applied machine learning, knowledge graphs, and explainable artificial intelligence (XAI). Currently, he is working as a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead software engineer at Samsung Electronics, Korea.

Ajay Kumar N

Ajay Kumar N has experience in big data, and specializes in cloud computing and various big data frameworks, including Apache Spark and Apache Hadoop. His primary language of choice is Python, but he also has a special interest in functional programming languages such as Scala. He has worked extensively with NumPy, pandas, and scikit-learn, and often contributes to open source projects related to data science and machine learning.

Scala for Tree-Based Ensemble Techniques

In the previous chapters, we solved both regression and classification problems using linear models. We also used logistic regression, support vector machines, and Naive Bayes. In both cases, however, we did not achieve good accuracy, and our models showed low confidence.

Tree-based and tree ensemble models, on the other hand, are robust and widely used for both classification and regression tasks. This chapter provides a quick look at developing classifiers and regressors with tree-based and ensemble techniques, such as decision trees (DTs), random forests (RF), and gradient boosted trees (GBT). More specifically, we will revisit and solve both the regression (from Chapter 2, Scala for Regression Analysis) and classification (from Chapter 3, Scala for Learning...

Technical requirements

Decision trees and tree ensembles

DTs are a supervised learning technique used to solve classification and regression problems. As the name indicates, a DT has various branches, where each branch represents a possible decision, occurrence, or reaction in terms of statistical probability. As with other supervised models, the data is split into two parts, the training set and the test set: the tree is grown on the former, and its predicted labels or classes are checked against the latter.

DT algorithms can handle both binary and multiclass classification problems, which is one of the reasons they are used across so many problems. For instance, for the admission example we introduced in Chapter 3, Scala for Learning Classification, a DT learns a set of if...else decision rules from the admission data and uses them to approximate the target, as a diagram in the book illustrates.
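To make the idea of if...else decision rules concrete, here is a minimal plain-Scala sketch of a tree and its prediction walk. It is not the Spark ML implementation and the tree is not learned from data: the feature names (`gpa`, `gre`), the thresholds, and the labels are all made-up assumptions for illustration.

```scala
import scala.annotation.tailrec

object DecisionRules {
  // A tree is either a leaf holding a predicted label or an internal node
  // that compares one feature against a threshold.
  sealed trait Tree
  final case class Leaf(label: String) extends Tree
  final case class Node(feature: String, threshold: Double, left: Tree, right: Tree) extends Tree

  // Walk from the root: go left when the feature value is <= threshold,
  // otherwise go right, until a leaf is reached.
  @tailrec
  def predict(tree: Tree, x: Map[String, Double]): String = tree match {
    case Leaf(label)      => label
    case Node(f, t, l, r) => if (x(f) <= t) predict(l, x) else predict(r, x)
  }

  // Hypothetical hand-written tree; names and thresholds are assumptions,
  // not values taken from the book's admission dataset.
  val admission: Tree =
    Node("gpa", 3.0,
      Leaf("rejected"),
      Node("gre", 600.0, Leaf("rejected"), Leaf("admitted")))
}
```

Calling `DecisionRules.predict(DecisionRules.admission, Map("gpa" -> 3.5, "gre" -> 650.0))` follows the right branch twice and ends at the `admitted` leaf. A learning algorithm would choose the features and thresholds automatically; only the resulting rule structure is shown here.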

Generating...

Decision trees for supervised learning

In this section, we'll see how to use DTs to solve both regression and classification problems. In the previous two chapters, Chapter 2, Scala for Regression Analysis, and Chapter 3, Scala for Learning Classification, we solved the insurance severity claim and customer churn problems, which were regression and classification problems, respectively. In both cases, we used other classic models. Here, we'll see how to solve them with tree-based and ensemble techniques, using the DT implementation from the Apache Spark ML package in Scala.

Decision trees for classification

First of all, we know the customer churn prediction problem in Chapter 3, Scala for Learning...

Gradient boosted trees for supervised learning

In this section, we'll see how to use GBT to solve both regression and classification problems. In the previous two chapters, Chapter 2, Scala for Regression Analysis, and Chapter 3, Scala for Learning Classification, we solved the insurance severity claim and customer churn problems, which were regression and classification problems, respectively. In both cases, we used other classic models. Here, we'll see how to solve them with tree-based and ensemble techniques, using the GBT implementation from the Spark ML package in Scala.

Gradient boosted trees for classification

We know the customer churn prediction problem from Chapter 3, Scala for Learning...

Random forest for supervised learning

In this section, we'll see how to use RF to solve both regression and classification problems. We'll use the RF implementation from the Spark ML package in Scala. Although both GBT and RF are ensembles of trees, their training processes differ: RF uses bagging, training each tree on a bootstrap sample of the data, while GBT uses boosting. There are several practical trade-offs between the two ensembles, which can make it hard to choose between them. In most cases, however, RF is the better choice. Here are some justifications:

  • GBTs train one tree at a time, but RF can train multiple trees in parallel. So the training time is lower with RF. However, in some special cases, training and using a smaller number of trees with GBTs is faster and more convenient.
  • RFs are less prone to overfitting. In other words, RFs reduce...
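The difference between the two training processes can be sketched in plain Scala with one-feature regression stumps. This is a simplified illustration under stated assumptions, not Spark ML's RF or GBT: the names `Stump`, `fitStump`, `bagging`, and `boosting` are hypothetical, the bagging variant omits the random feature subsampling that a real random forest adds, and the boosting variant assumes squared-error loss with a fixed learning rate.

```scala
import scala.util.Random

object EnsembleSketch {
  type Point = (Double, Double) // (feature x, target y)

  // A regression stump: predicts `left` when x <= split, otherwise `right`.
  final case class Stump(split: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= split) left else right
  }

  private def mean(ys: Seq[Double]): Double = ys.sum / ys.size

  // Fit the stump with the lowest squared error, trying every x as a split.
  def fitStump(data: Seq[Point]): Stump = {
    require(data.nonEmpty, "need at least one training point")
    val overall = mean(data.map(_._2))
    // An empty side falls back to the overall mean of the targets.
    def leaf(part: Seq[Point]): Double =
      if (part.isEmpty) overall else mean(part.map(_._2))
    val candidates = data.map(_._1).distinct.map { s =>
      val (l, r) = data.partition(_._1 <= s)
      Stump(s, leaf(l), leaf(r))
    }
    candidates.minBy { st =>
      data.map { case (x, y) => val e = y - st.predict(x); e * e }.sum
    }
  }

  // Bagging (the RF idea): every stump gets its own bootstrap sample, so the
  // fits do not depend on each other and could run in parallel.
  def bagging(data: Seq[Point], n: Int, seed: Long): Double => Double = {
    val rnd = new Random(seed)
    val stumps = Seq.fill(n) {
      val sample = Seq.fill(data.size)(data(rnd.nextInt(data.size)))
      fitStump(sample)
    }
    x => mean(stumps.map(_.predict(x)))
  }

  // Boosting (the GBT idea): each stump is fitted to the residuals left by
  // the stumps before it, so training is inherently sequential.
  def boosting(data: Seq[Point], n: Int, rate: Double): Double => Double = {
    val stumps = (1 to n).foldLeft(Vector.empty[Stump]) { (acc, _) =>
      val residuals = data.map { case (x, y) =>
        (x, y - rate * acc.map(_.predict(x)).sum)
      }
      acc :+ fitStump(residuals)
    }
    x => rate * stumps.map(_.predict(x)).sum
  }
}
```

In `bagging`, the body of `Seq.fill` never reads an earlier stump, which is why the trees can be trained in parallel. In `boosting`, each step of the fold needs the accumulated stumps, which is why GBT trains one tree at a time.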

What's next?

So far, we have mostly covered classic and tree-based algorithms for both regression and classification. We saw that ensemble techniques performed better than the classic algorithms. However, there are other approaches, such as the one-vs-rest algorithm, which solves a multiclass classification problem by training one binary classifier per class on top of a base classifier such as logistic regression.
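The one-vs-rest reduction itself is small enough to sketch in plain Scala. This is an illustration only, assuming at least two classes in the training data: the function names are hypothetical, and the base learner is a simple centroid-distance scorer swapped in for brevity, not logistic regression and not Spark ML's implementation.

```scala
object OneVsRestSketch {
  type Features = Vector[Double]
  // A binary scorer returns a larger value the more it favours "positive".
  type Scorer = Features => Double

  private def centroid(xs: Seq[Features]): Features =
    xs.transpose.map(col => col.sum / col.size).toVector

  private def distance(a: Features, b: Features): Double =
    math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)

  // Stand-in binary learner (NOT logistic regression): scores a point by how
  // much closer it lies to the positive centroid than to the negative one.
  def trainCentroidScorer(data: Seq[(Features, Boolean)]): Scorer = {
    val pos = centroid(data.collect { case (x, true) => x })
    val neg = centroid(data.collect { case (x, false) => x })
    x => distance(x, neg) - distance(x, pos)
  }

  // One-vs-rest: train one binary scorer per class (that class against all
  // the others) and predict the class whose scorer gives the highest score.
  def trainOneVsRest[L](
      data: Seq[(Features, L)],
      trainBinary: Seq[(Features, Boolean)] => Scorer): Features => L = {
    val scorers = data.map(_._2).distinct.map { label =>
      label -> trainBinary(data.map { case (x, y) => (x, y == label) })
    }
    x => scorers.maxBy { case (_, score) => score(x) }._1
  }
}
```

Because `trainOneVsRest` only needs a function from binary-labelled data to a scorer, any binary classifier that exposes a confidence score could be passed in place of `trainCentroidScorer`.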

Apart from this, neural-network-based approaches, such as the multilayer perceptron (MLP), convolutional neural network (CNN), and recurrent neural network (RNN), can also be used to solve supervised learning problems. However, as expected, these algorithms require a large number of training samples and substantial computing infrastructure. The datasets we have used so far in the examples had few samples and were not very high dimensional. This doesn't mean that we cannot use them to...

Summary

In this chapter, we had a brief introduction to powerful tree-based algorithms, such as DTs, GBT, and RF, for solving both classification and regression tasks. We saw how to develop these classifiers and regressors using tree-based and ensemble techniques. Through two real-world classification and regression problems, we saw how tree ensemble techniques outperform DT-based classifiers or regressors.

We covered supervised learning for both classification and regression on structured and labeled data. However, with the rise of cloud computing, IoT, and social media, unstructured data is growing at an unprecedented rate and accounts for more than 80% of all data, most of which is unlabeled.

Unsupervised learning techniques, such as clustering analysis and dimensionality reduction, are key applications in data-driven research and industry settings to find hidden structures from unstructured datasets...
