You're reading from Machine Learning with Scala Quick Start Guide

Product type: Book

Published in: Apr 2019

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781789345070

Edition: 1st Edition

Languages: Scala

Concepts: Machine Learning

Authors (2): Md. Rezaul Karim, Ajay Kumar N


Scala for Tree-Based Ensemble Techniques

In the previous chapter, we solved both classification and regression problems using linear models. We also used logistic regression, support vector machines, and Naive Bayes. However, in both cases, we did not achieve good accuracy because our models showed low confidence.

On the other hand, tree-based and tree ensemble classifiers are really useful, robust, and widely used for both classification and regression tasks. This chapter will provide a quick glimpse at developing these classifiers and regressors using tree-based and ensemble techniques, such as decision trees (DTs), random forests (RF), and gradient boosted trees (GBT), for both classification and regression. More specifically, we will revisit and solve both the regression (from Chapter 2, Scala for Regression Analysis) and classification (from Chapter 3, Scala for Learning...

Technical requirements

Make sure Scala 2.11.x and Java 1.8.x are installed and configured on your machine.

The code files for this chapter can be found on GitHub:

https://github.com/PacktPublishing/Machine-Learning-with-Scala-Quick-Start-Guide/tree/master/Chapter04

Check out the following playlist to see the Code in Action video for this chapter:
http://bit.ly/2WhQf2i

Decision trees and tree ensembles

DTs are supervised learning techniques used to solve classification and regression problems. As the name indicates, a DT has various branches, where each branch represents a possible decision, occurrence, or reaction in terms of statistical probability. When working with DTs, the data is split into two parts: the training set, from which the tree is learned, and the test set, on which the predicted labels or classes are evaluated.

Both binary and multiclass classification problems can be handled by DT algorithms, which is one of the reasons they are used across so many problems. For instance, for the admission example we introduced in Chapter 3, Scala for Learning Classification, DTs learn from the admission data to approximate the admission decision with a set of if...else decision rules, as shown in the following diagram:

Generating...
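To make the idea of if...else decision rules concrete, here is a minimal sketch in plain Scala. It is not the book's code: the feature names (gpa, gre), the thresholds, and the labels are hypothetical and written by hand, whereas a real DT would learn its splits from the training set.

```scala
// A decision tree is a set of nested if...else rules.
// Feature names, thresholds, and labels are hypothetical, for illustration only.
sealed trait Node
final case class Leaf(label: String) extends Node
final case class Split(feature: String, threshold: Double, left: Node, right: Node) extends Node

object AdmissionTreeSketch {
  // Hand-written tree; a trained DT would learn these splits from data.
  val tree: Node =
    Split("gpa", 3.0,
      Leaf("rejected"),
      Split("gre", 600.0, Leaf("rejected"), Leaf("admitted")))

  // Walk the tree: go left when the feature value is <= threshold, else go right.
  def predict(node: Node, x: Map[String, Double]): String = node match {
    case Leaf(label) => label
    case Split(feature, threshold, left, right) =>
      if (x(feature) <= threshold) predict(left, x) else predict(right, x)
  }

  def main(args: Array[String]): Unit = {
    println(predict(tree, Map("gpa" -> 3.6, "gre" -> 700.0)))
    println(predict(tree, Map("gpa" -> 2.5, "gre" -> 750.0)))
  }
}
```

Each Split corresponds to one if...else test and each Leaf to a predicted class, so the depth of the tree bounds how many tests are needed for a single prediction.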

Decision trees for supervised learning

In this section, we'll see how to use DTs to solve both regression and classification problems. In the previous two chapters, Chapter 2, Scala for Regression Analysis, and Chapter 3, Scala for Learning Classification, we solved the insurance severity claim and customer churn problems, which were regression and classification problems, respectively. In both cases, we used other classic models. Here, we'll see how we can solve them with tree-based and ensemble techniques. We'll use the DT implementation from the Apache Spark ML package in Scala.
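As a rough sketch of what using the Spark ML DT implementation can look like, the following trains a DecisionTreeClassifier on a tiny in-memory dataset. The column names (day_minutes, service_calls), the values, and the maxDepth setting are made up for illustration and are not the book's churn data or configuration. The code assumes Spark ML is on the classpath.

```scala
import org.apache.spark.ml.classification.{DecisionTreeClassificationModel, DecisionTreeClassifier}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object DecisionTreeSketch {
  def train(spark: SparkSession): DecisionTreeClassificationModel = {
    import spark.implicits._

    // Hypothetical toy data: (day minutes, customer service calls, churned?).
    val data = Seq(
      (120.0, 1.0, 0.0), (135.0, 0.0, 0.0), (150.0, 2.0, 0.0), (110.0, 1.0, 0.0),
      (280.0, 5.0, 1.0), (300.0, 4.0, 1.0), (265.0, 6.0, 1.0), (310.0, 5.0, 1.0)
    ).toDF("day_minutes", "service_calls", "label")

    // Spark ML estimators expect all features in a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("day_minutes", "service_calls"))
      .setOutputCol("features")

    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxDepth(3) // illustrative value, not a tuned setting

    dt.fit(assembler.transform(data))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DecisionTreeSketch")
      .getOrCreate()
    // toDebugString prints the learned tree as nested If/Else rules.
    println(train(spark).toDebugString)
    spark.stop()
  }
}
```

For a regression problem, the same structure should apply with DecisionTreeRegressor from org.apache.spark.ml.regression in place of the classifier.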

Decision trees for classification

First of all, we know the customer churn prediction problem in Chapter 3, Scala for Learning...

Gradient boosted trees for supervised learning

In this section, we'll see how to use GBT to solve both regression and classification problems. In the previous two chapters, Chapter 2, Scala for Regression Analysis, and Chapter 3, Scala for Learning Classification, we solved the insurance severity claim and customer churn problems, which were regression and classification problems, respectively. In both cases, we used other classic models. Here, we'll see how we can solve them with tree-based and ensemble techniques. We'll use the GBT implementation from the Spark ML package in Scala.
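Before turning to the Spark ML implementation, it may help to see the boosting idea in isolation. The sketch below is plain Scala and is not Spark's algorithm: it assumes a single numeric feature, squared-error loss, and one-split trees (stumps), and all names in it are hypothetical. Each new stump is fit to the residuals left by the stumps trained before it.

```scala
object BoostingSketch {
  // A regression stump on one feature: predict `left` if x <= threshold, else `right`.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(xs: Seq[Double]): Double = if (xs.isEmpty) 0.0 else xs.sum / xs.size

  // Try every observed x as a threshold and keep the stump with the lowest squared error.
  def fitStump(xs: Seq[Double], ys: Seq[Double]): Stump = {
    val pairs = xs.zip(ys)
    val candidates = xs.distinct.map { t =>
      val (l, r) = pairs.partition(_._1 <= t)
      val stump = Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
      val sse = pairs.map { case (x, y) => math.pow(y - stump.predict(x), 2) }.sum
      (sse, stump)
    }
    candidates.minBy(_._1)._2
  }

  // The ensemble prediction is the sum of the stumps, each scaled by the step size.
  def predict(ensemble: Seq[Stump], x: Double, stepSize: Double): Double =
    ensemble.map(s => stepSize * s.predict(x)).sum

  // Boosting: stumps are trained one at a time, each on the current residuals.
  def fit(xs: Seq[Double], ys: Seq[Double], rounds: Int, stepSize: Double): Seq[Stump] =
    (1 to rounds).foldLeft(Vector.empty[Stump]) { (ensemble, _) =>
      val residuals = xs.zip(ys).map { case (x, y) => y - predict(ensemble, x, stepSize) }
      ensemble :+ fitStump(xs, residuals)
    }

  def mse(ensemble: Seq[Stump], xs: Seq[Double], ys: Seq[Double], stepSize: Double): Double =
    mean(xs.zip(ys).map { case (x, y) => math.pow(y - predict(ensemble, x, stepSize), 2) })

  def main(args: Array[String]): Unit = {
    val xs = (1 to 8).map(_.toDouble)
    val ys = xs.map(x => x * x)
    for (rounds <- Seq(1, 5, 20)) {
      val ensemble = fit(xs, ys, rounds, stepSize = 0.5)
      println(s"rounds=$rounds trainingMSE=${mse(ensemble, xs, ys, 0.5)}")
    }
  }
}
```

Because every round depends on the residuals of the previous rounds, the stumps cannot be trained independently of each other, which is why boosting is inherently sequential.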

Gradient boosted trees for classification

We know the customer churn prediction problem from Chapter 3, Scala for Learning...

Random forest for supervised learning

In this section, we'll see how to use RF to solve both regression and classification problems. We'll use the RF implementation from the Spark ML package in Scala. Although both GBT and RF are ensembles of trees, their training processes are different. For instance, RF uses the bagging technique, training each tree on a random sample of the examples, while GBT uses boosting. There are several practical trade-offs between the two ensembles, which can make it hard to choose between them. However, RF would be the winner in most cases. Here are some justifications:

GBTs train one tree at a time, but RF can train multiple trees in parallel. So the training time is lower with RF. However, in some special cases, training and using a smaller number of trees with GBTs is faster and more convenient.
RFs are less prone to overfitting. In other words, RFs reduce...
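The bagging side of this comparison can be sketched in plain Scala as follows. This is a simplified illustration, not Spark's RF: it assumes a single numeric feature, binary labels, and one-split trees (stumps), it omits the random feature selection that a full RF also performs, and all names are hypothetical. Each stump is trained on its own bootstrap sample, and the ensemble predicts by majority vote.

```scala
import scala.util.Random

object BaggingSketch {
  type Example = (Double, Int) // (single feature value, class label 0 or 1)

  // A classification stump: predict leftLabel if x <= threshold, else rightLabel.
  final case class Stump(threshold: Double, leftLabel: Int, rightLabel: Int) {
    def predict(x: Double): Int = if (x <= threshold) leftLabel else rightLabel
  }

  // Most frequent label; defaults to 0 for an empty side of a split.
  private def majority(labels: Seq[Int]): Int =
    if (labels.isEmpty) 0 else labels.groupBy(identity).maxBy(_._2.size)._1

  // Try every observed x as a threshold and keep the stump with the fewest errors.
  def fitStump(data: Seq[Example]): Stump =
    data.map(_._1).distinct.map { t =>
      val (l, r) = data.partition(_._1 <= t)
      val stump = Stump(t, majority(l.map(_._2)), majority(r.map(_._2)))
      val errors = data.count { case (x, y) => stump.predict(x) != y }
      (errors, stump)
    }.minBy(_._1)._2

  // Bagging step: draw as many examples as the dataset has, with replacement.
  def bootstrap(data: IndexedSeq[Example], rng: Random): Seq[Example] =
    Seq.fill(data.size)(data(rng.nextInt(data.size)))

  // Each stump only needs its own sample, so the stumps could be trained in parallel;
  // they are trained in a loop here to keep the sketch simple.
  def fitForest(data: IndexedSeq[Example], numTrees: Int, seed: Long): Seq[Stump] = {
    val rng = new Random(seed)
    Seq.fill(numTrees)(fitStump(bootstrap(data, rng)))
  }

  // The ensemble predicts by majority vote over its stumps.
  def predict(forest: Seq[Stump], x: Double): Int = majority(forest.map(_.predict(x)))

  def main(args: Array[String]): Unit = {
    val data = (1 to 10).map(i => (i.toDouble, if (i <= 5) 0 else 1))
    val forest = fitForest(data, numTrees = 25, seed = 42L)
    println(predict(forest, 1.0))
    println(predict(forest, 10.0))
  }
}
```

Unlike boosting, no stump here depends on the output of another, which is the property that lets RF train its trees in parallel.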

What's next?

So far, we have mostly covered classic and tree-based algorithms for both regression and classification. We saw that the ensemble techniques performed better than the classic algorithms. However, there are other approaches, such as the one-vs-rest algorithm, which solves multiclass classification problems using other classifiers, such as logistic regression.

Apart from this, neural-network-based approaches, such as multilayer perceptron (MLP), convolutional neural network (CNN), and recurrent neural network (RNN), can also be used to solve supervised learning problems. However, as expected, these algorithms require a large number of training samples and a large computing infrastructure. The datasets we have used so far throughout the examples had relatively few samples. Moreover, they were not very high dimensional. This doesn't mean that we cannot use them to...

Summary

In this chapter, we had a brief introduction to powerful tree-based algorithms, such as DTs, GBT, and RF, for solving both classification and regression tasks. We saw how to develop these classifiers and regressors using tree-based and ensemble techniques. Through two real-world classification and regression problems, we saw how tree ensemble techniques outperform DT-based classifiers or regressors.

We covered supervised learning for both classification and regression on structured and labeled data. However, with the rise of cloud computing, IoT, and social media, unstructured data is growing at an unprecedented rate and accounts for more than 80% of all data, most of which is unlabeled.

Unsupervised learning techniques, such as clustering analysis and dimensionality reduction, are key applications in data-driven research and industry settings to find hidden structures from unstructured datasets...


Authors (2)

Md. Rezaul Karim

Md. Rezaul Karim is a researcher, author, and data science enthusiast with a strong computer science background, coupled with 10 years of research and development experience in machine learning, deep learning, and data mining algorithms to solve emerging bioinformatics research problems by making them explainable. He is passionate about applied machine learning, knowledge graphs, and explainable artificial intelligence (XAI). Currently, he is working as a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead software engineer at Samsung Electronics, Korea.

Ajay Kumar N

Ajay Kumar N has experience in big data, and specializes in cloud computing and various big data frameworks, including Apache Spark and Apache Hadoop. His primary language of choice is Python, but he also has a special interest in functional programming languages such as Scala. He has worked extensively with NumPy, pandas, and scikit-learn, and often contributes to open source projects related to data science and machine learning.
