You're reading from  Big Data Analytics with Java

Product type: Book
Published in: Jul 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787288980
Edition: 1st Edition
Author (1)
RAJAT MEHTA

The author is a VP (Technical Architect) in technology at JP Morgan Chase in New York. He is a Sun Certified Java Developer and has worked on Java-related technologies for more than 16 years. For the past few years, his role has heavily involved using the big data stack and running analytics on it. He also contributes to various open source projects that are available on his GitHub repository and writes frequently for developer magazines.

Chapter 7. Decision Trees

Decision trees are among the simplest and most popular machine learning algorithms, yet they are extremely powerful and used extensively. If you have used a flowchart before, then understanding a decision tree won't be at all difficult for you. A decision tree is a flowchart, except that in this case the machine learning algorithm builds the flowchart for you. From the input data, the decision tree algorithm automatically builds an internal knowledge base of rules, which it then uses to predict an outcome when given new data. In this chapter, we will cover the following topics:

  • Concepts of a decision tree machine learning classifier, including what a decision tree is, how it is built, and how it can be improved

  • The uses of the decision tree

  • A sample case study using decision trees for classification

Let's try to understand the basics of decision trees now.

What is a decision tree?


A decision tree is a machine learning algorithm that belongs to the family of supervised learning algorithms. As such, it relies on training data. From the features in the training data and the target variable, it learns and builds its knowledge base, which it later uses to make decisions on new data. Even though decision trees are mostly used in classification problems, they work well in regression problems too. That is, they can be used to classify between discrete values (such as 'has disease' or 'no disease') or to estimate continuous values (such as the price of a commodity based on some rules).
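To make the difference between the two kinds of output concrete, here is a minimal sketch of two tiny hand-written "flowcharts" in Java. The class name, the features (age, blood pressure, weight, grade), and every threshold and price are hypothetical and chosen only for illustration; a real decision tree algorithm would learn such rules from training data rather than have them typed in.

```java
public class TreeTypesSketch {

    // Classification: every path ends in a discrete label.
    // The features and thresholds are made up for illustration.
    static String classify(double age, double bloodPressure) {
        if (age <= 50.0) {
            return "no disease";
        }
        return bloodPressure <= 140.0 ? "no disease" : "has disease";
    }

    // Regression: every path ends in a continuous value.
    // The prices are made up for illustration.
    static double estimatePrice(double weightKg, boolean premiumGrade) {
        if (weightKg <= 10.0) {
            return premiumGrade ? 42.5 : 30.0;
        }
        return premiumGrade ? 80.0 : 61.25;
    }

    public static void main(String[] args) {
        System.out.println(classify(62, 155));
        System.out.println(estimatePrice(8.0, true));
    }
}
```

Both methods have the same flowchart shape; the only difference is whether the end of each path holds a category or a number.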

As mentioned earlier, there are two types of decision trees:

  • Decision trees for classification: These are the decision tree algorithms that are used to classify categorical values, for example, figuring out whether a new customer could be a potential loan defaulter or not.

  • Decision trees for regression: These are the decision...

Summary


In this chapter, we covered a very important and popular machine learning algorithm called the decision tree. A decision tree is very similar to a flowchart and is based on a set of rules. A decision tree algorithm learns from a dataset and builds a set of rules. Based on these rules, it splits the dataset into two (in the case of binary splits) or more parts. When new data is fed in for prediction, its attributes determine which branch is taken at each rule, and the data follows that path through the tree until a particular response is reached.
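The path-following step can be sketched in a few lines of Java. This is only an illustrative sketch assuming binary splits on numeric attributes; the `Node` class, the `predict` method, and the sample loan tree (its features, thresholds, and labels) are hypothetical names and values, not code from the chapter.

```java
public class TreeTraversalSketch {

    /** A node is either a rule (attribute <= threshold) or a leaf holding a response. */
    static final class Node {
        final int feature;       // index of the attribute tested at this node
        final double threshold;  // rule: go left when the value is <= threshold
        final Node left;
        final Node right;
        final String response;   // non-null only for leaves

        Node(int feature, double threshold, Node left, Node right) {
            this.feature = feature;
            this.threshold = threshold;
            this.left = left;
            this.right = right;
            this.response = null;
        }

        Node(String response) {
            this.feature = -1;
            this.threshold = Double.NaN;
            this.left = null;
            this.right = null;
            this.response = response;
        }

        boolean isLeaf() {
            return response != null;
        }
    }

    /** Follows one path of rules from the root until a leaf (the response) is reached. */
    static String predict(Node root, double[] row) {
        Node node = root;
        while (!node.isLeaf()) {
            node = row[node.feature] <= node.threshold ? node.left : node.right;
        }
        return node.response;
    }

    /** Hypothetical tree: attribute 0 = annual income, attribute 1 = missed payments. */
    static Node sampleTree() {
        return new Node(1, 2.0,
                new Node("not a defaulter"),
                new Node(0, 40000.0,
                        new Node("potential defaulter"),
                        new Node("not a defaulter")));
    }

    public static void main(String[] args) {
        System.out.println(predict(sampleTree(), new double[] {30000.0, 4.0}));
    }
}
```

Each call to `predict` visits exactly one rule per level of the tree, so the cost of a prediction depends on the depth of the tree rather than on the size of the training data.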

There are many ways in which we can split data in a decision tree. We explored two of the most common measures, Entropy and Gini Impurity. In either case, the main criterion is to choose the split that makes the resulting subsets as homogeneous as possible. Both Entropy and Gini Impurity are mathematical formulas, and as such the entire model works on numerical data.
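As a quick refresher, both measures can be computed from the class counts in a subset using the standard definitions: entropy is the sum of -p * log2(p) over the classes, and Gini impurity is 1 minus the sum of p squared. The Java sketch below is illustrative only; the class and method names are hypothetical, and `weightedGini` assumes a binary split whose quality is scored by the size-weighted impurity of its two sides.

```java
public class ImpuritySketch {

    /** Entropy in bits of a subset described by its per-class counts. */
    static double entropy(int[] counts) {
        int total = sum(counts);
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) {
                continue; // 0 * log(0) is treated as 0
            }
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** Gini impurity of a subset described by its per-class counts. */
    static double gini(int[] counts) {
        int total = sum(counts);
        if (total == 0) {
            return 0.0;
        }
        double sumOfSquares = 0.0;
        for (int c : counts) {
            double p = (double) c / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    /** Size-weighted Gini impurity of a binary split; lower means more homogeneous sides. */
    static double weightedGini(int[] left, int[] right) {
        int nLeft = sum(left);
        int nRight = sum(right);
        int n = nLeft + nRight;
        if (n == 0) {
            return 0.0;
        }
        return ((double) nLeft / n) * gini(left) + ((double) nRight / n) * gini(right);
    }

    private static int sum(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] mixed = {5, 5};   // evenly mixed subset: least homogeneous for two classes
        int[] pure = {10, 0};   // single-class subset: perfectly homogeneous
        System.out.println(entropy(mixed) + " " + gini(mixed));
        System.out.println(entropy(pure) + " " + gini(pure));
    }
}
```

A perfectly homogeneous subset scores 0 on both measures, while an evenly mixed two-class subset scores the maximum (1 bit of entropy, 0.5 Gini impurity), which is why a split with a lower weighted score is preferred.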

In the next chapter, we will learn...
