You're reading from Hands-On Machine Learning with Microsoft Excel 2019

Product type: Book
Published in: Apr 2019
Publisher: Packt
ISBN-13: 9781789345377
Edition: 1st Edition
Author (1)
Julio Cesar Rodriguez Martino

Julio Cesar Rodriguez Martino is a machine learning (ML) and artificial intelligence (AI) platform architect, focusing on applying the latest techniques and models in these fields to optimize, automate, and improve the work of tax and accounting consultants. The main tool used in this practice is the MS Office platform, which Azure services complement perfectly by adding intelligence to the different tasks. Julio's background is in experimental physics, where he learned and applied advanced statistical and data analysis methods. He also teaches university courses and provides in-company training on machine learning and analytics, and has a lot of experience leading data science teams.

Hands-On Examples of Machine Learning Models

Supervised learning is the simplest way of teaching a model how the world looks. Showing how a given combination of input variables leads to a certain output, that is, using labeled data, makes it possible for a computer to predict the output for another, similar dataset that it has never seen. Unsupervised learning deals with finding patterns and useful insights in unlabeled data.

We will study different types of machine learning models, trying to understand the details and actually performing the necessary calculations so that the inner workings of these models are clear and reproducible.

In this chapter, the following topics will be covered:

  • Understanding supervised learning with multiple linear regression
  • Understanding supervised learning with decision trees
  • Understanding unsupervised learning with clustering
...

Technical requirements

There are no technical requirements for this chapter. We just need to input the values shown in the tables within each section in an Excel sheet in order to follow the explanation closely.

Understanding supervised learning with multiple linear regression

In the previous chapter, we followed an example of linear regression using two variables. It is interesting to see how we can apply regression to more than two variables (called multiple linear regression) and extract useful information from the results.

Suppose that you are asked to test whether there exists a hidden policy of gender discrimination in a company. You could be working for a law firm that is leading a trial against this company, and they need data-based evidence to back up their claim.

You would start by taking a sample of the company's payroll, including several variables that describe each employee and the last salary increase amount. The following screenshot shows a set of values after they've been entered in an Excel worksheet:

There are four numerical features in the dataset:

  • ID:...
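As a rough illustration of what multiple linear regression computes, the following Python sketch fits an intercept and one coefficient per input variable by ordinary least squares (solving the normal equations). It is not the book's Excel procedure. The data is synthetic and invented for this example, and the names `solve_linear_system` and `fit_multiple_linear_regression` are hypothetical helpers.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple_linear_regression(rows, targets):
    """Return [intercept, coef_1, ..., coef_k] minimizing the squared error."""
    # Prepend a constant 1.0 to every row so the first coefficient is the intercept.
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data (an assumption, NOT the book's payroll sample).
# Each row: (years of experience, gender encoded as 0 or 1).
rows = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 0), (6, 1)]
# Targets generated from a known rule so the fit can be checked by eye.
targets = [100.0 + 50.0 * exp + 20.0 * gender for exp, gender in rows]

coefficients = fit_multiple_linear_regression(rows, targets)
print(coefficients)  # approximately [100.0, 50.0, 20.0], up to rounding
```

In a setting like the one described above, the coefficient attached to the encoded gender column is the quantity of interest: it estimates how much the target changes with gender once the other variables are held fixed.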

Understanding supervised learning with decision trees

The decision tree algorithm uses a tree-like model of decisions. Its name is derived from the graphical representation of the cascading process that partitions the records. The algorithm chooses the input variables that best split the dataset into subsets that are purer in terms of the target variable, ideally subsets that each contain only one value of this variable. Decision trees are among the most widely used and easiest to understand classification algorithms.

The outcome of the tree algorithm calculation is a set of simple rules that explain which values or intervals of the input variables best split the original data. The fact that the results and the path followed to get to them can be clearly shown gives decision trees an advantage over other algorithms. Explainability is a serious problem for some machine learning...
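One common way to quantify how pure a subset is uses Shannon entropy, and the benefit of a split can then be measured as the reduction in entropy (information gain). The Python sketch below assumes that measure. The labels and feature values are hypothetical, and `entropy` and `information_gain` are illustrative names rather than functions from the book.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on a feature."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical target and two candidate input variables (not the book's data).
target = ["yes", "no", "yes", "no"]
perfect_feature = ["a", "b", "a", "b"]   # separates the classes completely
useless_feature = ["a", "a", "b", "b"]   # each subset is still a 50/50 mix

print(entropy(target))                            # 1.0
print(information_gain(perfect_feature, target))  # 1.0
print(information_gain(useless_feature, target))  # 0.0
```

A subset split evenly between two classes has an entropy of one bit, the maximum for a binary target, while a subset containing a single class has an entropy of zero. Under this measure the tree would prefer `perfect_feature`, since it removes all of the uncertainty.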

Understanding unsupervised learning with clustering

Clustering is a statistical method that attempts to group the points in a dataset according to a distance measure, usually the Euclidean distance, which is the square root of the sum of the squared differences between the coordinates of a pair of points. Put simply, points assigned to the same cluster are closer (in terms of the distance defined) to each other than they are to points belonging to other clusters. At the same time, the larger the distance between two clusters, the better we can distinguish them. In other words, we try to build groups whose members are similar to one another and different from the members of other groups.
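The Euclidean distance can be written out directly in a few lines of Python. This is a minimal sketch, and the function name `euclidean_distance` is chosen here for illustration.

```python
from math import sqrt


def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences between p and q."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The same function works for any number of coordinates, so it applies equally to datasets with more than two features.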

It is clear that the most important part of a clustering algorithm is to define and calculate the distance between two given points and to iteratively assign the points...
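The text is cut off at this point, so the exact procedure is not visible here. Assuming a k-means-style algorithm, the iterative assignment could be sketched in Python as follows. The points, the starting centroids, and the names `assign_points`, `update_centroids`, and `k_means` are all hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def assign_points(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        for p in points
    ]


def update_centroids(points, assignments, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for i, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == i]
        if not members:
            updated.append(old)  # keep an empty cluster's centroid where it is
            continue
        updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return updated


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and centroid updates until assignments stop changing."""
    centroids = [tuple(c) for c in centroids]
    assignments = assign_points(points, centroids)
    for _ in range(max_iterations):
        centroids = update_centroids(points, assignments, centroids)
        new_assignments = assign_points(points, centroids)
        if new_assignments == assignments:
            break
        assignments = new_assignments
    return assignments, centroids


# Hypothetical 2D points forming two visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
start = [(0, 0), (10, 10)]  # assumed starting centroids

assignments, centroids = k_means(points, start)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

The result of a procedure like this generally depends on the starting centroids, which is why they are passed in explicitly rather than chosen inside the function.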

Summary

In this chapter, we have described real-life examples of supervised and unsupervised machine learning models applied to solving problems. We covered multiple regression, decision trees, and clustering. We have also shown how to choose and transform the input variables, or features, to be ingested by the models.

This chapter only shows the basic principles of each algorithm. In real data analysis and prediction using machine learning, models are already programmed and can be used as black boxes. It is, therefore, extremely important to understand the basics of each model and know whether we are using it correctly.

In the following chapters, we will focus on how to extract the data from different sources, transform it according to our needs, and use previously built models for analysis.

Questions

  1. Why is it important to encode categorical features?
  2. What are the different ways to stop a decision tree calculation?
  3. Temperature_hot has an entropy value of one in the example. Why?
  4. Following the diagram of the decision tree at the beginning of the Understanding supervised learning with decision trees section, what would be the path to decide whether or not to train outside? Consider using IF statements.
  5. Would the cluster distribution change if we choose different starting points? You can read about this in the recommended articles.
  6. Is the clustering that's obtained with iterative analysis the same as the one that's determined visually? Why?

Further reading
