Reader small image

You're reading from  Hands-On Machine Learning with Microsoft Excel 2019

Product typeBook
Published inApr 2019
PublisherPackt
ISBN-139781789345377
Edition1st Edition
Tools
Right arrow
Author (1)
Julio Cesar Rodriguez Martino
Julio Cesar Rodriguez Martino
author image
Julio Cesar Rodriguez Martino

Julio Cesar Rodriguez Martino is a machine learning (ML) and artificial intelligence (AI) platform architect, focusing on applying the latest techniques and models in these fields to optimize, automate, and improve the work of tax and accounting consultants. The main tool used in this practice is the MS Office platform, which Azure services complement perfectly by adding intelligence to the different tasks. Julio's background is in experimental physics, where he learned and applied advanced statistical and data analysis methods. He also teaches university courses and provides in-company training on machine learning and analytics, and has a lot of experience leading data science teams.
Read more about Julio Cesar Rodriguez Martino

Right arrow

Data Cleansing and Preliminary Data Analysis

After acquiring the right data, the most difficult and time-consuming task is getting it ready for analysis. Understanding what can and cannot be done with a given dataset is the first step before any model can be used. This chapter demonstrates how to use Excel functions to search and replace patterns, and how to find wrong data types and missing data. It also contains some useful diagrams so that we can get insights from the data and understand the different variables.

In this chapter, we will cover the following topics:

  • Cleansing data
  • Visualizing data for preliminary analysis
  • Understanding unbalanced datasets

Technical requirements

Cleansing data

Data is never clean – it always contains missing values, errors, incorrect formats, and other problems that make it impossible to feed to a machine learning model without preprocessing. This is what data cleansing is all about – correcting all these problems before starting the real analysis.

As an example of how to clean a dataset, we will use the Titanic passengers dataset. We will repeat the procedure described in the Importing data from another Excel workbook section of the previous chapter, to import data from an Excel workbook. We will use real data from the Titanic passengers and demonstrate how you can prepare it for analysis.

To clean a dataset, perform the necessary steps, as follows:

  1. Navigate to Data | From File | From Workbook, as shown in the following screenshot:
  1. After selecting the titanic.xlsx file and the Passenger data worksheet...

Visualizing data for preliminary analysis

After cleaning the dataset, it is always recommended to visualize it. This helps us gain an understanding of the different variables, how their values are distributed, and the correlations that exist between them (we will explore correlations in more detail in the next chapter). We can determine which variables are important to our analyses, which ones give us more information, and which ones can be discarded for being redundant.

We will start by looking a several bar plots, where we will either count the number of occurrences of each value (using a histogram), or we will show the percentage of each value with respect to the total (using a bar plot). To achieve this, perform the following steps:

  1. Right-click on any cell within the table to access the Quick Analysis option:
  1. In the pop-up window, we can choose the chart type. Select Clustered...

Understanding unbalanced datasets

To be able to compare the results of different variables, we need to take into account the different number of samples of each class. Let's suppose that we want to train a machine learning model to predict whether a given passenger would survive, based on age group, gender, and class. If we plot the distribution of values in the survived variable, we will see the following:

It is clear from the preceding diagram and table that there are nearly twice as many non-survivors than survivors. If we use this dataset as is, we are introducing a bias to our dataset that will affect the results. Predicting 0 for the survival variable will be approximately two times more probable than predicting 1. An exception to this statement are decision trees and their related predictive models (such as random forests and XGBoost), which can correctly deal with...

Summary

In this chapter, we explored different methods of dealing with missing data and learned how to group or summarize it. We have shown you how important it is to visualize the data after cleaning, in order to be able to understand and interpret the results, from basic to more advanced model predictions. This is the beginning of any feature engineering, since we transform and/or discard features based on their values. Too many missing values will imply that we cannot use that variable (or feature), or a high correlation will imply that we can discard one of the correlated variables. We will dive deeper into correlations in the next chapter, showing you how to measure them quantitatively, using different methods.

Preliminary data visualization is extremely important to gain an understanding of data properties and to interpret the results we obtain, even after applying a machine...

Questions

  1. Reviewing what was explained in the previous chapter, build a decision tree using class, gender, and Age group as features, and survived as the target variable. You should be able to define some conditions for a passenger to survive.
  2. What variables in the dataset do you think are highly correlated?
  3. Suppose that the dataset contains a numerical variable with only a few missing values. Is it possible to replace these missing values with numerical values? What value would you use?
  4. Explain what bias means and why it is important to avoid it.
  5. What other types of diagrams could be use for preliminary data analysis? Try some of them in the given dataset.

Further reading

  • Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data, 1st Edition, by Jason W. Osborn
  • Introduction to Data Visualization Techniques Using Microsoft Excel 2013 & Web-based Tools, by Tufts Data Lab
  • Analysing the Classification of Imbalanced Data-sets with Multiple Classes: Binarization Techniques and Ad-Hoc Approaches for Preprocessing and Cost Sensitive Learning, by A. Fernández, V. López, M. Galar, M.J. del Jesus and F. Herrera
lock icon
The rest of the chapter is locked
You have been reading a chapter from
Hands-On Machine Learning with Microsoft Excel 2019
Published in: Apr 2019Publisher: PacktISBN-13: 9781789345377
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Julio Cesar Rodriguez Martino

Julio Cesar Rodriguez Martino is a machine learning (ML) and artificial intelligence (AI) platform architect, focusing on applying the latest techniques and models in these fields to optimize, automate, and improve the work of tax and accounting consultants. The main tool used in this practice is the MS Office platform, which Azure services complement perfectly by adding intelligence to the different tasks. Julio's background is in experimental physics, where he learned and applied advanced statistical and data analysis methods. He also teaches university courses and provides in-company training on machine learning and analytics, and has a lot of experience leading data science teams.
Read more about Julio Cesar Rodriguez Martino