Chapter 14: Data Transformation and Massaging

Congratulations, you've made your way to the last chapter of the third part of the book – The Preprocessing. In this part of the book, we have so far covered data cleaning, data integration, and data reduction. In this chapter, we will add the last piece to the arsenal of our data preprocessing tools – data transformation and massaging.

Data transformation is normally the last data preprocessing step applied to our datasets. A dataset may need to be transformed to be ready for a prescribed analysis, a specific transformation may help a certain analytics tool perform better, or, without the correct transformation, the results of our analysis might be misleading.

In this chapter, we will cover when and where we need data transformation, along with the techniques needed in a wide range of data preprocessing situations. We're going to cover the following main topics:

  • The whys of data transformation and massaging
  • Normalization and standardization
  • Binary coding, ranking transformation, and discretization
  • Attribute construction
  • Feature extraction
  • Log transformation
  • Smoothing, aggregation, and binning

Technical requirements

All of the code and the datasets used in this book are available in a GitHub repository created exclusively for this book: https://github.com/PacktPublishing/Hands-On-Data-Preprocessing-in-Python. You can find this chapter's material in the repository and download the code and data to follow along.

The whys of data transformation and massaging

Data transformation comes at the very last stage of data preprocessing, right before using the analytic tools. At this stage, the dataset has already gone through the following steps:

  • Data cleaning: The dataset is cleaned at all three cleaning levels (Chapters 9–11).
  • Data integration: All the potentially beneficial data sources are recognized and a dataset that includes the necessary information is created (Chapter 12, Data Fusion and Integration).
  • Data reduction: If needed, the size of the dataset has been reduced (Chapter 13, Data Reduction).

At this stage of data preprocessing, we may still have to make some changes to the data before moving on to the analysis stage. The dataset will undergo these changes for one of three reasons, which we will call necessity, correctness, and effectiveness. The following list provides more detail on each reason:

  • Necessity: The analytic method cannot accept the dataset in its current form, so the data must be transformed before the prescribed analysis can be performed.
  • Correctness: Without the correct data transformation, the results of our analysis might be misleading.
  • Effectiveness: A specific transformation can help a certain analytics tool perform better.

Normalization and standardization

At different points during our journey in this book, we've already talked about and used normalization and standardization. For instance, before applying K-Nearest Neighbors (KNN) in Chapter 7, Classification, and before using K-means on our dataset in Chapter 8, Clustering Analysis, we used normalization. Furthermore, before applying Principal Component Analysis (PCA) to our dataset for unsupervised dimension reduction in Chapter 13, Data Reduction, we used standardization.

Here is the general rule for when we need normalization or standardization. We need normalization when the ranges of all the attributes in a dataset must be equal. This is especially important for algorithms that use the distance between data objects, such as K-means and KNN. On the other hand, we need standardization when the variance and/or standard deviation of all the attributes must be equal. We saw an example of this in Chapter 13, Data Reduction, where we standardized the data before applying PCA.
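As a quick illustration, the following sketch applies both transformations with pandas; the DataFrame and its column names are hypothetical, not from the book's datasets:

import pandas as pd

# A small, made-up dataset with attributes on very different scales
df = pd.DataFrame({'Age': [22, 35, 58, 41],
                   'Income': [38000, 52000, 91000, 64000]})

# Normalization (min-max scaling): every attribute ends up in [0, 1],
# so all attributes have an equal range
df_normalized = (df - df.min()) / (df.max() - df.min())

# Standardization (z-score): every attribute ends up with mean 0 and
# standard deviation 1, so all attributes have an equal spread
df_standardized = (df - df.mean()) / df.std()

print(df_normalized)
print(df_standardized)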

Binary coding, ranking transformation, and discretization

In our analytics journey, there will be many instances in which we want to transform our data from numerical representation to categorical representation, or vice versa. To do these transformations, we will have to use one of three tools: binary coding, ranking transformation, and discretization.

As the following figure shows, to switch from categories to numbers, we use either binary coding or ranking transformation, while to switch from numbers to categories, we use discretization:

Figure 14.3 – Direction of application for binary coding, ranking transformation, and discretization

One question that the preceding figure might bring to mind is, how do we know which one to choose when we want to move from categories to numbers: binary coding or ranking transformation? The answer is simple.

If the categories are nominal, we can only use binary coding; if they are ordinal, both binary coding and ranking transformation are options.
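To make the three directions concrete, here is a minimal pandas sketch of each; the attribute names, the rank mapping, and the bin edges are made up for illustration:

import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue'],   # nominal
    'Education': ['High School', 'Bachelor',
                  'Master', 'Bachelor'],          # ordinal
    'Age': [23, 37, 45, 31]                       # numerical
})

# Binary coding of a nominal attribute: one 0/1 column per category
binary_coded = pd.get_dummies(df['Color'], prefix='Color')

# Ranking transformation of an ordinal attribute: replace each
# category with its rank in the natural order
rank_map = {'High School': 1, 'Bachelor': 2, 'Master': 3}
df['Education_Rank'] = df['Education'].map(rank_map)

# Discretization of a numerical attribute: cut the numerical range
# into labeled bins
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 100],
                         labels=['Young', 'Middle', 'Senior'])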

Attribute construction

We've already seen an example of this type of data transformation. We saw that we could employ it to transform categorical attributes into numerical ones. As we discussed, using attribute construction requires a deep understanding of the environment that the data has been collected from. For instance, in Figure 14.6, we were able to construct the Education Years attribute from Education level because we have a pretty good idea of the workings of the education system in the environment the data was collected from.
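A minimal sketch of that construction might look like the following; the level names and year counts are illustrative assumptions, not the exact values in Figure 14.6:

import pandas as pd

# Hypothetical education data
df = pd.DataFrame({'Education Level': ['High School', 'Bachelor',
                                       'Master', 'Doctorate']})

# Domain knowledge of the education system tells us roughly how many
# years of schooling each level represents (values are illustrative)
years_map = {'High School': 12, 'Bachelor': 16,
             'Master': 18, 'Doctorate': 22}
df['Education Years'] = df['Education Level'].map(years_map)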

Attribute construction can also be done by combining more than one attribute. Let's see an example and learn how this could be possible.

Example – construct one transformed attribute from two attributes

Do you know what Body Mass Index (BMI) is? BMI is a result of attribute construction by researchers and physicians, who were looking for a healthiness index that takes both the weight and height of individuals into account...
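The BMI formula itself is well known: weight in kilograms divided by the square of height in meters. A sketch of the construction, with hypothetical column names, follows:

import pandas as pd

# Hypothetical measurements for three individuals
df = pd.DataFrame({'Weight_kg': [70, 85, 60],
                   'Height_m': [1.75, 1.80, 1.65]})

# BMI combines two attributes into one constructed attribute:
# weight (kg) divided by the square of height (m)
df['BMI'] = df['Weight_kg'] / df['Height_m'] ** 2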

Feature extraction

This type of data transformation is very similar to attribute construction. In both, we use our deep knowledge of the original data to derive transformed attributes that are more helpful for our analysis purposes.

In attribute construction, we either come up with a completely new attribute from scratch or combine some attributes to make a transformed attribute that is more useful; however, in feature extraction, we unpack and pick apart a single attribute and only keep what is useful for our analysis.

As always, we will learn what we just discussed in the best way possible – through examples! We will see some illuminating examples in this arena.

Example – extract three attributes from one attribute

The following figure shows the transformation of the Email attribute into three binary attributes. Every email ends with @aWebAddress; by looking at the web address providing the email service, we have extracted the three binary attributes Popular Free Platform, .edu...
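A sketch of this kind of extraction with pandas string methods follows; the sample addresses and the list of free platforms are assumptions, and the figure extracts a third binary flag beyond the two shown here:

import pandas as pd

# Hypothetical email data
df = pd.DataFrame({'Email': ['alice@gmail.com', 'bob@redlands.edu',
                             'carol@company.com']})

# Unpack the single Email attribute: keep only the part after '@'
domain = df['Email'].str.split('@').str[1]

# Extract binary attributes from the domain; the platform list is
# an assumption for illustration
free_platforms = ['gmail.com', 'yahoo.com', 'hotmail.com']
df['Popular Free Platform'] = domain.isin(free_platforms).astype(int)
df['.edu'] = domain.str.endswith('.edu').astype(int)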

Log transformation

We should use this data transformation when an attribute experiences exponential growth or decline across the population of our data objects. When you draw a box plot of such an attribute, you expect to see fliers, but those are not mistaken records, nor are they unnatural outliers. Those significantly larger or smaller values come naturally from the environment.

Attributes with exponential growth or decline may be problematic for data visualization and clustering analysis; furthermore, they can be problematic for some prediction and classification algorithms where the method uses the distance between the data objects, such as KNN, or where the method derives its performance from collective performance metrics, such as linear regression.

These attributes may sound very hard to deal with, but there is a very easy fix for them – log transformation. In short, instead of using the attribute, you calculate the logarithms of all of the values and use them...
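A minimal sketch of the transformation, using a made-up attribute whose values span several orders of magnitude:

import numpy as np
import pandas as pd

# A hypothetical attribute with naturally exponential values
df = pd.DataFrame({'Income': [20000, 45000, 130000, 2500000]})

# Log transformation: work with the logarithms of the values instead
# of the raw values (all values must be positive)
df['Log_Income'] = np.log(df['Income'])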

Smoothing, aggregation, and binning

In our discussion about noise in data in Chapter 11, Data Cleaning Level III – Missing Values, Outliers, and Errors, we learned that there are two types of errors – systematic errors and unavoidable noise. In that chapter, we discussed how to deal with systematic errors; here, we will discuss noise. Noise is not covered under data cleaning because it is an unavoidable part of any data collection. However, we discuss it here under data transformation, as we may be able to take measures to handle it as well as possible. The three methods that can help deal with noise are smoothing, aggregation, and binning.

It might seem surprising that these methods are only applied to time-series data to deal with noise. However, there is a distinct and definitive reason for it. You see, it is only in time-series data, or any data that...
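To show what the three methods look like in practice, here is a minimal pandas sketch on a made-up noisy time series; the values, dates, window size, and bin size are all assumptions:

import numpy as np
import pandas as pd

# A hypothetical noisy daily time series
ts = pd.Series([5, 7, 6, 9, 8, 12, 11, 14],
               index=pd.date_range('2022-01-01', periods=8, freq='D'))

# Smoothing: a rolling mean dampens the noise in each observation
smoothed = ts.rolling(window=3).mean()

# Aggregation: summarize the series at a coarser time granularity
aggregated = ts.resample('2D').mean()

# Binning: group consecutive observations into fixed-size bins and
# replace each bin with its mean
binned = ts.groupby(np.arange(len(ts)) // 2).mean()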

Summary

Congratulations to you for completing this chapter. In this chapter, we added many useful tools to our data preprocessing armory, specifically in the data transformation area. We learned how to distinguish between data transformation and data massaging. Furthermore, we learned how to transform our data from numerical to categorical, and vice versa. We learned about attribute construction and feature extraction, which are very useful for high-level data analysis. We also learned about log transformation, which is one of the oldest and most effective tools. And lastly, we learned three methods that are very useful in our arsenal for dealing with noise in data.

By finishing this chapter successfully, you are also coming to the end of the third part of this book – The Preprocessing. By now, you know enough to be very successful at preprocessing data in a way that leads to effective data analytics. In the next part of the book, we will work through three case studies (Chapters 15–17).

Exercise

  1. In your own words, what are the differences and similarities between normalization and standardization? Why do some people use them interchangeably?
  2. Two instances of data transformation performed during the discussion of binary coding, ranking transformation, and discretization can be labeled as massaging. Try to spot them and explain why they can be labeled that way.
  3. Of course, we know that one of the ways that the color of a data object is presented is by its name. This is why we would assume color should probably be a nominal attribute. However, you can transform this usually nominal attribute into a numerical one. What are the two possible approaches? (Hint: one of them is an attribute construction using RGB coding.) Apply the two approaches to the following small dataset. The data shown in the table below is accessible in the color_nominal.csv file:

    Figure 14.27 – color_nominal.csv

    Once after binary coding and once after RGB attribute...
