You're reading from Deep Learning for Beginners

Product type: Book
Published in: Sep 2020
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781838640859
Edition: 1st Edition
Author (1)
Dr. Pablo Rivas

Dr. Pablo Rivas is an assistant professor of computer science at Baylor University in Texas. He worked in industry for a decade as a software engineer before becoming an academic. He is a senior member of the IEEE, ACM, and SIAM. He was formerly at NASA Goddard Space Flight Center performing research. He is an ally of women in technology, a deep learning evangelist, machine learning ethicist, and a proponent of the democratization of machine learning and artificial intelligence in general. He teaches machine learning and deep learning. Dr. Rivas is a published author and all his papers are related to machine learning, computer vision, and machine learning ethics. Dr. Rivas prefers Vim to Emacs and spaces to tabs.
Preparing Data

Now that you have prepared your system to learn about deep learning (see Chapter 2, Setup and Introduction to Deep Learning Frameworks), we will give you important guidelines about data that you may encounter frequently when practicing deep learning. Well-prepared datasets would let you focus on designing your models rather than preparing your data. In practice, however, this is rarely a realistic expectation: ask any data scientist or machine learning professional, and they will tell you that an important aspect of modeling is knowing how to prepare your data. Knowing how to deal with your data and how to prepare it will save you many hours of work that you can spend fine-tuning your models. Any time spent preparing your data is time well invested.

This...

Binary data and binary classification

In this section, we will focus all our efforts on preparing data with binary inputs or targets. By binary, of course, we mean values that can be represented as either 0 or 1. Notice the emphasis on the words represented as. The reason is that a column may contain data that is not necessarily a 0 or a 1, but could be interpreted as or represented by a 0 or a 1.

Consider the following fragment of a dataset:

x1    x2    ...    y
0     5     ...    a
1     7     ...    a
1     5     ...    b
0     7     ...    b

In this short dataset example with only four rows, the column x1 has values that are clearly binary: each is either 0 or 1. The column x2 may not look binary at first glance, but the only values it contains are 5 and 7. This means that the data can be correctly and uniquely mapped to a set of two values. Therefore, we could map 5 to 0 and 7 to 1, or vice versa; the choice does not matter.
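The mapping described above can be sketched in a few lines of plain Python. This is only an illustration: the dictionary names are hypothetical, the elided columns are left out, and mapping the y column ("a" and "b") to 0 and 1 in the same way is an assumption made here for completeness.

```python
# Rows of the example fragment as (x1, x2, y); the elided columns are omitted.
rows = [(0, 5, "a"), (1, 7, "a"), (1, 5, "b"), (0, 7, "b")]

# Hypothetical mappings; swapping the 0 and the 1 would work equally well.
x2_to_bit = {5: 0, 7: 1}
y_to_bit = {"a": 0, "b": 1}  # assumption: y is treated like x2

# Apply both mappings row by row; x1 is already binary and is left alone.
binary_rows = [(x1, x2_to_bit[x2], y_to_bit[y]) for x1, x2, y in rows]
print(binary_rows)
```

A dictionary lookup raises a KeyError for any value outside the two expected ones, which is a cheap way to confirm that a column really is binary.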

A similar...

Categorical data and multiple classes

Now that you know how to binarize data for different purposes, we can look into other types of data, such as categorical or multi-labeled data, and how to make them numeric. Most advanced deep learning algorithms, in fact, only accept numerical data. This is a design constraint rather than a real obstacle, because there are easy ways to take categorical data and convert it to a meaningful numerical representation.

Categorical data has information embedded as distinct categories. These categories can be represented as numbers or as strings. For example, a dataset may have a column named country with items such as "India", "Mexico", "France", and "U.S.", or a column of zip codes such as 12601, 85621, and 73315. The former is non-numeric categorical data, and the latter is numeric categorical data. Country names would need to be converted to a number to be usable...
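As a rough sketch of what converting country names to numbers can look like, the following plain-Python example assigns each distinct category an integer and then builds one-hot vectors. The sample column and the variable names are illustrative assumptions, and this is not necessarily the encoding method the chapter goes on to use.

```python
# Hypothetical values of a "country" column.
countries = ["India", "Mexico", "France", "U.S.", "Mexico"]

# Assign every distinct category an integer index (sorted for reproducibility).
categories = sorted(set(countries))
index_of = {name: i for i, name in enumerate(categories)}
encoded = [index_of[name] for name in countries]

# One-hot vectors avoid implying an order among the categories.
one_hot = [
    [1 if i == code else 0 for i in range(len(categories))]
    for code in encoded
]

print(categories)
print(encoded)
print(one_hot[0])
```

Integer codes are compact but suggest an ordering that country names do not have; the one-hot form trades size for neutrality.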

Real-valued data and univariate regression

Knowing how to deal with categorical data is very important when using classification models based on deep learning; however, knowing how to prepare data for regression is just as important. Data that contains continuous real values, such as temperature, price, weight, or speed, is suitable for regression. That is, if we have a dataset with columns of different types of values, and one of those columns is real-valued, we could perform regression on that column, using the rest of the dataset to predict its values. This is known as univariate regression, or regression on one variable.
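To make the idea of regressing on one column concrete, here is a minimal sketch that uses synthetic data and ordinary least squares from NumPy as a stand-in for any regression model. The dataset, the column index, and the coefficients are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: two feature columns plus one real-valued target column.
features = rng.normal(size=(100, 2))
target = 3.0 * features[:, 0] - 2.0 * features[:, 1] + 1.0
target = target + 0.01 * rng.normal(size=100)  # small measurement noise
data = np.column_stack([features, target])

# Pick the real-valued column to predict; use all other columns as inputs.
target_col = 2
y = data[:, target_col]
X = np.delete(data, target_col, axis=1)

# Least-squares fit with an intercept term appended as a column of ones.
X1 = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(coef)  # two slopes followed by the intercept
```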

Most machine learning methodologies work better if the data for regression is normalized. By that, we mean that the data will have special statistical properties that will make calculations more stable. This is critical for many deep learning algorithms that suffer from vanishing or exploding gradients (Hanin, B....
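One common normalization, shown here only as an assumed example since the passage above does not spell out a specific method, is standardization: subtract the mean and divide by the standard deviation so that the column ends up with zero mean and unit variance. The sample values are hypothetical.

```python
import numpy as np

# Hypothetical real-valued column, for example temperatures.
values = np.array([12.0, 15.5, 9.8, 20.1, 17.3])

# Standardize: zero mean and unit standard deviation.
standardized = (values - values.mean()) / values.std()

print(standardized.mean(), standardized.std())
```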

Altering the distribution of data

It has been demonstrated that changing the distribution of the targets, particularly in the case of regression, can benefit the performance of a learning algorithm (Andrews, D. F., et al. (1971)).

Here, we'll discuss one particularly useful transformation known as Quantile Transformation. This methodology aims to look at the data and manipulate it in such a way that its histogram follows either a normal distribution or a uniform distribution. It achieves this by looking at estimates of quantiles.

We can use the following commands to transform the same data as in the previous section:

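from sklearn.preprocessing import QuantileTransformer
# df is the DataFrame from the previous section
transformer = QuantileTransformer(output_distribution='normal')
# Fit on columns 4 and 9, then overwrite them with the transformed values
df[[4,9]] = transformer.fit_transform(df[[4,9]])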

This will effectively map the data into a new distribution, namely, a normal distribution.

Here, the term normal distribution refers to a Gaussian-like probability density function...
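For readers without the earlier DataFrame at hand, the same transformation can be tried on synthetic data. The sketch below assumes a right-skewed column drawn from an exponential distribution; the sample size and the n_quantiles value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)

# Synthetic, right-skewed column with 500 samples.
skewed = rng.exponential(scale=1.0, size=(500, 1))

# n_quantiles must not exceed the number of samples; 100 is an arbitrary choice.
transformer = QuantileTransformer(n_quantiles=100, output_distribution='normal')
gaussian_like = transformer.fit_transform(skewed)

# Mean and median far apart indicate skew; after the transform they should be close.
print("before: mean=%.2f median=%.2f" % (skewed.mean(), np.median(skewed)))
print("after:  mean=%.2f median=%.2f" % (gaussian_like.mean(), np.median(gaussian_like)))
```

Passing output_distribution='uniform' instead would flatten the histogram rather than make it bell-shaped.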

Data augmentation

Now that you have learned how to process the data to have specific distributions, it is important for you to know about data augmentation, which is usually associated with missing data or high-dimensional data. Traditional machine learning algorithms may have problems dealing with data where the number of dimensions surpasses the number of samples available. This problem does not affect all deep learning algorithms, but some have a much more difficult time learning to model a problem that has more variables to figure out than samples to work on. We have two options to correct that: either we reduce the dimensions or variables (see the following section) or we increase the samples in our dataset (this section).

One of the tools for adding more data is known as data augmentation (Van Dyk, D. A., and Meng, X. L. (2001)). In this section, we will use the MNIST dataset to exemplify a few techniques for data augmentation that are particular to images but...
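As a small illustration of image augmentation, the sketch below creates two extra samples from one image by adding noise and by shifting. It uses a synthetic 28x28 array in place of an actual MNIST digit, and both techniques are assumptions chosen for illustration, not necessarily the ones the chapter covers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 28x28 image standing in for an MNIST digit: one vertical stroke.
image = np.zeros((28, 28), dtype=np.float32)
image[8:20, 12:16] = 1.0

# Augmentation 1: additive Gaussian noise, clipped back to the [0, 1] range.
noisy = np.clip(image + rng.normal(0.0, 0.1, size=image.shape), 0.0, 1.0)

# Augmentation 2: shift two pixels to the right, filling the gap with zeros.
shifted = np.zeros_like(image)
shifted[:, 2:] = image[:, :-2]

# The original plus two new samples, all sharing the original's label.
augmented = np.stack([image, noisy, shifted])
print(augmented.shape)
```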

Data dimensionality reduction

As pointed out before, if our data has more dimensions (or variables) than samples, we can either augment the data or reduce its dimensionality. Now, we will address the basics of the latter.

We will look into reducing dimensions both in supervised and unsupervised ways with both small and large datasets.

Supervised algorithms

Supervised algorithms for dimensionality reduction are so called because they take the labels of the data into account to find better representations. Such methods often yield good results. Perhaps the most popular kind is called linear discriminant analysis (LDA), which we'll discuss next.

Linear discriminant analysis

Scikit-learn has a LinearDiscriminantAnalysis class that can easily reduce the dimensionality of data to a desired number of components.

Here, the number of components means the number of dimensions desired. The name comes from principal component analysis (PCA), which is...
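A minimal sketch of using this class follows. The three-class synthetic dataset and its dimensions are assumptions made for illustration. Note that scikit-learn requires n_components to be at most the smaller of the number of classes minus one and the number of features.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic data: three classes of 50 samples each, in five dimensions.
centers = np.array([[0, 0, 0, 0, 0],
                    [3, 3, 0, 0, 0],
                    [0, 3, 3, 0, 0]], dtype=float)
X = np.vstack([c + rng.normal(size=(50, 5)) for c in centers])
y = np.repeat([0, 1, 2], 50)

# With three classes, at most two components are available.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)  # supervised: the labels y are required

print(X_reduced.shape)
```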

Ethical implications of manipulating data

There are many ethical implications and risks in manipulating data that you need to know about. We live in a world where most deep learning algorithms will have to be corrected, by re-training them, because it was found that they were biased or unfair. That is very unfortunate; you want to be a person who practices responsible AI and produces carefully thought-out models.

When manipulating data, be careful about removing outliers from the data just because you think they are decreasing your model's performance. Sometimes, outliers represent information about protected groups or minorities, and removing those perpetuates unfairness and introduces bias toward the majority groups. Avoid removing outliers unless you are absolutely sure that they are errors caused by faulty sensors or human error.

Be careful of the way you transform the distribution of the data. Altering the distribution is fine in most cases, but if you are dealing with demographic...

Summary

In this chapter, we discussed many data manipulation techniques that we will come back to again and again. It is good for you to spend time on them now rather than later, because doing so will make our modeling of deep learning architectures easier.

After reading this chapter, you are now able to manipulate and produce binary data for classification or for feature representation. You also know how to deal with categorical data and labels and prepare it for classification or regression. When you have real-valued data, you now know how to identify statistical properties and how to normalize such data. If you ever have the problem of data that has non-normal or non-uniform distributions, now you know how to fix that. And if you ever encounter problems of not having enough data, you learned a few data augmentation techniques. Toward the end of this chapter, you learned some of the most popular dimensionality reduction techniques. You will learn more of these along the road, for example, when...

Questions and answers

  1. Which variables of the heart dataset are suitable for regression?

Actually, all of them. But the ideal ones are those that are real-valued.

  2. Does the scaling of the data change the distribution of the data?

No. Statistical metrics such as the mean and variance may change, but the distribution remains the same.

  3. What is the main difference between supervised and unsupervised dimensionality reduction methods?

Supervised algorithms use the target labels, while unsupervised algorithms do not need that information.

  4. When is it better to use batch-based dimensionality reduction?

When you have very large datasets.

References

  • Cleveland Heart Disease Dataset (1988). Principal investigators:
    a. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
    b. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
    c. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
    d. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
  • Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J.J., Sandhu, S., Guppy, K.H., Lee, S. and Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology, 64(5), 304-310.
  • Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research (best of the web). IEEE Signal Processing Magazine, 29(6), 141-142.
  • Sezgin, M., and Sankur, B. (2004). Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging, 13(1), 146-166.
  • Potdar...