You're reading from Deep Learning for Beginners

Product type: Book

Published in: Sep 2020

Reading level: Beginner

Publisher: Packt

ISBN-13: 9781838640859

Edition: 1st Edition

Languages: Python

Tools: Keras

Concepts: Deep Learning

Author (1): Dr. Pablo Rivas

Preparing Data

Now that you have successfully prepared your system to learn about deep learning (see Chapter 2, Setup and Introduction to Deep Learning Frameworks), we will proceed to give you important guidelines about data that you may encounter frequently when practicing deep learning. When it comes to learning about deep learning, having well-prepared datasets will help you to focus more on designing your models rather than preparing your data. However, this is not a realistic expectation: if you ask any data scientist or machine learning professional, they will tell you that an important aspect of modeling is knowing how to prepare your data. Knowing how to deal with your data and how to prepare it will save you many hours of work that you can spend fine-tuning your models. Any time spent preparing your data is time well invested.

This...

Binary data and binary classification

In this section, we will focus all our efforts on preparing data with binary inputs or targets. By binary, of course, we mean values that can be represented as either 0 or 1. Notice the emphasis on the words "represented as". The reason is that a column may contain data that is not necessarily a 0 or a 1, but that could be interpreted as, or represented by, a 0 or a 1.

Consider the following fragment of a dataset:

x₁	x₂	...	y
0	5	...	a
1	7	...	a
1	5	...	b
0	7	...	b

In this short dataset example with only four rows, the column x₁ has values that are clearly binary: each is either 0 or 1. However, x₂, at first glance, may not be perceived as binary, but if you pay close attention, the only values in that column are 5 and 7. This means that the data can be correctly and uniquely mapped to a set of two values. Therefore, we could map 5 to 0 and 7 to 1, or vice versa; it does not really matter.
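As a minimal sketch of the mapping just described, assuming the four-row fragment is held in plain Python lists: the names `x2`, `y`, and `binarize` are illustrative and do not come from the book. The helper assigns 0 to the smaller of the two distinct values, which is only one of the two equally valid choices.

```python
def binarize(values):
    """Map a column with exactly two distinct values onto 0 and 1."""
    distinct = sorted(set(values))
    if len(distinct) != 2:
        raise ValueError(f"expected exactly 2 distinct values, got {len(distinct)}")
    # Arbitrary but consistent choice: smaller value -> 0, larger value -> 1
    mapping = {distinct[0]: 0, distinct[1]: 1}
    return [mapping[v] for v in values], mapping

# Columns x2 and y from the dataset fragment above
x2 = [5, 7, 5, 7]
y = ["a", "a", "b", "b"]

x2_bin, x2_map = binarize(x2)
y_bin, y_map = binarize(y)
print(x2_bin, x2_map)
print(y_bin, y_map)
```

Returning the mapping alongside the encoded column makes it possible to apply the same encoding to new data, or to translate predictions back to the original values.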

A similar...

Categorical data and multiple classes

Now that you know how to binarize data for different purposes, we can look into other types of data, such as categorical or multi-labeled data, and how to make them numeric. Most advanced deep learning algorithms, in fact, only accept numerical data. This is merely a design constraint, and it is not a big deal because, as you will learn, there are easy ways to take categorical data and convert it to a meaningful numerical representation.

Categorical data has information embedded as distinct categories. These categories can be represented as numbers or as strings. For example, a dataset may have a column named country with items such as "India", "Mexico", "France", and "U.S", or a column of zip codes such as 12601, 85621, and 73315. The former is non-numeric categorical data, and the latter is numeric categorical data. Country names would need to be converted to a number to be usable...
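One possible way to make such a column numeric is sketched below with NumPy alone. This is an illustration under assumptions, not necessarily the approach the book goes on to use: the `countries` list is made-up data, and integer codes followed by one-hot vectors are just two common representations.

```python
import numpy as np

# Hypothetical values of a "country" column
countries = ["India", "Mexico", "France", "U.S", "Mexico"]

# np.unique sorts the distinct categories and, with return_inverse=True,
# also returns the integer code of every item in the column.
categories, codes = np.unique(countries, return_inverse=True)

# One-hot encoding: one column per category, a single 1 in each row.
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(codes)
print(one_hot)
```

The same treatment applies to numeric categorical data such as zip codes: the digits are labels, so arithmetic on them is meaningless and they are better encoded as categories than fed in as magnitudes.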

Real-valued data and univariate regression

Knowing how to deal with categorical data is very important when using classification models based on deep learning; however, knowing how to prepare data for regression is just as important. Data that contains continuous-like real values, such as temperature, prices, weight, and speed, is suitable for regression; that is, if we have a dataset with columns of different types of values, and one of those is real-valued data, we could perform regression on that column. This implies that we could use the rest of the dataset to predict the values in that column. This is known as univariate regression, or regression on one variable.

Most machine learning methodologies work better if the data for regression is normalized. By that, we mean that the data will have special statistical properties that will make calculations more stable. This is critical for many deep learning algorithms that suffer from vanishing or exploding gradients (Hanin, B....
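To make "special statistical properties" concrete, the sketch below standardizes a column to zero mean and unit variance. Standardization is assumed here as one common form of normalization; the data is synthetic and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical real-valued column, e.g. temperatures in degrees Celsius
x = rng.normal(loc=25.0, scale=4.0, size=500)

# Standardize: subtract the mean, divide by the standard deviation
mean, std = x.mean(), x.std()
z = (x - mean) / std

# Keep mean and std: they are needed to transform new data the same way
# and to map model outputs back to the original units.
x_restored = z * std + mean

print(f"before: mean={x.mean():.2f}, std={x.std():.2f}")
print(f"after:  mean={z.mean():.2f}, std={z.std():.2f}")
```

In practice, the mean and standard deviation would be computed on the training split only and then reused for validation and test data.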

Altering the distribution of data

It has been demonstrated that changing the distribution of the targets, particularly in the case of regression, can improve the performance of a learning algorithm (Andrews, D. F., et al. (1971)).

Here, we'll discuss one particularly useful transformation known as Quantile Transformation. This methodology aims to look at the data and manipulate it in such a way that its histogram follows either a normal distribution or a uniform distribution. It achieves this by looking at estimates of quantiles.

We can use the following commands to transform the same data as in the previous section:

from sklearn.preprocessing import QuantileTransformer

# df is the DataFrame from the previous section; columns 4 and 9 are transformed in place
transformer = QuantileTransformer(output_distribution='normal')
df[[4, 9]] = transformer.fit_transform(df[[4, 9]])

This will effectively map the data into a new distribution, namely, a normal distribution.
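Because `df` comes from an earlier section, here is a self-contained sketch of the same idea on synthetic data. The exponential sample and the `skewness` helper are assumptions made for illustration; only `QuantileTransformer` and its `output_distribution` parameter come from the text above.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature, standing in for a real column.
# 1000 samples matches the default n_quantiles=1000 of QuantileTransformer.
x = rng.exponential(scale=2.0, size=(1000, 1))

transformer = QuantileTransformer(output_distribution='normal', random_state=0)
x_normal = transformer.fit_transform(x)

def skewness(a):
    """Sample skewness: close to 0 for symmetric data."""
    a = a.ravel()
    return float(np.mean(((a - a.mean()) / a.std()) ** 3))

print(f"skewness before: {skewness(x):.2f}")
print(f"skewness after:  {skewness(x_normal):.2f}")
```

The skewness of the transformed column should be close to zero, which is one quick numerical sign that the histogram has become symmetric and bell-shaped.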

Here, the term normal distribution refers to a Gaussian-like probability density function...

Data augmentation

Now that you have learned how to process the data to have specific distributions, it is important for you to know about data augmentation, which is usually associated with missing data or high-dimensional data. Traditional machine learning algorithms may have problems dealing with data where the number of dimensions surpasses the number of samples available. The problem does not affect all deep learning algorithms, but some algorithms have a much more difficult time learning to model a problem that has more variables to figure out than samples to work on. We have a few options to correct that: either we reduce the dimensions or variables (see the following section) or we increase the samples in our dataset (this section).

One of the tools for adding more data is known as data augmentation (Van Dyk, D. A., and Meng, X. L. (2001)). In this section, we will use the MNIST dataset to exemplify a few techniques for data augmentation that are particular to images but...
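The sketch below illustrates the general idea on a single image. To stay self-contained, it uses a synthetic 28x28 array as a stand-in for an MNIST digit, and the two transformations shown (pixel noise and a small shift) are assumed examples rather than the specific techniques the chapter covers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one MNIST digit: a 28x28 grayscale image with values in [0, 1].
# (A synthetic vertical bar, so that no dataset download is needed.)
image = np.zeros((28, 28))
image[6:22, 12:16] = 1.0

def add_noise(img, sigma=0.1):
    """Return a copy with Gaussian pixel noise, clipped back to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def shift_right(img, pixels=2):
    """Return a copy shifted right by `pixels` (>= 1), zero-padded on the left."""
    shifted = np.zeros_like(img)
    shifted[:, pixels:] = img[:, :-pixels]
    return shifted

# One original sample becomes three slightly different samples
augmented = np.stack([image, add_noise(image), shift_right(image)])
print(augmented.shape)
```

Each augmented copy keeps the label of the original image, so the transformations should be mild enough that the label remains correct.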

Data dimensionality reduction

As pointed out before, if we have the problem of having more dimensions (or variables) than samples in our data, we can either augment the data or reduce the dimensionality of the data. Now, we will address the basics of the latter.

We will look into reducing dimensions both in supervised and unsupervised ways with both small and large datasets.

Supervised algorithms

Supervised algorithms for dimensionality reduction are so called because they take the labels of the data into account to find better representations. Such methods often yield good results. Perhaps the most popular kind is called linear discriminant analysis (LDA), which we'll discuss next.

Linear discriminant analysis

Scikit-learn has a LinearDiscriminantAnalysis class that can easily perform dimensionality reduction to a desired number of components.

By number of components, we mean the number of dimensions desired. The name comes from principal component analysis (PCA), which is...
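A minimal usage sketch of the class follows, on synthetic data. The three-class dataset and its dimensions are assumptions chosen for illustration; the class name and the notion of a number of components come from the text above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical dataset: 3 classes, 50 samples each, 10 features
centers = rng.normal(0.0, 3.0, size=(3, 10))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(50, 10)) for c in centers])
y = np.repeat([0, 1, 2], 50)

# LDA allows at most min(n_classes - 1, n_features) components: here, 2
lda = LinearDiscriminantAnalysis(n_components=2)

# Being supervised, LDA needs the labels y as well as the data X
X_reduced = lda.fit_transform(X, y)

print(X.shape, '->', X_reduced.shape)
```

Note that the upper bound on `n_components` depends on the number of classes, so a binary problem can only be reduced to a single dimension with LDA.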

Ethical implications of manipulating data

There are many ethical implications and risks in manipulating data that you need to know about. We live in a world where most deep learning algorithms will have to be corrected, by re-training them, because it was found that they were biased or unfair. That is very unfortunate; you want to be a person who exercises responsible AI and produces carefully thought-out models.

When manipulating data, be careful about removing outliers from the data just because you think they are decreasing your model's performance. Sometimes, outliers represent information about protected groups or minorities, and removing those perpetuates unfairness and introduces bias toward the majority groups. Avoid removing outliers unless you are absolutely sure that they are errors caused by faulty sensors or human error.

Be careful of the way you transform the distribution of the data. Altering the distribution is fine in most cases, but if you are dealing with demographic...

Summary

In this chapter, we discussed many data manipulation techniques that we will come back to use all the time. It is good for you to spend time doing this now rather than later. It will make our modeling of deep learning architectures easier.

After reading this chapter, you are now able to manipulate and produce binary data for classification or for feature representation. You also know how to deal with categorical data and labels and prepare them for classification or regression. When you have real-valued data, you now know how to identify statistical properties and how to normalize such data. If you ever have the problem of data that has non-normal or non-uniform distributions, now you know how to fix that. And if you ever encounter problems of not having enough data, you have learned a few data augmentation techniques. Toward the end of this chapter, you learned some of the most popular dimensionality reduction techniques. You will learn more of these along the road, for example, when...

Questions and answers

Which variables of the heart dataset are suitable for regression?

Actually, all of them. But the ideal ones are those that are real-valued.

Does the scaling of the data change the distribution of the data?

No. Scaling is a linear transformation, so the shape of the distribution remains the same. Statistical metrics such as the mean and variance may change, but the shape does not.

What is the main difference between supervised and unsupervised dimensionality reduction methods?

Supervised algorithms use the target labels, while unsupervised algorithms do not need that information.

When is it better to use batch-based dimensionality reduction?

When you have very large datasets.
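The answer to the second question (scaling does not change the distribution) can be checked numerically. The sketch below is an illustration on synthetic data, with skewness assumed as a convenient descriptor of shape: standardizing a skewed sample changes its mean and variance but leaves its skewness untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed sample
x = rng.exponential(scale=2.0, size=1000)

def skewness(a):
    """Sample skewness, a descriptor of the shape of a distribution."""
    return float(np.mean(((a - a.mean()) / a.std()) ** 3))

# Standardization is a linear rescaling of the data
z = (x - x.mean()) / x.std()

print(f"mean:     {x.mean():.3f} -> {z.mean():.3f}")
print(f"variance: {x.var():.3f} -> {z.var():.3f}")
print(f"skewness: {skewness(x):.3f} -> {skewness(z):.3f}")
```

This is the key difference from the quantile transformation discussed earlier, which is non-linear and deliberately reshapes the histogram.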

References

Cleveland Heart Disease Dataset (1988). Principal investigators:
a. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
b. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
c. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
d. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J.J., Sandhu, S., Guppy, K.H., Lee, S. and Froelicher, V., (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. The American journal of cardiology, 64(5), 304-310.
Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research (best of the web). IEEE Signal Processing Magazine, 29(6), 141-142.
Sezgin, M., and Sankur, B. (2004). Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic imaging, 13(1), 146-166.

Potdar...

Author (1)

Dr. Pablo Rivas

Dr. Pablo Rivas is an assistant professor of computer science at Baylor University in Texas. He worked in industry for a decade as a software engineer before becoming an academic. He is a senior member of the IEEE, ACM, and SIAM. He was formerly at NASA Goddard Space Flight Center performing research. He is an ally of women in technology, a deep learning evangelist, machine learning ethicist, and a proponent of the democratization of machine learning and artificial intelligence in general. He teaches machine learning and deep learning. Dr. Rivas is a published author and all his papers are related to machine learning, computer vision, and machine learning ethics. Dr. Rivas prefers Vim to Emacs and spaces to tabs.