You're reading from Machine Learning for Developers

Product type Book

Published in Oct 2017

Publisher Packt

ISBN-13 9781786469878

Pages 270 pages

Edition 1st Edition

Languages

Python

Concepts

Machine Learning

Authors (2):

Rodolfo Bonnin

Md Mahmudul Hasan

View More author details

The Learning Process

In the first chapter, we saw a general overview of the mathematical concepts, history, and areas of the field of machine learning.

As this book intends to provide a practical but formally correct way of learning, now it's time to explore the general thought process for any machine learning process. These concepts will be pervasive throughout the chapters and will help us to define a common framework of the best practices of the field.

The topics we will cover in this chapter are as follows:

Understanding the problem and definitions
Dataset retrieval, preprocessing, and feature engineering
Model definition, training, and evaluation
Understanding results and metrics

Every machine learning problem tends to have its own particularities. Nevertheless, as the discipline advances through time, there are emerging patterns of what kind of steps a machine learning...

Understanding the problem

When solving machine learning problems, it's important to take time to analyze both the data and the possible amount of work beforehand. This preliminary step is flexible and less formal than all the subsequent ones on this list.

From the definition of machine learning, we know that our final goal is to make the computer learn or generalize a certain behavior or model from a sample set of data. So, the first thing we should do is understand the new capabilities we want to learn.

In the enterprise field, this is the time to have more practical discussions and brainstorms. The main questions we could ask ourselves during this phase could be as follows:

What is the real problem we are trying to solve?
What is the current information pipeline?
How can I streamline data acquisition?
Is the incoming data complete, or does it have gaps?
What additional...

Dataset definition and retrieval

Once we have identified the data sources, the next task is to gather all the tuples or records as a homogeneous set. The format can be a tabular arrangement, a series of real values (such as audio or weather variables), and N-dimensional matrices (a set of images or cloud points), among other types.

The ETL process

The previous stages in the big data processing field evolved over several decades under the name of data mining, and then adopted the popular name of big data.

One of the best outcomes of these disciplines is the specification of the Extraction, Transform, Load (ETL) process.

This process starts with a mix of many data sources from business systems, then moves to a system that transforms...

Feature engineering

Feature engineering is in some ways one of the most underrated parts of the machine learning process, even though it is considered the cornerstone of the learning process by many prominent figures of the community.

What's the purpose of this process? In short, it takes the raw data from databases, sensors, archives, and so on, and transforms it in a way that makes it easy for the model to generalize. This discipline takes criteria from many sources, including common sense. It's indeed more like an art than a rigid science. It is a manual process, even when some parts of it can be automatized via a group of techniques grouped in the feature extraction field.

As part of this process we also have many powerful mathematical tools and dimensionality reduction techniques, such as Principal Component Analysis (PCA) and Autoencoders, that allow data scientists...

Dataset preprocessing

When we first dive into data science, a common mistake is expecting all the data to be very polished and with good characteristics from the very beginning. Alas, that is not the case for a very considerable percentage of cases, for many reasons such as null data, sensor errors that cause outliers and NAN, faulty registers, instrument-induced bias, and all kinds of defects that lead to poor model fitting and that must be eradicated.

The two key processes in this stage are data normalization and feature scaling. This process consists of applying simple transformations called affine that map the current unbalanced data into a more manageable shape, maintaining its integrity but providing better stochastic properties and improving the future applied model. The common goal of the standardization techniques is to bring the data distribution closer to a normal distribution...

Model definition

If we wanted to summarize the machine learning process using just one word, it would certainly be models. This is because what we build with machine learning are abstractions or models representing and simplifying reality, allowing us to solve real-life problems based on a model that we have trained on.

The task of choosing which model to use is becoming increasingly difficult, given the increasing number of models appearing almost every day, but you can make general approximations by grouping methods by the type of task you want to perform and also the type of input data, so that the problem is simplified to a smaller set of options.

Asking ourselves the right questions

At the risk of generalizing too much...

Loss function definition

This machine learning process step is also very important because it provides a distinctive measure of the quality of your model, and if wrongly chosen, it could either ruin the accuracy of the model or its efficiency in the speed of convergence.

Expressed in a simple way, the loss function is a function that measures the distance from the model's estimated value to the real expected value.

An important fact that we have to take into account is that the objective of almost all of the models is to minimize the error function, and for this, we need it to be differentiable, and the derivative of the error function should be as simple as possible.

Another fact is that when the model gets increasingly complex, the derivative of the error will also get more complex, so we will need to approximate solutions for the derivatives with iterative methods...

Model fitting and evaluation

In this part of the machine learning process, we have the model and data ready, and we proceed to train and validate our model.

Dataset partitioning

At the time of training the models, we usually partition all the provided data into three sets: the training set, which will actually be used to adjust the parameters of the models; the validation set, which will be used to compare alternative models applied to that data (it can be ignored if we have just one model and architecture in mind); and the test set, which will be used to measure the accuracy of the chosen model. The proportions of these partitions are normally 70/20/10.

...

Model implementation and results interpretation

No model is practical if it can't be used outside the training and test sets. This is when the model is deployed into production.

In this stage, we normally load all the model's operation and trained weights, wait for new unknown data, and when it arrives, we feed it through all the chained functions of the model, informing the outcomes of the output layer or operation via a web service, printing to standard output, and so on.

Then, we will have a final task - to interpret the results of the model in the real world to constantly check whether it works in the current conditions. In the case of generative models, the suitability of the predictions is easier to understand because the goal is normally the representation of a previously known entity.

...

Summary

In this chapter, we reviewed all the main steps involved in a machine learning process. We will be, indirectly, using them throughout the book, and we hope they help you structure your future work too.

In the next chapter, we will review the programming languages and frameworks that we will be using to solve all our machine learning problems and become proficient with them before starting with the projects.

References

Lichman, M. (2013). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.
Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
Townsend, James T. Theoretical analysis of an alphabetic confusion matrix. Attention, Perception, & Psychophysics 9.1 (1971): 40-50.
Peter J. Rousseeuw (1987). Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Computational and Applied Mathematics 20: 53-65.
Kent, Allen, et al, Machine literature searching VIII. Operational criteria for designing information retrieval systems. Journal of the Association for Information Science and Technology 6.2 (1955): 93...