Machine Learning Basics
Welcome to Hands-On Deep Learning with R! This book will take you through all of the steps necessary to code deep learning models using the R statistical programming language. It begins with simple examples as a first step for those just getting started, along with a review of the foundational elements of deep learning for those with more experience. As you progress through this book, you will learn how to code increasingly complex deep learning solutions for a wide variety of tasks. However, regardless of the complexity, each chapter carefully details every step, so that all topics and concepts can be fully understood and the reasoning behind every line of code is completely explained.
In this chapter, we will go through a quick overview of the machine learning process, as it will form a base for the subsequent chapters of this book...
An overview of machine learning
All deep learning is machine learning, but not all machine learning is deep learning. Throughout this book, we will focus on processes and techniques that are specific to deep learning in R. However, all the core principles of machine learning are essential to understand before we can move on to explore deep learning.
Deep learning is a special subset of machine learning based on the use of neural networks that loosely mimic the activity of the brain. The learning is referred to as deep because, during the modeling process, the data passes through a number of hidden layers. In this type of modeling, specific information is gathered at each layer. For example, one layer may find the edges of images while another finds particular hues.
Notable applications for this type of machine learning include the following:
- Image recognition...
Preparing data for modeling
One of the benefits of deep learning is that it largely removes the need for feature engineering, which you may be used to with machine learning. That being said, the data still needs to be prepared before we begin modeling. Let's review the following goals to prepare data for modeling:
- Remove no-information and extremely low-information variables
- Identify dates and extract date parts
- Handle missing values
- Handle outliers
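As a minimal sketch of these four steps (using a small made-up data frame with illustrative column names, not the London air quality data itself), the preparation might look like the following in base R:

```r
# Hypothetical example data; column names and values are illustrative only
df <- data.frame(
  reading_date = c("2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"),
  no2          = c(41.2, NA, 38.7, 400),  # contains a missing value and an outlier
  site         = rep("Mile End", 4),      # zero variance: no information
  stringsAsFactors = FALSE
)

# 1. Remove no-information variables (only one unique value)
df <- df[, sapply(df, function(x) length(unique(x)) > 1), drop = FALSE]

# 2. Identify dates and extract date parts
df$reading_date <- as.Date(df$reading_date)
df$month <- as.integer(format(df$reading_date, "%m"))
df$day   <- as.integer(format(df$reading_date, "%d"))

# 3. Handle missing values (median imputation is one simple choice)
df$no2[is.na(df$no2)] <- median(df$no2, na.rm = TRUE)

# 4. Handle outliers (capping at the 99th percentile is one simple choice)
cap <- quantile(df$no2, 0.99)
df$no2[df$no2 > cap] <- cap
```

The particular choices here (median imputation, percentile capping) are only two of many reasonable options; the right choice depends on the data and the model.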
In this chapter, we will be investigating air quality data using data provided by the London Air Quality Network. Specifically, we will look at readings for nitrogen dioxide in the area of Tower Hamlets (Mile End Road) during 2018. This is a very small dataset with only a few features and approximately 35,000 observations. We are using a limited dataset here so that all of our code, even our modeling, runs quickly. That said...
Training a model on prepared data
Now that the data is ready, we will split it into train and test sets and run a simple model. The objective at this point is not to achieve the best performance, but rather to get a benchmark result that we can use later as we try to improve our model.
Train and test data
When we build predictive models, we need to split the data into two separate sets. One is used by the model to learn the task, and the other is used to test how well the model learned it. Here are the types of data that we will look at:
- Train data: The segment of the data used to fit the model. The model has access to the explanatory variables, or independent variables...
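A simple random split can be done in base R. This sketch uses a made-up data frame of 1,000 rows standing in for the real dataset, with an assumed 80/20 split:

```r
set.seed(42)  # for reproducibility

# Hypothetical data standing in for the prepared dataset
df <- data.frame(x = rnorm(1000), y = rnorm(1000))

# Hold out 20% of the rows for testing; train on the remaining 80%
test_idx <- sample(seq_len(nrow(df)), size = 0.2 * nrow(df))
test  <- df[test_idx, ]
train <- df[-test_idx, ]
```

Setting the seed makes the split reproducible, so that results can be compared across runs.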
Evaluating model results
We only know whether a model is successful if we can measure it, and it is worthwhile taking a moment to remember which metrics to use in which scenarios. Take, for example, a credit card fraud dataset, where there is a large imbalance in the target variable because there are relatively few cases of fraud among many non-fraudulent cases.
If we use a metric that just measures the percentage of the target variable that we predict successfully, then we will not be evaluating our model in a very helpful way. In this case, to keep the math simple, let's imagine we have 10,000 cases and only 10 of them are fraudulent accounts. If we predict that all cases are not fraudulent, then we will have 99.9% accuracy. This is very accurate, but it is not very helpful. Here is a review of the different metrics and when to use them.
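The arithmetic from the fraud example can be checked directly. This sketch also computes recall, which exposes what accuracy hides:

```r
n_total <- 10000
n_fraud <- 10

# Ground-truth labels: 1 = fraud, 0 = not fraud
actual    <- c(rep(1, n_fraud), rep(0, n_total - n_fraud))
# A naive model that predicts "not fraud" for every single case
predicted <- rep(0, n_total)

# Accuracy: proportion of cases predicted correctly
accuracy <- mean(predicted == actual)
accuracy  # 0.999 -- 99.9% accurate while catching zero fraud cases

# Recall (sensitivity): proportion of actual fraud cases caught
recall <- sum(predicted == 1 & actual == 1) / sum(actual == 1)
recall  # 0
```

A metric such as recall, precision, or the area under the ROC curve gives a far more honest picture than accuracy when the classes are this imbalanced.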
Improving model results
Since we have a regression problem, we know why we chose RMSE, and with a baseline metric of performance in hand, we can begin to work on improving our model. Every model will have its own way of improving results; however, we can generalize slightly. Feature engineering helps to improve model performance; however, since this type of work is less important with deep learning, we will not focus on it here. Also, we have already used feature engineering to generate our date and time parts. In addition, we can run our model for longer at a slower learning rate, and we can tune hyperparameters. In order to find the best values using this type of model improvement method, we will use a technique called grid search to look at a range of values for a number of different fields.
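To make the idea of grid search concrete, here is a sketch with synthetic data and a plain linear model (not the actual model used in this chapter): we define RMSE, try each candidate value of a single hyperparameter, and keep the one with the lowest test error.

```r
set.seed(42)

# Synthetic regression data standing in for the air quality readings
x  <- runif(200, 0, 10)
y  <- sin(x) + rnorm(200, sd = 0.3)
df <- data.frame(x = x, y = y)

# Simple holdout split
test_idx <- sample(seq_len(nrow(df)), 50)
train <- df[-test_idx, ]
test  <- df[test_idx, ]

# Root mean squared error: the metric we are trying to minimize
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Grid search over one hyperparameter: polynomial degree
grid <- data.frame(degree = 1:6, rmse = NA_real_)
for (i in seq_len(nrow(grid))) {
  fit <- lm(y ~ poly(x, grid$degree[i]), data = train)
  grid$rmse[i] <- rmse(test$y, predict(fit, newdata = test))
}

grid[which.min(grid$rmse), ]  # the degree with the lowest test RMSE
```

A real grid search would typically cross a range of values for several hyperparameters at once (learning rate, depth, number of rounds, and so on) and use cross-validation rather than a single holdout split, but the loop structure is the same.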
Let's search for the optimal number of rounds. Using the cross-validation...
Reviewing different algorithms
We have raced through machine learning relatively quickly, as we wanted to focus on the underlying concepts that will follow along with us as we head into deep learning. As such, we cannot offer a comprehensive explanation of all machine learning techniques; however, we will quickly review the different algorithm types here, as this will be helpful to remember going forward.
We'll do a quick review of the following machine learning algorithms:
- Decision Trees: A decision tree is a simple model that makes up the base learners of many more complex algorithms. A decision tree simply splits a dataset at a given variable and notes the proportion of the target class that exists in the splits. For example, if we were to predict who is more likely to enjoy playing with baby toys, then a split on age would likely show that the split of the data...
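The baby-toy example above can be fit as a minimal decision tree with the `rpart` package (this toy dataset is invented purely for illustration):

```r
library(rpart)

# Invented toy dataset: does age predict enjoying baby toys?
toys <- data.frame(
  age        = c(1, 2, 3, 2, 25, 30, 40, 35, 1, 28),
  enjoys_toy = factor(c("yes", "yes", "yes", "yes", "no",
                        "no", "no", "no", "yes", "no"))
)

# Fit a single decision tree; with data this clean, it splits on age
fit <- rpart(enjoys_toy ~ age, data = toys,
             control = rpart.control(minsplit = 2, cp = 0.01))

# A young age should fall in the "yes" branch of the split
predict(fit, newdata = data.frame(age = 2), type = "class")
```

Ensemble methods such as random forests and gradient boosting combine many trees like this one, which is why the decision tree is worth understanding as a base learner.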
Summary
In this chapter, we started with a raw dataset, explored the data, and took the necessary preprocessing steps to get the data ready for modeling. We performed data type transformations to convert numbers and dates stored as character strings into numeric and date value columns, respectively. In addition, we performed some feature engineering by breaking up the date value into its component parts. After completing preprocessing, we modeled our data. We followed an approach that included creating a baseline model and then tuning hyperparameters to improve our initial score. We used early stopping rounds and grid searches to identify hyperparameter values that produced the best results. After modifying our model based on the results of our tuning procedures, we noticed much better performance.
All of the aspects of machine learning that were discussed in this chapter will...