You're reading from Regression Analysis with R
Giuseppe Ciaburro · Packt · 1st Edition · January 2018 · ISBN-13: 9781788627306 · Reading level: Intermediate

Author: Giuseppe Ciaburro

Giuseppe Ciaburro holds a PhD and two master's degrees. He works at the Built Environment Control Laboratory - Università degli Studi della Campania "Luigi Vanvitelli". He has over 25 years of work experience in programming, first in the field of combustion and then in acoustics and noise control. His core programming knowledge is in MATLAB, Python and R. As an expert in AI applications to acoustics and noise control problems, Giuseppe has wide experience in researching and teaching. He has several publications to his credit: monographs, scientific journals, and thematic conferences. He was recently included in the world's top 2% scientists list by Stanford University (2022).

Chapter 6. Avoiding Overfitting Problems - Achieving Generalization

In the previous chapters, we emphasized the importance of the training phase for successful modeling. During training, the model is developed to capture a chosen level of detail in the data: the more detail it captures, the better it predicts the data it was trained on. So far, so good. Problems arise when we use that model to make predictions on data it has never seen. The risk we run is that we push precision on the details so far that we lose the ability to generalize.

Let's consider a practical example: suppose we build a face recognition model. Since every pixel of one image can be compared with the corresponding pixel of another, minor details can become overwhelming: hair, background, shirt color, and so on. The number of details the model can latch onto is so large that it can identify individual images...

Understanding overfitting


In general, overfitting occurs when a very complex statistical model fits the observed data too closely because it has too many parameters relative to the number of observations. The risk is that an incorrect model can fit the data perfectly simply because it is complex enough relative to the amount of data available, although overfitting can also occur when the amount of data is adequate. Consequently, when the model is used to predict new observations, a problem arises: it is unable to generalize.

The concept of overfitting is also very important in regression analysis. Usually, a learning algorithm is trained on a set of examples (the training set) whose outputs are already known. The assumption is that the algorithm will reach a state in which it can predict the outputs for examples it has not yet seen; in other words, that the model will be able to generalize.
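
To see this in practice, here is a minimal sketch in R on synthetic data (not taken from the book): a 15th-degree polynomial beats the true linear model on the training set but loses to it on unseen data.

# Overfitting demo: the true relationship is linear, y = 2 + 0.5x + noise
set.seed(1)
train <- data.frame(x = runif(30, 0, 10))
train$y <- 2 + 0.5 * train$x + rnorm(30)
test <- data.frame(x = runif(100, 0, 10))
test$y <- 2 + 0.5 * test$x + rnorm(100)

fit_simple  <- lm(y ~ x, data = train)            # matches the true model
fit_complex <- lm(y ~ poly(x, 15), data = train)  # far too many parameters

# Root mean squared error of a fitted model on a given dataset
rmse <- function(fit, data) sqrt(mean((data$y - predict(fit, data))^2))

c(train = rmse(fit_simple, train),  test = rmse(fit_simple, test))
c(train = rmse(fit_complex, train), test = rmse(fit_complex, test))
# The complex model shows the lower training error but the higher test
# error: it has memorized the noise rather than learned the trend.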

However, especially in cases where there is a small number of...

Feature selection


In general, when we work with high-dimensional datasets, it is a good idea to keep only the most useful features and discard the rest. This can lead to simpler models that generalize better. Feature selection is the process of reducing the inputs used for processing and analysis, or of identifying the most significant features among them. This selection is needed to build a workable model: it reduces the cardinality of the input space by imposing an upper limit on the number of features considered during model creation. A general scheme of the feature selection process is shown in the following figure.

Usually, data contains redundant or unnecessary information; in other cases, it may contain incorrect information. Feature selection makes the process of creating a model more efficient, for example, by decreasing the load on the CPU and the memory needed to train the algorithm. Moreover, selection of features...
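
As a minimal sketch of the idea in R (assuming the built-in mtcars data, which the book does not necessarily use), base R's step() function performs the backward stepwise selection referred to in the next section, dropping predictors one at a time as long as doing so improves the AIC:

# Backward stepwise selection on the built-in mtcars data
full_model <- lm(mpg ~ ., data = mtcars)   # start from all predictors

# step() removes one predictor at a time, keeping the removal that most
# improves the AIC, and stops when no single removal helps
reduced_model <- step(full_model, direction = "backward", trace = 0)

summary(reduced_model)   # only the most useful features remain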

Regularization


As an alternative to the selection methods discussed in the previous sections (forward, backward, stepwise), it is possible to adopt methods that use all the predictors but constrain or adjust their coefficients, shrinking them towards very small or zero values (shrinkage). These methods effectively perform automatic feature selection, as they improve generalization. They are called regularization methods and work by modifying the performance function, normally chosen as the sum of the squared regression errors on the training set.

When a large number of variables is available, the least squares estimates of a linear model often have low bias but high variance compared with models that use fewer variables. Under these conditions, as we have seen in the previous sections, an overfitting problem arises. To improve prediction accuracy by accepting a little more bias in exchange for much lower variance, we can use variable selection methods and dimensionality reduction, but these methods may be unattractive...
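
As a hedged sketch of shrinkage in practice (the glmnet package and the mtcars data are this example's assumptions, not necessarily the book's choices), ridge regression and the lasso penalize the sum of squared errors with a term that pulls coefficients towards zero:

library(glmnet)   # implements ridge regression (alpha = 0) and lasso (alpha = 1)

x <- as.matrix(mtcars[, -1])   # all predictors except the response
y <- mtcars$mpg

# Cross-validation chooses the penalty strength lambda automatically
cv_ridge <- cv.glmnet(x, y, alpha = 0)
cv_lasso <- cv.glmnet(x, y, alpha = 1)

coef(cv_ridge, s = "lambda.min")   # all coefficients shrunk, none exactly zero
coef(cv_lasso, s = "lambda.min")   # some coefficients set exactly to zero

The lasso's ability to set coefficients exactly to zero is what makes regularization behave like the automatic feature selection described above.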

Summary


In this chapter, we learned how to achieve generalization for our models. We explored several techniques for avoiding overfitting and creating models with low bias and low variance. At the beginning, the differences between overfitting and underfitting were explained.

In general, overfitting occurs when a very complex statistical model fits the observed data too closely because it has too many parameters relative to the number of observations. The risk is that an incorrect model can fit the data perfectly simply because it is complex enough relative to the amount of data available. Consequently, when the model is used to predict new observations, it fails because it is unable to generalize. On the contrary, underfitting occurs when a regression algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to nonlinear data; such a model would have poor predictive performance.

We then discovered the cross-validation procedure through...
