You're reading from Regression Analysis with R
Giuseppe Ciaburro · Packt · 1st Edition · January 2018 · ISBN-13: 9781788627306 · Reading level: Intermediate

Author: Giuseppe Ciaburro

Giuseppe Ciaburro holds a PhD and two master's degrees. He works at the Built Environment Control Laboratory - Università degli Studi della Campania "Luigi Vanvitelli". He has over 25 years of work experience in programming, first in the field of combustion and then in acoustics and noise control. His core programming knowledge is in MATLAB, Python and R. As an expert in AI applications to acoustics and noise control problems, Giuseppe has wide experience in researching and teaching. He has several publications to his credit: monographs, scientific journals, and thematic conferences. He was recently included in the world's top 2% scientists list by Stanford University (2022).

Chapter 6. Avoiding Overfitting Problems - Achieving Generalization

In the previous chapters, we emphasized the importance of the training phase for successful modeling. During training, the model is developed to capture a chosen level of detail in the data: the more detail it captures, the better it predicts the data it was trained on. So far, so good. Problems arise when we use that model to make predictions on data it has never seen. The risk we run is that we push precision on the details so far that we lose the ability to generalize.

Let's consider a practical example: suppose we build a face recognition model. Since every pixel of one image can be compared with the corresponding pixel of another, minor details can become overwhelming: hair, background, shirt color, and so on. The number of details the model can latch onto is so large that it can identify individual images...

Understanding overfitting


In general, overfitting occurs when a very complex statistical model fits the observed data too closely because it has too many parameters relative to the number of observations. The risk is that an incorrect model can fit the data perfectly simply because it is complex enough relative to the amount of data available, although overfitting can also occur when the amount of data is adequate. Consequently, when the model is used to predict new observations, a problem arises: it is unable to generalize.

The concept of overfitting is also very important in regression analysis. Usually, a learning algorithm is trained on a set of examples (the training set) whose outputs are already known. The assumption is that the algorithm will reach a state in which it can predict the outputs for examples it has not yet seen; in other words, that the model will be able to generalize.
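
To see this in practice, here is a minimal sketch in R on synthetic data (not taken from the book): a 15th-degree polynomial beats the true linear model on the training set but loses to it on unseen data.

# Overfitting demo: the true relationship is linear, y = 2 + 0.5x + noise
set.seed(1)
train <- data.frame(x = runif(30, 0, 10))
train$y <- 2 + 0.5 * train$x + rnorm(30)
test <- data.frame(x = runif(100, 0, 10))
test$y <- 2 + 0.5 * test$x + rnorm(100)

fit_simple  <- lm(y ~ x, data = train)            # matches the true model
fit_complex <- lm(y ~ poly(x, 15), data = train)  # far too many parameters

# Root mean squared error of a fitted model on a given dataset
rmse <- function(fit, data) sqrt(mean((data$y - predict(fit, data))^2))

c(train = rmse(fit_simple, train),  test = rmse(fit_simple, test))
c(train = rmse(fit_complex, train), test = rmse(fit_complex, test))
# The complex model shows the lower training error but the higher test
# error: it has memorized the noise rather than learned the trend.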

However, especially in cases where there is a small number of...

Feature selection


In general, when we work with high-dimensional datasets, it is a good idea to keep only the most useful features and discard the rest. This can lead to simpler models that generalize better. Feature selection is the process of reducing the inputs used for processing and analysis, or of identifying the most significant features among them. This selection is needed to build a workable model: it reduces the cardinality of the input space by imposing an upper limit on the number of features considered during model creation. A general scheme of the feature selection process is shown in the following figure.

Usually, data contains redundant or unnecessary information; in other cases, it may contain incorrect information. Feature selection makes the process of creating a model more efficient, for example, by decreasing the load on the CPU and the memory needed to train the algorithm. Moreover, selection of features...
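
As a minimal sketch of the idea in R (assuming the built-in mtcars data, which the book does not necessarily use), base R's step() function performs the backward stepwise selection referred to in the next section, dropping predictors one at a time as long as doing so improves the AIC:

# Backward stepwise selection on the built-in mtcars data
full_model <- lm(mpg ~ ., data = mtcars)   # start from all predictors

# step() removes one predictor at a time, keeping the removal that most
# improves the AIC, and stops when no single removal helps
reduced_model <- step(full_model, direction = "backward", trace = 0)

summary(reduced_model)   # only the most useful features remain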

Regularization


As an alternative to the selection methods discussed in the previous sections (forward, backward, stepwise), it is possible to adopt methods that use all the predictors but constrain or adjust their coefficients, shrinking them towards very small or zero values (shrinkage). These methods effectively perform automatic feature selection, as they improve generalization. They are called regularization methods and work by modifying the performance function, normally chosen as the sum of the squared regression errors on the training set.

When a large number of variables is available, the least squares estimates of a linear model often have low bias but high variance compared with models that use fewer variables. Under these conditions, as we have seen in the previous sections, an overfitting problem arises. To improve prediction accuracy by accepting a little more bias in exchange for much lower variance, we can use variable selection methods and dimensionality reduction, but these methods may be unattractive...
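
As a hedged sketch of shrinkage in practice (the glmnet package and the mtcars data are this example's assumptions, not necessarily the book's choices), ridge regression and the lasso penalize the sum of squared errors with a term that pulls coefficients towards zero:

library(glmnet)   # implements ridge regression (alpha = 0) and lasso (alpha = 1)

x <- as.matrix(mtcars[, -1])   # all predictors except the response
y <- mtcars$mpg

# Cross-validation chooses the penalty strength lambda automatically
cv_ridge <- cv.glmnet(x, y, alpha = 0)
cv_lasso <- cv.glmnet(x, y, alpha = 1)

coef(cv_ridge, s = "lambda.min")   # all coefficients shrunk, none exactly zero
coef(cv_lasso, s = "lambda.min")   # some coefficients set exactly to zero

The lasso's ability to set coefficients exactly to zero is what makes regularization behave like the automatic feature selection described above.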

Summary


In this chapter, we learned how to achieve generalization for our models. We explored several techniques for avoiding overfitting and creating models with low bias and low variance. At the beginning, the differences between overfitting and underfitting were explained.

In general, overfitting occurs when a very complex statistical model fits the observed data too closely because it has too many parameters relative to the number of observations. The risk is that an incorrect model can fit the data perfectly simply because it is complex enough relative to the amount of data available. Consequently, when the model is used to predict new observations, it fails because it is unable to generalize. On the contrary, underfitting occurs when a regression algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to nonlinear data; such a model would have poor predictive performance.

We then discovered the cross-validation procedure through...
