Regression Analysis with Python
By Luca Massaron and Alberto Boschetti
Published by Packt, February 2016 (1st Edition). ISBN-13: 9781785286315.
Authors:

Luca Massaron
Having joined Kaggle over 10 years ago, Luca Massaron is a Kaggle Grandmaster in discussions and a Kaggle Master in competitions and notebooks. In Kaggle competitions he reached no. 7 in the worldwide rankings. On the professional side, Luca is a data scientist with more than a decade of experience in transforming data into smarter artifacts, solving real-world problems, and generating value for businesses and stakeholders. He is a Google Developer Expert (GDE) in machine learning and the author of best-selling books on AI, machine learning, and algorithms.

Alberto Boschetti
Alberto Boschetti is a data scientist with expertise in signal processing and statistics. He holds a Ph.D. in telecommunication engineering and currently lives and works in London. In his work projects, he faces challenges ranging from natural language processing (NLP) and behavioral analysis to machine learning and distributed processing. He is very passionate about his job and always tries to stay up to date with the latest developments in data science technologies, attending meet-ups, conferences, and other events.

Chapter 6. Achieving Generalization

We have to confess that, until this point, we have delayed the crucial moment of truth when our linear model is put to the test and verified as an effective predictor of its target. Up to now, we have judged whether we were doing a good modeling job by naively looking at a series of goodness-of-fit measures, all of which only tell us whether the linear model is apt at predicting based solely on the information in our training data.

Unless you love sink-or-swim situations, you need to apply the correct tests to your model, in much the same way you would test new software before putting it into production, so that you can anticipate its live performance.

Moreover, no matter your level of skill and experience with such types of models, you can easily be misled into thinking you're building a good model just on the basis of the same data you used to define it. We will therefore introduce you to the fundamental distinction between in-sample and out-of-sample...

Checking on out-of-sample data


Until this point in the book, we have striven to make the regression model fit the data, even by modifying the data itself (imputing missing values, removing outliers, transforming for non-linearity, or creating new features). By keeping an eye on measures such as R-squared, we have tried our best to reduce prediction errors, though we have no idea to what extent we have actually succeeded.

The problem we face now is that we should not expect a well-fitted model to automatically perform well on new data once in production.

While defining and explaining the problem, let's recall what we said about underfitting. Since we are working with a linear model, we are actually expecting to apply our work to data that has a linear relationship with the response variable. Having a linear relationship means that, with respect to the level of the response variable, our predictors always tend to increase (or decrease) at the same constant rate. Graphically, on a scatterplot, this is represented...
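
To make the in-sample versus out-of-sample distinction concrete, here is a minimal sketch on synthetic data (our own example, not the book's code); the dataset, split ratio, and random seeds are arbitrary assumptions:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic, noisy regression data: only a few of the features are informative
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=25.0, random_state=0)

# Keep aside a test set that the model never sees while fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

lr = LinearRegression().fit(X_train, y_train)

# The in-sample score is usually optimistic; the out-of-sample score is the
# more honest preview of how the model will behave on new data
print('In-sample R2:     %.3f' % r2_score(y_train, lr.predict(X_train)))
print('Out-of-sample R2: %.3f' % r2_score(y_test, lr.predict(X_test)))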

Greedy selection of features


By following our experiments throughout the book, you may have noticed that adding new variables always seems to be a great success for a linear regression model. That is especially true for training errors, and it happens not just when we insert the right variables but also when we insert the wrong ones. Puzzlingly, when we add redundant or useless variables, there is always a more or less positive impact on the fit of the model.

The reason is easily explained: since regression models are high-bias models, they find it beneficial to augment their complexity by increasing the number of coefficients they use. Some of the new coefficients can thus be used to fit the noise and other details present in the data. This is precisely the memorization/overfitting effect we discussed before. When you have as many coefficients as observations, your model becomes saturated (that's the technical term used in statistics) and you could obtain a perfect prediction, because basically you have...
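
As a hedged illustration of this effect (again on synthetic data of our own making, with arbitrary sizes and seeds), you can watch the training R-squared creep upward as you append columns of pure noise to the predictors:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=5, noise=20.0, random_state=1)
rng = np.random.RandomState(1)
model = LinearRegression()

for n_junk in (0, 10, 25, 50, 90):
    # Append n_junk columns of random noise that carry no information about y
    X_aug = np.hstack([X, rng.normal(size=(len(X), n_junk))]) if n_junk else X
    r2_train = model.fit(X_aug, y).score(X_aug, y)  # training R-squared only
    print('predictors: %3d   training R2: %.3f' % (X_aug.shape[1], r2_train))

With 95 coefficients for 100 observations, the training fit becomes almost perfect even though 90 of the columns are pure noise, which is exactly the saturation effect described above.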

Stability selection


As presented, the L1 penalty offers the advantage of making your coefficient estimates sparse, so it effectively acts as a variable selector, since it tends to leave only the essential variables in the model. On the other hand, the selection itself tends to be unstable when the data changes, and it requires a certain effort to correctly tune the C parameter to make the selection most effective. As we saw while discussing elastic net, the peculiarity resides in the behavior of Lasso when there are two highly correlated variables: depending on the structure of the data (noise and correlation with the other variables), L1 regularization will keep just one of the two.
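
Here is a small sketch of that behavior on a toy dataset of our own (the alpha value, noise levels, and seed are arbitrary assumptions): with two nearly identical predictors, Lasso typically keeps one and shrinks the other to zero.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost a copy of x1
x3 = rng.normal(size=n)                    # an unrelated predictor
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
# Typically only one of the two correlated coefficients survives;
# which one it is can flip with a different noise draw or subsample
print(lasso.coef_)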

In fields related to bioinformatics (DNA and molecular studies), it is common to work with a very large number of variables but only a few observations. Such problems are typically denoted p >> n (the features are far more numerous than the cases), and they present the necessity to select which features to...
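
Stability selection, as proposed by Meinshausen and Bühlmann, repeatedly refits a randomized Lasso on subsamples of the data and keeps only the features selected in a large fraction of the runs; older scikit-learn releases shipped a RandomizedLasso estimator for this. What follows is a rough, simplified sketch of the idea on synthetic p >> n data of our own (plain subsampling, without the per-feature penalty rescaling of the full method; alpha, the subsampling fraction, and the threshold are arbitrary assumptions):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# p >> n: 200 candidate features, only 100 observations, 5 truly informative
X, y = make_regression(n_samples=100, n_features=200, n_informative=5,
                       noise=10.0, random_state=0)

rng = np.random.RandomState(0)
n_rounds, frac = 200, 0.75
selected = np.zeros(X.shape[1])

for _ in range(n_rounds):
    # Refit the Lasso on a random 75% subsample of the observations
    idx = rng.choice(X.shape[0], size=int(frac * X.shape[0]), replace=False)
    coef = Lasso(alpha=1.0, max_iter=10000).fit(X[idx], y[idx]).coef_
    selected += np.abs(coef) > 1e-6

stability = selected / n_rounds
# Features that survive in more than 80% of the refits are the stable ones
print('Stable features:', np.where(stability > 0.8)[0])

The exact alpha and threshold matter less than the counting itself: features that keep being picked across many perturbed refits are the ones worth trusting.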

Summary


In this chapter, we have covered quite a lot of ground, finally exploring the most experimental and scientific part of the task of building linear regression and classification models.

Starting with the topic of generalization, we explained what can go wrong in a model and why it is always important to check the true performance of your work by train/test splits, bootstraps, and cross-validation (though we recommend using the latter more for validation work than for general evaluation).

Model complexity as a source of variance in the estimates gave us the occasion to introduce variable selection, first by greedy selection of features, univariate or multivariate, and then by regularization techniques such as Ridge, Lasso, and Elastic Net.

Finally, we demonstrated a powerful application of Lasso, called stability selection, which, in the light of our experience, we recommend you try for many feature selection problems.

In the next chapter, we will deal with the problem of incrementally...
