
Chapter 5. Data Preparation

After providing solid foundations for understanding the two basic linear models for regression and classification, we devote this chapter to a discussion of the data feeding the model. In the pages that follow, we will describe what can routinely be done to prepare the data in the best way and how to deal with more challenging situations, such as when data is missing or outliers are present.

Real-world experiments produce real data, which, in contrast to synthetic or simulated data, is often very varied. Real data is also quite messy, and it frequently proves faulty in ways that are sometimes obvious and sometimes, at least initially, quite subtle. As a data practitioner, you will almost never find your data already prepared in the right form to be analyzed immediately for your purposes.

Writing a compendium of bad data and its remedies is outside the scope of this book, but our intention is to provide you with the basics to help you manage the majority of common data problems...

Numeric feature scaling


In Chapter 3, Multiple Regression in Action, inside the feature scaling section, we discussed how bringing your original variables to a similar scale can help you better interpret the resulting regression coefficients. Moreover, scaling is essential when using gradient descent-based algorithms because it helps them converge to a solution more quickly. We will also introduce other techniques later that can only work with scaled features. However, apart from the technical requirements of certain algorithms, our intention now is to draw your attention to how feature scaling can be helpful when working with data that can sometimes be missing or faulty.

Missing or wrong data can happen not just during training but also during the production phase. Now, if a missing value is encountered, you have two design options to create a model sufficiently robust to cope with such a problem:

  • Actively deal with the missing values (there is a paragraph in this chapter devoted to...
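As a concrete illustration of scaling, here is a minimal sketch (not taken from the book's own code) using Scikit-learn's StandardScaler and MinMaxScaler on a small, made-up feature matrix:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A small, made-up feature matrix with two columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardization: each column is rescaled to zero mean and unit variance
X_standardized = StandardScaler().fit_transform(X)

# Min-max scaling: each column is rescaled to the [0, 1] interval
X_minmax = MinMaxScaler().fit_transform(X)

print(X_standardized)
print(X_minmax)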

Qualitative feature encoding


Beyond numeric features, which have been the main topic of the chapter so far, a great part of your data will also comprise qualitative variables. Databases in particular tend to record data in a form readable and understandable by human beings; consequently, they are quite crowded with qualitative data, which can appear in data fields as free text or as single labels conveying information such as the class of an observation or some of its characteristics.

For a better understanding of qualitative variables, a working example is a weather dataset. Such a dataset describes the conditions under which you would want to play tennis, based on weather information such as outlook, temperature, humidity, and wind, all of which could be rendered as numeric measurements. However, you will easily find such data online recorded in datasets with qualitative translations such as sunny or overcast, rather than numeric satellite...
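To illustrate how such a qualitative variable can be turned into something a linear model can digest, here is a hedged sketch of one-hot encoding with pandas; the outlook column and its values are hypothetical examples, not data from the book:

import pandas as pd

# Hypothetical qualitative weather feature (not data from the book)
weather = pd.DataFrame({'outlook': ['sunny', 'overcast', 'rainy', 'sunny']})

# One-hot encoding: each qualitative level becomes a binary indicator column
encoded = pd.get_dummies(weather['outlook'], prefix='outlook')
print(encoded)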

Numeric feature transformation


Numeric features can be transformed, regardless of the target variable. This is often a prerequisite for the better performance of certain classifiers, particularly distance-based ones. We usually avoid transforming the target (apart from specific cases, such as when modeling a percentage or a distribution with a long tail), since doing so turns any pre-existing linear relationship between the target and the other features into a non-linear one.

We will keep working with the Boston Housing dataset:

In: import numpy as np
    from sklearn.datasets import load_boston

    # Load the Boston Housing data: feature names, predictor matrix, and target
    boston = load_boston()
    labels = boston.feature_names
    X = boston.data
    y = boston.target
    print(boston.feature_names)

Out: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']

As before, we fit the model using LinearRegression from Scikit-learn, this time measuring its R-squared value using the r2_score function from the metrics module:

In: linear_regression = linear_model.LinearRegression(fit_intercept=True)
    linear_regression...
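Continuing from the X and y loaded above, the following sketch fits a baseline linear regression, then adds a squared term for the last column (LSTAT) and refits, comparing the two R-squared values; the choice of a quadratic term on LSTAT is purely illustrative, not the transformation prescribed by the book:

import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score

linear_regression = linear_model.LinearRegression(fit_intercept=True)

# Baseline fit on the untransformed features
linear_regression.fit(X, y)
print(r2_score(y, linear_regression.predict(X)))

# Append the square of LSTAT (the last column) and refit to compare R-squared
X_ext = np.column_stack((X, X[:, -1] ** 2))
linear_regression.fit(X_ext, y)
print(r2_score(y, linear_regression.predict(X_ext)))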

Missing data


Missing data often appears in real-life datasets, sometimes at random, but more often because of some bias in how the data was recorded and treated. All linear models work on complete numeric matrices and cannot deal with such problems directly; consequently, it is up to you to feed the algorithm suitable data to process.

Even if your initial dataset does not present any missing data, it is still possible to encounter missing values in the production phase. In such a case, the best strategy is surely that of dealing with them passively, as presented at the beginning of the chapter, by standardizing all the numeric variables.

Tip

As for indicator variables, a possible strategy to passively intercept missing values is instead to encode the presence of a label as 1 and its absence as -1, leaving the zero value for missing values.
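The following minimal sketch illustrates the passive strategy described above, under the assumption that standardization is done with Scikit-learn's StandardScaler; the small training matrix and the simulated missing value are hypothetical:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Small, made-up training matrix with no missing values
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

scaler = StandardScaler().fit(X)

# A new observation arriving in production with an unexpected missing value
new_obs = np.array([[2.5, np.nan]])

# Standardize manually with the learned mean and scale, then replace the
# missing value with 0, which is the column mean on the standardized scale
new_scaled = (new_obs - scaler.mean_) / scaler.scale_
new_scaled[np.isnan(new_scaled)] = 0.0
print(new_scaled)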

When missing values are present from the beginning of the project, it is certainly better to deal with them explicitly...

Outliers


After properly transforming all the quantitative and qualitative variables and fixing any missing data, all that's left is to detect any possible outliers and deal with them, either by removing them from the data or by imputing them as if they were missing cases.

An outlier, sometimes also referred to as an anomaly, is an observation that is very different from all the others you have observed so far. It can be viewed as an unusual case that stands out, and it could pop up because of a mistake (an erroneous value completely out of scale) or simply because it is a rare value that legitimately occurred. Though understanding the origin of an outlier can help you fix the problem in the most appropriate way (an error can legitimately be removed; a rare case can be kept, capped, or even imputed as a missing case), what is of utmost concern is the effect of one or more outliers on your regression analysis results. Any anomalous data in a regression analysis means a distortion of the regression's coefficients...
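As a hedged illustration of spotting a single out-of-scale value, here is a sketch using the common 1.5 × interquartile-range (boxplot) rule; both the sample values and the threshold convention are illustrative, not taken from the book:

import numpy as np

# Made-up sample with one value clearly out of scale
values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])

# Tukey's boxplot rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = (values < lower) | (values > upper)
print(values[outliers])   # flags the anomalous 25.0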

Summary


In this chapter, we have dealt with many different problems that you may encounter when preparing your data to be analyzed by a linear model.

We started by discussing how to rescale variables and saw how the new scales not only permit better insight into the data, but also help us deal with unexpectedly missing data.

Then, we learned how to encode qualitative variables, and how to deal with an extreme variety of possible levels, unpredictable values, and textual information just by using the hashing trick. We then returned to quantitative variables and learned how to transform them into a linear shape and obtain better regression models.

Finally, we dealt with some possible data pathologies, missing and outlying values, showing a few quick fixes that, in spite of their simplicity, are extremely effective and performant.

At this point, before proceeding to more sophisticated linear models, we just need to illustrate the data science principles that can help you obtain really good...
