
Chapter 5. Data Preparation

After providing solid foundations for understanding the two basic linear models for regression and classification, we devote this chapter to a discussion of the data feeding the model. In the pages that follow, we will describe what can routinely be done to prepare the data in the best way and how to deal with more challenging situations, such as when data is missing or outliers are present.

Real-world experiments produce real data, which, in contrast to synthetic or simulated data, is often very varied. Real data is also quite messy, and it frequently proves faulty in ways that are sometimes obvious and sometimes, at least initially, quite subtle. As a data practitioner, you will almost never find your data already prepared in the right form to be analyzed immediately for your purposes.

Writing a compendium of bad data and its remedies is outside the scope of this book, but our intention is to provide you with the basics to help you manage the majority of common data problems...

Numeric feature scaling


In Chapter 3, Multiple Regression in Action, inside the feature scaling section, we discussed how bringing your original variables to a similar scale can help you better interpret the resulting regression coefficients. Moreover, scaling is essential when using gradient descent-based algorithms because it helps them converge to a solution more quickly. We will also introduce other techniques later that can only work with scaled features. However, apart from the technical requirements of certain algorithms, our intention now is to draw your attention to how feature scaling can be helpful when working with data that can sometimes be missing or faulty.

Missing or wrong data can happen not just during training but also during the production phase. Now, if a missing value is encountered, you have two design options to create a model sufficiently robust to cope with such a problem:

  • Actively deal with the missing values (there is a paragraph in this chapter devoted to...
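As a concrete illustration of scaling, here is a minimal sketch (not taken from the book's own code) using Scikit-learn's StandardScaler and MinMaxScaler on a small, made-up feature matrix:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# A small, made-up feature matrix with two columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardization: each column is rescaled to zero mean and unit variance
X_standardized = StandardScaler().fit_transform(X)

# Min-max scaling: each column is rescaled to the [0, 1] interval
X_minmax = MinMaxScaler().fit_transform(X)

print(X_standardized)
print(X_minmax)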

Qualitative feature encoding


Beyond numeric features, which have been the main topic of the chapter so far, a great part of your data will also comprise qualitative variables. Databases in particular tend to record data in a form readable and understandable by human beings; consequently, they are quite crowded with qualitative data, which can appear in data fields as free text or as single labels conveying information such as the class of an observation or some of its characteristics.

For a better understanding of qualitative variables, a working example is a weather dataset. Such a dataset describes the conditions under which you would want to play tennis, based on weather information such as outlook, temperature, humidity, and wind, all of which could be rendered as numeric measurements. However, you will easily find such data online recorded in datasets with qualitative translations such as sunny or overcast, rather than numeric satellite...
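To illustrate how such a qualitative variable can be turned into something a linear model can digest, here is a hedged sketch of one-hot encoding with pandas; the outlook column and its values are hypothetical examples, not data from the book:

import pandas as pd

# Hypothetical qualitative weather feature (not data from the book)
weather = pd.DataFrame({'outlook': ['sunny', 'overcast', 'rainy', 'sunny']})

# One-hot encoding: each qualitative level becomes a binary indicator column
encoded = pd.get_dummies(weather['outlook'], prefix='outlook')
print(encoded)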

Numeric feature transformation


Numeric features can be transformed, regardless of the target variable. This is often a prerequisite for the better performance of certain classifiers, particularly distance-based ones. We usually avoid transforming the target (apart from specific cases, such as when modeling a percentage or a distribution with a long tail), since doing so turns any pre-existing linear relationship between the target and the other features into a non-linear one.

We will keep working with the Boston Housing dataset:

In: import numpy as np
    from sklearn.datasets import load_boston

    # Load the Boston Housing data: feature names, predictor matrix, and target
    boston = load_boston()
    labels = boston.feature_names
    X = boston.data
    y = boston.target
    print(boston.feature_names)

Out: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']

As before, we fit the model using LinearRegression from Scikit-learn, this time measuring its R-squared value using the r2_score function from the metrics module:

In: linear_regression = linear_model.LinearRegression(fit_intercept=True)
    linear_regression...
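Continuing from the X and y loaded above, the following sketch fits a baseline linear regression, then adds a squared term for the last column (LSTAT) and refits, comparing the two R-squared values; the choice of a quadratic term on LSTAT is purely illustrative, not the transformation prescribed by the book:

import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score

linear_regression = linear_model.LinearRegression(fit_intercept=True)

# Baseline fit on the untransformed features
linear_regression.fit(X, y)
print(r2_score(y, linear_regression.predict(X)))

# Append the square of LSTAT (the last column) and refit to compare R-squared
X_ext = np.column_stack((X, X[:, -1] ** 2))
linear_regression.fit(X_ext, y)
print(r2_score(y, linear_regression.predict(X_ext)))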

Missing data


Missing data often appears in real-life datasets, sometimes at random, but more often because of some bias in how the data was recorded and treated. All linear models work on complete numeric matrices and cannot deal with such problems directly; consequently, it is up to you to feed the algorithm suitable data to process.

Even if your initial dataset does not present any missing data, it is still possible to encounter missing values in the production phase. In such a case, the best strategy is surely that of dealing with them passively, as presented at the beginning of the chapter, by standardizing all the numeric variables.

Tip

As for indicator variables, a possible strategy to passively intercept missing values is instead to encode the presence of a label as 1 and its absence as -1, leaving the zero value for missing values.
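The following minimal sketch illustrates the passive strategy described above, under the assumption that standardization is done with Scikit-learn's StandardScaler; the small training matrix and the simulated missing value are hypothetical:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Small, made-up training matrix with no missing values
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

scaler = StandardScaler().fit(X)

# A new observation arriving in production with an unexpected missing value
new_obs = np.array([[2.5, np.nan]])

# Standardize manually with the learned mean and scale, then replace the
# missing value with 0, which is the column mean on the standardized scale
new_scaled = (new_obs - scaler.mean_) / scaler.scale_
new_scaled[np.isnan(new_scaled)] = 0.0
print(new_scaled)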

When missing values are present from the beginning of the project, it is certainly better to deal with them explicitly...

Outliers


After properly transforming all the quantitative and qualitative variables and fixing any missing data, all that's left is to detect any possible outliers and deal with them, either by removing them from the data or by imputing them as if they were missing cases.

An outlier, sometimes also referred to as an anomaly, is an observation that is very different from all the others you have observed so far. It can be viewed as an unusual case that stands out, and it could pop up because of a mistake (an erroneous value completely out of scale) or simply because it is a rare value that legitimately occurred. Though understanding the origin of an outlier can help you fix the problem in the most appropriate way (an error can legitimately be removed; a rare case can be kept, capped, or even imputed as a missing case), what is of utmost concern is the effect of one or more outliers on your regression analysis results. Any anomalous data in a regression analysis means a distortion of the regression's coefficients...
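As a hedged illustration of spotting a single out-of-scale value, here is a sketch using the common 1.5 × interquartile-range (boxplot) rule; both the sample values and the threshold convention are illustrative, not taken from the book:

import numpy as np

# Made-up sample with one value clearly out of scale
values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])

# Tukey's boxplot rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = (values < lower) | (values > upper)
print(values[outliers])   # flags the anomalous 25.0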

Summary


In this chapter, we have dealt with many different problems that you may encounter when preparing your data to be analyzed by a linear model.

We started by discussing how to rescale variables and saw how the new scales not only permit better insight into the data, but also help us deal with unexpectedly missing data.

Then, we learned how to encode qualitative variables, and how to deal with an extreme variety of possible levels, unpredictable values, and textual information just by using the hashing trick. We then returned to quantitative variables and learned how to transform them into a linear shape and obtain better regression models.

Finally, we dealt with some possible data pathologies, missing and outlying values, showing a few quick fixes that, in spite of their simplicity, are extremely effective and performant.

At this point, before proceeding to more sophisticated linear models, we just need to illustrate the data science principles that can help you obtain really good...
