Pre-Model Workflow and Pre-Processing

In this chapter we will see the following recipes:

  • Creating sample data for toy analysis
  • Scaling data to the standard normal distribution
  • Creating binary features through thresholding
  • Working with categorical variables
  • Imputing missing values through various strategies
  • A linear model in the presence of outliers
  • Putting it all together with pipelines
  • Using Gaussian processes for regression
  • Using SGD for regression

Introduction

What is data, and what are we doing with it?

A simple answer is that we attempt to place our data as points on paper, graph them, think, and look for simple explanations that approximate the data well. The simple linear relation F = ma (force proportional to acceleration) explained a lot of noisy data for hundreds of years. I tend to think of data science as data compression at times.

Sometimes, when a machine is trained on nothing but win-lose outcomes (from games of checkers, for example), I think of it as artificial intelligence: it is never taught explicit directions on how to play to win in such a case.

This chapter deals with the pre-processing of data in scikit-learn. Some questions you can ask about your dataset are as follows:

  • Are there missing values in your dataset?
  • Are there outliers (points far away from the others) in your set?
  • What are the variables...

Creating sample data for toy analysis

If possible, use some of your own data with this book, but in the event you cannot, we'll learn how to use scikit-learn to create toy data. scikit-learn's artificial, theoretically constructed data is very interesting in its own right.

Getting ready

Much like the functions for loading built-in datasets and fetching new datasets, the functions for creating sample datasets follow the naming convention make_*. Just to be clear, this data is purely artificial:

from sklearn import datasets
# In IPython/Jupyter, a trailing ? performs a wildcard search and lists matches:
datasets.make_*?

datasets.make_biclusters
datasets.make_blobs
datasets.make_checkerboard
datasets.make_circles
datasets.make_classification
...

To save typing, import the datasets module as d, and numpy...
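Continuing in that spirit, here is a minimal sketch (not the book's exact continuation; the sample counts and noise level below are illustrative) showing two of the make_* helpers in action:

# Create a toy regression dataset and a toy clustering/classification dataset
import sklearn.datasets as d
import numpy as np

# 100 samples, 2 informative features, with a little Gaussian noise
reg_X, reg_y = d.make_regression(n_samples=100, n_features=2, n_informative=2, noise=1.0)

# Three well-separated clusters
blob_X, blob_y = d.make_blobs(n_samples=100, centers=3, random_state=0)

print(reg_X.shape, reg_y.shape)    # (100, 2) (100,)
print(blob_X.shape, blob_y.shape)  # (100, 2) (100,)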

Scaling data to the standard normal distribution

A recommended pre-processing step is to scale columns to the standard normal. The standard normal is probably the most important distribution in statistics. If you've ever been introduced to statistics, you have almost certainly seen z-scores. In truth, that's all this recipe is about: transforming our features from their endowed distribution into z-scores.

Getting ready

The act of scaling data is extremely useful. Many machine learning algorithms perform differently (and incorrectly) when features exist at different scales. For example, SVMs perform poorly if the data isn't scaled because they use a distance...
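As a hedged sketch of the scaling step itself (illustrative values, not the book's code), StandardScaler transforms each column to z-scores:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Scale each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]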

Creating binary features through thresholding

In the last recipe, we looked at transforming our data into the standard normal distribution. Now, we'll talk about another transformation, one that is quite different. Instead of working with the distribution to standardize it, we'll purposely throw away data; if we have good reason, this can be a very smart move. Often, in what is ostensibly continuous data, there are discontinuities that can be determined via binary features.

Additionally, note that in the previous chapter, we turned a classification problem into a regression problem. With thresholding, we can turn a regression problem into a classification problem. This happens in some data science contexts.
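A minimal sketch of thresholding (the column and cut-off below are invented for illustration, not from the book's data) uses Binarizer to turn a continuous column into a binary feature:

import numpy as np
from sklearn.preprocessing import Binarizer

ages = np.array([[6.0], [12.0], [20.0], [35.0], [65.0]])

# Everything strictly above the threshold becomes 1, the rest 0
binarizer = Binarizer(threshold=18.0)
is_adult = binarizer.fit_transform(ages)
print(is_adult.ravel())  # [0. 0. 1. 1. 1.]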

Getting ready

...

Working with categorical variables

Categorical variables are a problem. On one hand, they provide valuable information; on the other hand, they are probably stored as text: either the actual text or integers corresponding to the text, such as an index in a lookup table.

So, we clearly need to represent our text as integers for the model's sake, but we can't just use an id field or naively encode the categories as integers. This is because we need to avoid a problem similar to the one in the Creating binary features through thresholding recipe: if we encode categorical data as if it were continuous, the model will interpret it as continuous, implying an ordering and spacing between categories that doesn't exist.
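One common remedy is one-hot encoding. The sketch below assumes scikit-learn >= 0.20 (where OneHotEncoder accepts string categories directly); the book's own walkthrough may use different tools:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['red'], ['green'], ['blue'], ['green']])

# Each category becomes its own 0/1 column, so no false ordering is implied
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(colors).toarray()

print(encoder.categories_)  # [array(['blue', 'green', 'red'], dtype=...)]
print(one_hot)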

Getting ready

The Boston dataset won't be useful for this section. While it's useful for feature binarization, it...

Imputing missing values through various strategies

Data imputation is critical in practice, and thankfully there are many ways to deal with it. In this recipe, we'll look at a few of the strategies. However, be aware that there might be other approaches that fit your situation better.

Scikit-learn comes with the ability to perform fairly common imputations; it simply applies transformations to the existing data and fills the NAs. However, if the dataset is missing data, and there's a known reason for the missingness (for example, response times for a server that times out after 100 ms), it might be better to take a statistical approach through other packages, such as a Bayesian treatment via PyMC, hazard models via Lifelines, or something home-grown.
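As a minimal sketch of the common case, the code below uses the modern sklearn.impute.SimpleImputer; the edition this book targets may instead use the older preprocessing.Imputer class, which worked much the same way:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))
# [[1.  2. ]
#  [4.  3. ]
#  [7.  2.5]]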

...

A linear model in the presence of outliers

In this recipe, instead of traditional linear regression, we will try the Theil-Sen estimator to deal with some outliers.

Getting ready

First, create the data corresponding to a line with a slope of 2:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

num_points = 100
x_vals = np.arange(num_points)
y_truth = 2 * x_vals
plt.plot(x_vals, y_truth)

Add noise to that data and label it as y_noisy:

y_noisy = y_truth.copy()
# Change y-values of some points in the line
y_noisy[20:40] = y_noisy[20:40] * (-4 * x_vals[20:40]) - 100

plt.title("Noise in y-direction")
plt.xlim([0, 100])
plt.scatter(x_vals, y_noisy, marker='x')
...
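A sketch of where the recipe is headed (reusing x_vals and y_noisy from above; this is not the book's exact code): fit ordinary least squares and the Theil-Sen estimator, then compare the recovered slopes.

from sklearn.linear_model import LinearRegression, TheilSenRegressor

X = x_vals.reshape(-1, 1)

ols = LinearRegression().fit(X, y_noisy)
theil_sen = TheilSenRegressor(random_state=0).fit(X, y_noisy)

print("OLS slope:      ", ols.coef_[0])        # pulled away from 2 by the outliers
print("Theil-Sen slope:", theil_sen.coef_[0])  # stays much closer to the true slope of 2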

Putting it all together with pipelines

Now that we've used pipelines and data transformation techniques, we'll walk through a more complicated example that combines several of the previous recipes into a pipeline.

Getting ready

In this section, we'll show off some more of the pipeline's power. When we used it earlier to impute missing values, it was only a quick taste; here, we'll chain together multiple pre-processing steps to show how pipelines can remove extra work.

Let's briefly load the iris dataset and seed it with some missing values:

from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
iris_data = iris.data
mask = np.random.binomial...
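The excerpt cuts off at the mask. A minimal self-contained sketch of the kind of chain this recipe builds might look like the following; the book's exact masking and steps may differ, and SimpleImputer assumes a recent scikit-learn:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

iris = load_iris()
iris_data = iris.data.copy()

# Knock out roughly 25% of the entries at random to simulate missing values
np.random.seed(0)
mask = np.random.binomial(1, 0.25, iris_data.shape).astype(bool)
iris_data[mask] = np.nan

# Chain imputation and scaling into a single estimator
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])
iris_clean = pipe.fit_transform(iris_data)
print(np.isnan(iris_clean).any())  # False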

Using Gaussian processes for regression

In this recipe, we'll use a Gaussian process for regression. In the chapter on linear models, we will see how to represent prior information on the coefficients using Bayesian ridge regression.

With a Gaussian process, it's about the variance and not the mean: we assume the mean is 0, so it's the covariance function we'll need to specify.

The basic setup is similar to putting a prior on the coefficients in a typical regression problem. With a Gaussian process, the prior is placed on the functional form of the data, and it's the covariance between the data points that is used to model the data and, therefore, must fit the data.

A big advantage of Gaussian processes is that they can predict probabilistically: you can obtain confidence bounds on your predictions. Additionally...
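A minimal sketch of the idea, assuming scikit-learn >= 0.18 (GaussianProcessRegressor with an RBF kernel; the toy sine data and kernel settings are illustrative, not the book's own):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 30).reshape(-1, 1)
y = np.sin(X).ravel()

# The RBF kernel encodes our assumption about covariance between points
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
gpr.fit(X, y)

# return_std=True gives the probabilistic part: per-point uncertainty
y_pred, y_std = gpr.predict(X, return_std=True)
print(y_pred[:3], y_std[:3])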

Using SGD for regression

In this recipe, we'll get our first taste of stochastic gradient descent. We'll use it for regression here.

Getting ready

SGD is often an unsung hero in machine learning. Underneath many algorithms, there is SGD doing the work. It's popular due to its simplicity and speed, both very good things to have when dealing with a lot of data. The other nice thing about SGD is that, while it sits at the computational core of many machine learning algorithms, it does so because it describes the learning process simply: at the end of the day, we apply some transformation to the data, and then we fit the data to the model with a loss function.
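A minimal sketch of SGD-based regression (the data and parameter values are illustrative, not the book's exact code); scaling comes first because SGD is sensitive to feature scale:

from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)

# SGDRegressor minimizes a squared loss one sample (or mini-batch) at a time
sgd = Pipeline([
    ('scale', StandardScaler()),
    ('sgd', SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)),
])
sgd.fit(X, y)
print(sgd.score(X, y))  # R^2 on the training data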

...