Python Machine Learning Cookbook

Python Machine Learning Cookbook: 100 recipes that teach you how to perform various machine learning tasks in the real world

By Prateek Joshi , Vahid Mirjalili
Book Jun 2016 304 pages 1st Edition

Chapter 1. The Realm of Supervised Learning

In this chapter, we will cover the following recipes:

  • Preprocessing data using different techniques

  • Label encoding

  • Building a linear regressor

  • Computing regression accuracy

  • Achieving model persistence

  • Building a ridge regressor

  • Building a polynomial regressor

  • Estimating housing prices

  • Computing the relative importance of features

  • Estimating bicycle demand distribution


If you are familiar with the basics of machine learning, you will certainly know what supervised learning is all about. To give you a quick refresher, supervised learning refers to building a machine learning model that is based on labeled samples. For example, if we build a system to estimate the price of a house based on various parameters, such as size, locality, and so on, we first need to create a database and label it. We need to tell our algorithm what parameters correspond to what prices. Based on this data, our algorithm will learn how to calculate the price of a house using the input parameters.

Unsupervised learning is the opposite of what we just discussed. There is no labeled data available here. Let's assume that we have a bunch of datapoints, and we just want to separate them into multiple groups. We don't exactly know what the criteria of separation would be. So, an unsupervised learning algorithm will try to separate the given dataset into a fixed number of groups in the best possible way. We will discuss unsupervised learning in the upcoming chapters.

We will use various Python packages, such as NumPy, SciPy, scikit-learn, and matplotlib, throughout this book. If you use Windows, it is recommended that you use a SciPy-stack compatible distribution of Python, because these distributions come with all the necessary packages already installed. If you use Mac OS X or Ubuntu, installing these packages is fairly straightforward. Installation instructions and documentation are available on the official website of each package.

Make sure that you have these packages installed on your machine before you proceed.

Preprocessing data using different techniques

In the real world, we usually have to deal with a lot of raw data. This raw data is not readily ingestible by machine learning algorithms. To prepare the data for machine learning, we have to preprocess it before we feed it into various algorithms.

Getting ready

Let's see how to preprocess data in Python. To start off, open a new file with a .py extension in your favorite text editor. Add the following lines to this file:

import numpy as np
from sklearn import preprocessing

We just imported a couple of necessary packages. Let's create some sample data. Add the following line to this file:

data = np.array([[3, -1.5,  2, -5.4], [0,  4,  -0.3, 2.1], [1,  3.3, -1.9, -4.3]])

We are now ready to operate on this data.

How to do it…

Data can be preprocessed in many ways. We will discuss a few of the most commonly-used preprocessing techniques.

Mean removal

It's usually beneficial to remove the mean from each feature so that it's centered on zero. This helps us in removing any bias from the features. Add the following lines to the file that we opened earlier:

data_standardized = preprocessing.scale(data)
print "\nMean =", data_standardized.mean(axis=0)
print "Std deviation =", data_standardized.std(axis=0)

We are now ready to run the code. To do this, run the file that you just created with the Python interpreter on your Terminal:

$ python <your_file>.py

You will see the following output on your Terminal:

Mean = [  5.55111512e-17  -1.11022302e-16  -7.40148683e-17  -7.40148683e-17]
Std deviation = [ 1.  1.  1.  1.]

You can see that the mean is almost 0 and the standard deviation is 1.
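To make the computation concrete, here is a minimal dependency-free sketch of what mean removal does: each column has its mean subtracted and is then divided by its population standard deviation. It is written for Python 3, and the helper names (column_mean, column_std, standardize_columns) are made up for this illustration; it is not scikit-learn's implementation.

```python
import math

data = [[3, -1.5, 2, -5.4],
        [0, 4, -0.3, 2.1],
        [1, 3.3, -1.9, -4.3]]

def column_mean(col):
    return sum(col) / len(col)

def column_std(col):
    # Population standard deviation (divide by N, not N - 1)
    mean = column_mean(col)
    return math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))

def standardize_columns(rows):
    """Center every column on zero and scale it to unit standard deviation."""
    scaled_columns = []
    for col in zip(*rows):
        mean, std = column_mean(col), column_std(col)
        scaled_columns.append([(v - mean) / std for v in col])
    # Transpose back to one row per datapoint
    return [list(row) for row in zip(*scaled_columns)]

standardized = standardize_columns(data)
column_means = [column_mean(col) for col in zip(*standardized)]
column_stds = [column_std(col) for col in zip(*standardized)]
print("Mean =", column_means)
print("Std deviation =", column_stds)
```

The means come out as tiny floating-point values rather than exact zeros, which is why the output above shows numbers on the order of 1e-17.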


Scaling

The values of each feature in a datapoint can span very different ranges. So, sometimes it is important to scale them so that all the features are on a level playing field. Add the following lines to the file and run the code:

data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(data)
print "\nMin max scaled data =", data_scaled

After scaling, all the feature values range between the specified values. The output will be displayed, as follows:

Min max scaled data: 
[[ 1.          0.          1.          0.        ]
 [ 0.          1.          0.41025641  1.        ]
 [ 0.33333333  0.87272727  0.          0.14666667]]
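The scaled values follow from a simple per-column formula: subtract the column minimum and divide by the column range. The following Python 3 sketch reproduces the numbers above without any libraries; the function name min_max_scale is hypothetical, and the sketch assumes that no column is constant (otherwise the range would be zero).

```python
data = [[3, -1.5, 2, -5.4],
        [0, 4, -0.3, 2.1],
        [1, 3.3, -1.9, -4.3]]

def min_max_scale(rows, low=0.0, high=1.0):
    """Map every column linearly onto the interval [low, high]."""
    scaled_columns = []
    for col in zip(*rows):
        col_min, col_max = min(col), max(col)
        span = col_max - col_min  # assumed to be non-zero
        scaled_columns.append(
            [low + (high - low) * (v - col_min) / span for v in col])
    # Transpose back to one row per datapoint
    return [list(row) for row in zip(*scaled_columns)]

scaled = min_max_scale(data)
for row in scaled:
    print(row)
```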


Normalization

Data normalization is used when you want to adjust the values in the feature vector so that they can be measured on a common scale. One of the most common forms of normalization that is used in machine learning adjusts the values of a feature vector so that their absolute values sum up to 1. Add the following lines to the previous file:

data_normalized = preprocessing.normalize(data, norm='l1')
print "\nL1 normalized data =", data_normalized

If you run the Python file, you will get the following output:

L1 normalized data: 
[[ 0.25210084 -0.12605042  0.16806723 -0.45378151]
 [ 0.          0.625      -0.046875    0.328125  ]
 [ 0.0952381   0.31428571 -0.18095238 -0.40952381]]

This is used a lot to make sure that datapoints don't get boosted artificially due to the fundamental nature of their features.
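As a rough illustration of L1 normalization, the Python 3 sketch below divides every value in a row by the sum of the absolute values in that row. The helper name l1_normalize is an assumption for this example, and rows that contain only zeros are not handled.

```python
data = [[3, -1.5, 2, -5.4],
        [0, 4, -0.3, 2.1],
        [1, 3.3, -1.9, -4.3]]

def l1_normalize(rows):
    """Scale each row so that its absolute values sum to 1."""
    normalized = []
    for row in rows:
        norm = sum(abs(v) for v in row)  # L1 norm of the row
        normalized.append([v / norm for v in row])
    return normalized

normalized = l1_normalize(data)
for row in normalized:
    print(row, "-> sum of absolute values:", sum(abs(v) for v in row))
```

Note that normalization works on rows (datapoints), whereas mean removal and scaling work on columns (features).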


Binarization

Binarization is used when you want to convert your numerical feature vector into a Boolean vector. Values above the threshold become 1, and all other values become 0. Add the following lines to the Python file:

data_binarized = preprocessing.Binarizer(threshold=1.4).transform(data)
print "\nBinarized data =", data_binarized

Run the code again, and you will see the following output:

Binarized data:
[[ 1.  0.  1.  0.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  0.]]

This is a very useful technique that's usually used when we have some prior knowledge of the data.

One Hot Encoding

A lot of times, we deal with numerical values that are sparse and scattered all over the place. We don't really need to store these big values. This is where One Hot Encoding comes into picture. We can think of One Hot Encoding as a tool to tighten the feature vector. It looks at each feature and identifies the total number of distinct values. It uses a one-of-k scheme to encode the values. Each feature in the feature vector is encoded based on this. This helps us be more efficient in terms of space. For example, let's say we are dealing with 4-dimensional feature vectors. To encode the n-th feature in a feature vector, the encoder will go through the n-th feature in each feature vector and count the number of distinct values. If the number of distinct values is k, it will transform the feature into a k-dimensional vector where only one value is 1 and all other values are 0. Add the following lines to the Python file:

encoder = preprocessing.OneHotEncoder()
encoder.fit([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]])
encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print "\nEncoded vector =", encoded_vector

This is the expected output:

Encoded vector:
[[ 0.  0.  1.  0.  1.  0.  0.  0.  1.  1.  0.]]

In the above example, let's consider the third feature in each feature vector. The values are 1, 5, 2, and 4. There are four distinct values here, which means the one-hot encoded vector will be of length 4. The encoder sorts the distinct values, so the four positions correspond to 1, 2, 4, and 5. If you want to encode the value 5, it will be a vector [0, 0, 0, 1]. Only one value can be 1 in this vector. The fourth element is 1, which indicates that the value is 5. This matches the third group of values in the preceding output.
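The one-of-k scheme described here can be sketched in a few lines of plain Python 3. The function names fit_one_hot and one_hot_transform are hypothetical, and the sketch assumes that distinct values are ordered by sorting, which is consistent with the expected output shown earlier.

```python
def fit_one_hot(rows):
    """Collect the sorted distinct values of every feature (column)."""
    return [sorted(set(col)) for col in zip(*rows)]

def one_hot_transform(categories, row):
    """Encode each feature as a block with a single 1 at its value's position."""
    encoded = []
    for value, values_seen in zip(row, categories):
        encoded.extend(1.0 if value == v else 0.0 for v in values_seen)
    return encoded

training = [[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]]
categories = fit_one_hot(training)
encoded = one_hot_transform(categories, [2, 3, 5, 3])
print("Categories per feature =", categories)
print("Encoded vector =", encoded)
```

The four features have 3, 2, 4, and 2 distinct values respectively, so the encoded vector has 3 + 2 + 4 + 2 = 11 elements.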

Label encoding

In supervised learning, we usually deal with a variety of labels. These can be in the form of numbers or words. If they are numbers, then the algorithm can use them directly. However, a lot of times, labels need to be in human readable form. So, people usually label the training data with words. Label encoding refers to transforming the word labels into numerical form so that the algorithms can understand how to operate on them. Let's take a look at how to do this.

How to do it…

  1. Create a new Python file, and import the preprocessing package:

    from sklearn import preprocessing
  2. This package contains various functions that are needed for data preprocessing. Let's define the label encoder, as follows:

    label_encoder = preprocessing.LabelEncoder()
  3. The label_encoder object knows how to understand word labels. Let's create some labels:

    input_classes = ['audi', 'ford', 'audi', 'toyota', 'ford', 'bmw']
  4. We are now ready to encode these labels:

    label_encoder.fit(input_classes)
    print "\nClass mapping:"
    for i, item in enumerate(label_encoder.classes_):
        print item, '-->', i
  5. Run the code, and you will see the following output on your Terminal:

    Class mapping:
    audi --> 0
    bmw --> 1
    ford --> 2
    toyota --> 3
  6. As shown in the preceding output, the words have been transformed into 0-indexed numbers. Now, when you encounter a set of labels, you can simply transform them, as follows:

    labels = ['toyota', 'ford', 'audi']
    encoded_labels = label_encoder.transform(labels)
    print "\nLabels =", labels 
    print "Encoded labels =", list(encoded_labels)

    Here is the output that you'll see on your Terminal:

    Labels = ['toyota', 'ford', 'audi']
    Encoded labels = [3, 2, 0]
  7. This is way easier than manually maintaining mapping between words and numbers. You can check the correctness by transforming numbers back to word labels:

    encoded_labels = [2, 1, 0, 3, 1]
    decoded_labels = label_encoder.inverse_transform(encoded_labels)
    print "\nEncoded labels =", encoded_labels
    print "Decoded labels =", list(decoded_labels)

    Here is the output:

    Encoded labels = [2, 1, 0, 3, 1]
    Decoded labels = ['ford', 'bmw', 'audi', 'toyota', 'bmw']

    As you can see, the mapping is preserved perfectly.
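Conceptually, the label encoder keeps a sorted list of the distinct labels and uses each label's position in that list as its code. The Python 3 sketch below mimics this with a plain list and dictionary; the variable names are chosen for illustration and are not part of scikit-learn.

```python
input_classes = ['audi', 'ford', 'audi', 'toyota', 'ford', 'bmw']

# Sorted distinct labels; the index of each label is its numeric code
classes = sorted(set(input_classes))
class_to_index = {name: i for i, name in enumerate(classes)}

# Encode words as numbers
encoded_labels = [class_to_index[name] for name in ['toyota', 'ford', 'audi']]

# Decode numbers back into words
decoded_labels = [classes[i] for i in [2, 1, 0, 3, 1]]

print("Class mapping =", class_to_index)
print("Encoded labels =", encoded_labels)
print("Decoded labels =", decoded_labels)
```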

Building a linear regressor

Regression is the process of estimating the relationship between input data and the continuous-valued output data. This data is usually in the form of real numbers, and our goal is to estimate the underlying function that governs the mapping from the input to the output. Let's start with a very simple example. Consider the following mapping between input and output:

1 --> 2

3 --> 6

4.3 --> 8.6

7.1 --> 14.2

If I ask you to estimate the relationship between the inputs and the outputs, you can easily do this by analyzing the pattern. We can see that the output is twice the input value in each case, so the transformation would be as follows:

f(x) = 2x

This is a simple function, relating the input values with the output values. However, in the real world, this is usually not the case. Functions in the real world are not so straightforward!

Getting ready

Linear regression refers to estimating the underlying function using a linear combination of input variables. The preceding example was an example that consisted of one input variable and one output variable.

Consider the following figure:

The goal of linear regression is to extract the underlying linear model that relates the input variable to the output variable. This aims to minimize the sum of squares of differences between the actual output and the predicted output using a linear function. This method is called Ordinary least squares.

You might say that there might be a curvy line out there that fits these points better, but linear regression doesn't allow this. The main advantage of linear regression is that it's not complex. If you go into nonlinear regression, you may get more accurate models, but they will be slower. As shown in the preceding figure, the model tries to approximate the input datapoints using a straight line. Let's see how to build a linear regression model in Python.
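For a single input variable, ordinary least squares has a well-known closed-form solution: the slope is the covariance of the input and output divided by the variance of the input, and the intercept follows from the means. The Python 3 sketch below applies it to the four input and output pairs listed earlier in this recipe; the function name fit_line is an assumption for this example.

```python
def fit_line(xs, ys):
    """Return the least-squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 3.0, 4.3, 7.1]
ys = [2.0, 6.0, 8.6, 14.2]
slope, intercept = fit_line(xs, ys)
print("slope =", slope, "intercept =", intercept)
```

Because the data follows f(x) = 2x exactly, the fitted slope is 2 and the intercept is 0, up to floating-point rounding.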

How to do it…

You have been provided with a data file, called data_singlevar.txt. This contains comma-separated lines where the first element is the input value and the second element is the output value that corresponds to this input value. You should use this as the input argument:

  1. Create a new Python file, and add the following lines:

    import sys
    import numpy as np
    filename = sys.argv[1]
    X = []
    y = []
    with open(filename, 'r') as f:
        for line in f.readlines():
            xt, yt = [float(i) for i in line.split(',')]
            X.append(xt)
            y.append(yt)

    We just loaded the input data into X and y, where X refers to data and y refers to labels. Inside the loop in the preceding code, we parse each line and split it at the comma delimiter. We then convert the two parts into floating point values and save them in X and y, respectively.

  2. When we build a machine learning model, we need a way to validate our model and check whether the model is performing at a satisfactory level. To do this, we need to separate our data into two groups: a training dataset and a testing dataset. The training dataset will be used to build the model, and the testing dataset will be used to see how this trained model performs on unknown data. So, let's go ahead and split this data into training and testing datasets:

    num_training = int(0.8 * len(X))
    num_test = len(X) - num_training
    # Training data
    X_train = np.array(X[:num_training]).reshape((num_training,1))
    y_train = np.array(y[:num_training])
    # Test data
    X_test = np.array(X[num_training:]).reshape((num_test,1))
    y_test = np.array(y[num_training:])

    Here, we will use 80% of the data for the training dataset and the remaining 20% for the testing dataset.

  3. We are now ready to train the model. Let's create a regressor object, as follows:

    from sklearn import linear_model
    # Create linear regression object
    linear_regressor = linear_model.LinearRegression()
    # Train the model using the training sets
    linear_regressor.fit(X_train, y_train)
  4. We just trained the linear regressor, based on our training data. The fit method takes the input data and trains the model. Let's see how it fits:

    import matplotlib.pyplot as plt
    y_train_pred = linear_regressor.predict(X_train)
    plt.scatter(X_train, y_train, color='green')
    plt.plot(X_train, y_train_pred, color='black', linewidth=4)
    plt.title('Training data')
    plt.show()
  5. We are now ready to run the code using the following command:

    $ python <your_file>.py data_singlevar.txt

    You should see the following figure:

  6. In the preceding code, we used the trained model to predict the output for our training data. This wouldn't tell us how the model performs on unknown data because we are running it on training data itself. This just gives us an idea of how the model fits on training data. Looks like it's doing okay as you can see in the preceding figure!

  7. Let's predict the test dataset output based on this model and plot it, as follows:

    y_test_pred = linear_regressor.predict(X_test)
    plt.scatter(X_test, y_test, color='green')
    plt.plot(X_test, y_test_pred, color='black', linewidth=4)
    plt.title('Test data')
    plt.show()

    If you run this code, you will see a graph like the following one:

Computing regression accuracy

Now that we know how to build a regressor, it's important to understand how to evaluate the quality of a regressor as well. In this context, an error is defined as the difference between the actual value and the value that is predicted by the regressor.

Getting ready

Let's quickly understand what metrics can be used to measure the quality of a regressor. A regressor can be evaluated using many different metrics, such as the following:

  • Mean absolute error: This is the average of absolute errors of all the datapoints in the given dataset.

  • Mean squared error: This is the average of the squares of the errors of all the datapoints in the given dataset. It is one of the most popular metrics out there!

  • Median absolute error: This is the median of the absolute errors of all the datapoints in the given dataset. The main advantage of this metric is that it's robust to outliers. A single bad point in the test dataset wouldn't skew the entire error metric, as opposed to a mean error metric.

  • Explained variance score: This score measures how well our model can account for the variation in our dataset. A score of 1.0 indicates that our model is perfect.

  • R2 score: This is pronounced as R-squared, and this score refers to the coefficient of determination. This tells us how well the unknown samples will be predicted by our model. The best possible score is 1.0, and the values can be negative as well.
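The definitions above translate directly into code. The following Python 3 sketch computes all five metrics with the standard library only; the function name regression_metrics and the four sample values are made up for illustration, and the formulas are the usual textbook ones rather than a copy of any library's implementation.

```python
import statistics

def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from actual and predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    n = len(errors)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        'mean_absolute_error': sum(abs_errors) / n,
        'mean_squared_error': ss_res / n,
        'median_absolute_error': statistics.median(abs_errors),
        'explained_variance': 1 - statistics.pvariance(errors) / statistics.pvariance(y_true),
        'r2': 1 - ss_res / ss_tot,
    }

# Illustrative values only
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, "=", round(value, 3))
```

The explained variance score and the R2 score differ only when the errors have a non-zero mean: R2 penalizes a systematic offset in the predictions, whereas the explained variance score does not.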

How to do it…

There is a module in scikit-learn that provides functions to compute all the preceding metrics. Add the following lines to the file from the previous recipe, because they reuse the y_test and y_test_pred variables:

import sklearn.metrics as sm

print "Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred), 2) 
print "Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred), 2) 
print "Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred), 2) 
print "Explained variance score =", round(sm.explained_variance_score(y_test, y_test_pred), 2) 
print "R2 score =", round(sm.r2_score(y_test, y_test_pred), 2)

Keeping track of every single metric can get tedious, so we pick one or two metrics to evaluate our model. A good practice is to make sure that the mean squared error is low and the explained variance score is high.

Achieving model persistence

When we train a model, it would be nice if we could save it as a file so that it can be used later by simply loading it again.

How to do it…

Let's see how to achieve model persistence programmatically:

  1. Add the following lines to the file from the previous recipe:

    import cPickle as pickle
    output_model_file = 'saved_model.pkl'
    # Pickle files contain binary data, so open them in binary mode
    with open(output_model_file, 'wb') as f:
        pickle.dump(linear_regressor, f)
  2. The regressor object will be saved in the saved_model.pkl file. Let's look at how to load it and use it, as follows:

    with open(output_model_file, 'rb') as f:
        model_linregr = pickle.load(f)
    y_test_pred_new = model_linregr.predict(X_test)
    print "\nNew mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred_new), 2)
  3. Here, we just loaded the regressor from the file into the model_linregr variable. You can compare the preceding result with the earlier result to confirm that it's the same.
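The same save-and-load round trip can be tried in isolation. The sketch below is written for Python 3, where the module is simply called pickle; it uses a plain dictionary with made-up values as a stand-in for a trained regressor and writes to a temporary directory, both of which are assumptions for this illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model (hypothetical values)
model = {'slope': 2.0, 'intercept': 0.0}

model_path = os.path.join(tempfile.mkdtemp(), 'saved_model.pkl')

# Save the object; pickle data is binary, hence the 'wb' mode
with open(model_path, 'wb') as f:
    pickle.dump(model, f)

# Load it back; again in binary mode
with open(model_path, 'rb') as f:
    restored = pickle.load(f)

print("Restored object =", restored)
```

Only load pickle files that come from a source you trust, because unpickling can execute arbitrary code.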

Building a ridge regressor

One of the main problems of linear regression is that it's sensitive to outliers. During data collection in the real world, it's quite common to wrongly measure the output. Linear regression uses ordinary least squares, which tries to minimize the squares of errors. The outliers tend to cause problems because they contribute a lot to the overall error. This tends to disrupt the entire model.

Getting ready

Let's consider the following figure:

The two points on the bottom are clearly outliers, but this model is trying to fit all the points. Hence, the overall model tends to be inaccurate. By visual inspection, we can see that the following figure is a better model:

Ordinary least squares considers every single datapoint when it's building the model. Hence, the actual model ends up looking like the dotted line as shown in the preceding figure. We can clearly see that this model is suboptimal. To avoid this, we use regularization where a penalty is imposed on the size of the coefficients. This method is called Ridge Regression.

How to do it…

Let's see how to build a ridge regressor in Python:

  1. You can load the data from the data_multi_variable.txt file. This file contains multiple values in each line. All the values except the last value form the input feature vector.

  2. Add the following lines to your Python file. Let's initialize a ridge regressor with some parameters:

    ridge_regressor = linear_model.Ridge(alpha=0.01, fit_intercept=True, max_iter=10000)
  3. The alpha parameter controls the complexity. As alpha gets closer to 0, the ridge regressor tends to become more like a linear regressor with ordinary least squares. So, if you want to make it robust against outliers, you need to assign a higher value to alpha. We considered a value of 0.01, which is moderate.

  4. Let's train this regressor, as follows:

    ridge_regressor.fit(X_train, y_train)
    y_test_pred_ridge = ridge_regressor.predict(X_test)
    print "Mean absolute error =", round(sm.mean_absolute_error(y_test, y_test_pred_ridge), 2) 
    print "Mean squared error =", round(sm.mean_squared_error(y_test, y_test_pred_ridge), 2) 
    print "Median absolute error =", round(sm.median_absolute_error(y_test, y_test_pred_ridge), 2) 
    print "Explained variance score =", round(sm.explained_variance_score(y_test, y_test_pred_ridge), 2) 
    print "R2 score =", round(sm.r2_score(y_test, y_test_pred_ridge), 2)

Run this code to view the error metrics. You can build a linear regressor to compare and contrast the results on the same data to see the effect of introducing regularization into the model.
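To see how the penalty acts, consider a single input feature with an unpenalized intercept. In that case, minimizing the sum of squared errors plus alpha times the squared slope has a closed-form solution in which alpha is simply added to the denominator of the ordinary least squares slope. The Python 3 sketch below is a simplified illustration under that assumption; the function name fit_ridge_line and the sample data are chosen for this example.

```python
def fit_ridge_line(xs, ys, alpha):
    """Single-feature ridge fit: alpha shrinks the slope toward zero."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)   # alpha = 0 gives ordinary least squares
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 3.0, 4.3, 7.1]
ys = [2.0, 6.0, 8.6, 14.2]
slopes = {alpha: fit_ridge_line(xs, ys, alpha)[0] for alpha in (0.0, 0.01, 10.0)}
for alpha, slope in slopes.items():
    print("alpha =", alpha, "slope =", round(slope, 4))
```

With alpha set to 0, the result is identical to ordinary least squares; as alpha grows, the slope is pulled toward zero, which is the penalty on the size of the coefficients in action.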

Building a polynomial regressor

One of the main constraints of a linear regression model is the fact that it tries to fit a linear function to the input data. The polynomial regression model overcomes this issue by allowing the function to be a polynomial, thereby increasing the accuracy of the model.

Getting ready

Let's consider the following figure:

We can see that there is a natural curve to the pattern of datapoints. This linear model is unable to capture this. Let's see what a polynomial model would look like:

The dotted line represents the linear regression model, and the solid line represents the polynomial regression model. The curviness of this model is controlled by the degree of the polynomial. As the curviness of the model increases, it fits the training data more closely. However, curviness adds complexity to the model as well, hence, making it slower. This is a trade-off where you have to decide how accurate you want your model to be, given the computational constraints.

How to do it…

  1. Add the following lines to your Python file:

    from sklearn.preprocessing import PolynomialFeatures
    polynomial = PolynomialFeatures(degree=3)
  2. We initialized a polynomial of the degree 3 in the previous line. Now we have to represent the datapoints in terms of the polynomial terms, that is, the powers and products of the input features up to this degree:

    X_train_transformed = polynomial.fit_transform(X_train)

    Here, X_train_transformed represents the same input in the polynomial form.

  3. Let's consider the first datapoint in our file and check whether it can predict the right output:

    # scikit-learn expects a 2D array with one row per datapoint
    datapoint = [[0.39, 2.78, 7.11]]
    poly_datapoint = polynomial.fit_transform(datapoint)
    poly_linear_model = linear_model.LinearRegression()
    poly_linear_model.fit(X_train_transformed, y_train)
    print "\nLinear regression:", linear_regressor.predict(datapoint)[0]
    print "\nPolynomial regression:", poly_linear_model.predict(poly_datapoint)[0]

    The values in the variable datapoint are the values in the first line in the input data file. We are still fitting a linear regression model here. The only difference is in the way in which we represent the data. If you run this code, you will see the following output:

    Linear regression: -11.0587294983
    Polynomial regression: -10.9480782122

    As you can see, this is close to the output value. If we want it to get closer, we need to increase the degree of the polynomial.

  4. Let's make it 10 and see what happens:

    polynomial = PolynomialFeatures(degree=10)

    You should see something like the following:

    Polynomial regression: -8.20472183853

Now, you can see that the predicted value is much closer to the actual output value.
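It can help to see what the polynomial representation of a datapoint looks like. The Python 3 sketch below expands a feature vector into all products of its features up to a given degree, including the constant term. The function name polynomial_features is hypothetical, and the ordering of the terms (by degree, then by feature index) is an assumption about how such expansions are commonly laid out.

```python
from itertools import combinations_with_replacement

def polynomial_features(row, degree):
    """Return all monomials of the features in row up to the given degree."""
    features = []
    for d in range(degree + 1):
        # Each combination of d feature indices yields one monomial
        for combo in combinations_with_replacement(range(len(row)), d):
            term = 1.0
            for index in combo:
                term *= row[index]
            features.append(term)
    return features

# Two features a = 2 and b = 3 with degree 2: [1, a, b, a^2, a*b, b^2]
small_example = polynomial_features([2.0, 3.0], 2)
print(small_example)

# Three features with degree 3, as in this recipe
expanded = polynomial_features([0.39, 2.78, 7.11], 3)
print("Number of polynomial terms =", len(expanded))
```

A linear regressor that is fitted on these expanded features is still linear in its coefficients, which is why the recipe can reuse LinearRegression. The number of terms grows quickly with the degree, so a high degree such as 10 makes the model much more flexible and also more prone to fitting noise in the training data.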

Estimating housing prices

It's time to apply our knowledge to a real world problem. Let's apply all these principles to estimate the housing prices. This is one of the most popular examples that is used to understand regression, and it serves as a good entry point. This is intuitive and relatable, hence making it easier to understand concepts before we perform more complex things in machine learning. We will use a decision tree regressor with AdaBoost to solve this problem.

Getting ready

A decision tree is a tree where each node makes a simple decision that contributes to the final output. The leaf nodes represent the output values, and the branches represent the intermediate decisions that were made, based on input features. AdaBoost stands for Adaptive Boosting, and this is a technique that is used to boost the accuracy of the results from another system. This combines the outputs from different versions of the algorithms, called weak learners, using a weighted summation to get the final output. The information that's collected at each stage of the AdaBoost algorithm is fed back into the system so that the learners at the latter stages focus on training samples that are difficult to classify. This is the way it increases the accuracy of the system.

Using AdaBoost, we fit a regressor on the dataset. We compute the error and then fit the regressor on the same dataset again, based on this error estimate. We can think of this as fine-tuning of the regressor until the desired accuracy is achieved. You are given a dataset that contains various parameters that affect the price of a house. Our goal is to estimate the relationship between these parameters and the house price so that we can use this to estimate the price given unknown input parameters.

How to do it…

  1. Create a new Python file (a later recipe imports from it as the housing module), and add the following lines:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import AdaBoostRegressor
    from sklearn import datasets
    from sklearn.metrics import mean_squared_error, explained_variance_score
    from sklearn.utils import shuffle
    import matplotlib.pyplot as plt
  2. There is a standard housing dataset that people tend to use to get started with machine learning. It is available for download, but the good thing is that scikit-learn provides a function to directly load this dataset (note that load_boston was removed in scikit-learn 1.2, so this call needs an earlier release):

    housing_data = datasets.load_boston() 

    Each datapoint has 13 input parameters that affect the price of the house. You can access the input data using housing_data.data and the corresponding price using housing_data.target.

  3. Let's separate this into input and output. To make this independent of the ordering of the data, let's shuffle it as well:

    X, y = shuffle(housing_data.data, housing_data.target, random_state=7)
  4. The random_state parameter controls how we shuffle the data so that we can have reproducible results. Let's divide the data into training and testing. We'll allocate 80% for training and 20% for testing:

    num_training = int(0.8 * len(X))
    X_train, y_train = X[:num_training], y[:num_training]
    X_test, y_test = X[num_training:], y[num_training:]
  5. We are now ready to fit a decision tree regression model. Let's pick a tree with a maximum depth of 4, which means that we are not letting the tree become arbitrarily deep:

    dt_regressor = DecisionTreeRegressor(max_depth=4)
    dt_regressor.fit(X_train, y_train)
  6. Let's also fit decision tree regression model with AdaBoost:

    ab_regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=400, random_state=7)
    ab_regressor.fit(X_train, y_train)

    This will help us compare the results and see how AdaBoost really boosts the performance of a decision tree regressor.

  7. Let's evaluate the performance of decision tree regressor:

    y_pred_dt = dt_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred_dt)
    evs = explained_variance_score(y_test, y_pred_dt) 
    print "\n#### Decision Tree performance ####"
    print "Mean squared error =", round(mse, 2)
    print "Explained variance score =", round(evs, 2)
  8. Now, let's evaluate the performance of AdaBoost:

    y_pred_ab = ab_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred_ab)
    evs = explained_variance_score(y_test, y_pred_ab) 
    print "\n#### AdaBoost performance ####"
    print "Mean squared error =", round(mse, 2)
    print "Explained variance score =", round(evs, 2)

Here is the output on the Terminal:

#### Decision Tree performance ####
Mean squared error = 14.79
Explained variance score = 0.82

#### AdaBoost performance ####
Mean squared error = 7.54
Explained variance score = 0.91

The error is lower and the variance score is closer to 1 when we use AdaBoost as shown in the preceding output.

Computing the relative importance of features

Are all the features equally important? In this case, we used 13 input features, and they all contributed to the model. However, an important question here is, "How do we know which features are more important?" Obviously, not all the features contribute equally to the output. In case we want to discard some of them later, we need to know which features are less important. We have this functionality available in scikit-learn.

How to do it…

  1. Let's plot the relative importance of the features. Add the following lines to your Python file:

    plot_feature_importances(dt_regressor.feature_importances_,
            'Decision Tree regressor', housing_data.feature_names)
    plot_feature_importances(ab_regressor.feature_importances_,
            'AdaBoost regressor', housing_data.feature_names)

    The regressor object has a feature_importances_ attribute that gives us the relative importance of each feature.

  2. We actually need to define our plot_feature_importances function to plot the bar graphs:

    def plot_feature_importances(feature_importances, title, feature_names):
        # Normalize the importance values 
        feature_importances = 100.0 * (feature_importances / max(feature_importances))
        # Sort the index values and flip them so that they are arranged in decreasing order of importance
        index_sorted = np.flipud(np.argsort(feature_importances))
        # Center the location of the labels on the X-axis (for display purposes only)
        pos = np.arange(index_sorted.shape[0]) + 0.5
        # Plot the bar graph
        plt.figure()
        plt.bar(pos, feature_importances[index_sorted], align='center')
        plt.xticks(pos, feature_names[index_sorted])
        plt.ylabel('Relative Importance')
        plt.title(title)
        plt.show()
  3. We just take the values from the feature_importances_ attribute and scale them so that they range between 0 and 100. If you run the preceding code, you will see two figures. Let's see what we will get for a decision tree-based regressor in the following figure:

  4. So, the decision tree regressor says that the most important feature is RM. Let's take a look at what AdaBoost has to say in the following figure:

According to AdaBoost, the most important feature is LSTAT. In reality, if you build various regressors on this data, you will see that the most important feature is in fact LSTAT. This shows the advantage of using AdaBoost with a decision tree-based regressor.
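The scaling and sorting that the plotting function performs can be tried without matplotlib or a trained model. The Python 3 sketch below uses made-up importance values and placeholder feature names purely for illustration; the function name rank_importances is also hypothetical.

```python
def rank_importances(importances, names):
    """Scale importances so the largest is 100 and sort in decreasing order."""
    top = max(importances)
    scaled = [100.0 * value / top for value in importances]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    return [(names[i], scaled[i]) for i in order]

# Hypothetical importance values, not taken from any trained regressor
ranked = rank_importances([0.1, 0.5, 0.25], ['A', 'B', 'C'])
for name, score in ranked:
    print(name, round(score, 1))
```

Dividing by the maximum means that the most important feature always scores 100, so the bars show importance relative to the top feature rather than as a share of the total.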

Estimating bicycle demand distribution

Let's use a different regression method to solve the bicycle demand distribution problem. We will use the random forest regressor to estimate the output values. A random forest is a collection of decision trees. This basically uses a set of decision trees that are built using various subsets of the dataset, and then it uses averaging to improve the overall performance.

Getting ready

We will use the bike_day.csv file that is provided to you. This dataset is also available online. There are 16 columns in this dataset. The first two columns correspond to the serial number and the actual date, so we won't use them for our analysis. The last three columns correspond to different types of outputs. The last column is just the sum of the values in the fourteenth and fifteenth columns, so we can leave these two out when we build our model.
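The column layout described here maps onto simple slices: with 16 columns, the input features are columns 3 to 13 (the slice 2:13 with zero-based indexing), and the output is the last column. The Python 3 sketch below demonstrates the slicing on a tiny in-memory CSV; the header names and the single data row are invented for illustration and are not taken from the real file.

```python
import csv
import io

# Invented 16-column layout: serial, date, 11 features, 3 outputs
header = ['serial', 'date'] + ['f%d' % i for i in range(1, 12)] + ['out1', 'out2', 'total']
sample_row = ['1', '2011-01-01'] + [str(i) for i in range(1, 12)] + ['3', '4', '7']
csv_text = '\n'.join(','.join(row) for row in (header, sample_row))

reader = csv.reader(io.StringIO(csv_text))
names = next(reader)            # first line holds the column names
feature_names = names[2:13]     # skip serial number and date, stop before outputs

X, y = [], []
for row in reader:
    X.append([float(v) for v in row[2:13]])  # the 11 input features
    y.append(float(row[-1]))                 # last column is the total output

print("Feature names =", feature_names)
print("X =", X)
print("y =", y)
```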

How to do it…

Let's go ahead and see how to do this in Python. You have been provided with a file that contains the full code. We will discuss the important parts of this, as follows:

  1. We first need to import a couple of new packages, as follows:

    import csv
    from sklearn.ensemble import RandomForestRegressor
    from housing import plot_feature_importances
  2. We are processing a CSV file, so the csv package is useful for handling it. As it's a new dataset, we will have to define our own dataset loading function:

    def load_dataset(filename):
        file_reader = csv.reader(open(filename, 'rb'), delimiter=',')
        X, y = [], []
        for row in file_reader:
            # Features: skip the first two columns and the three output columns
            X.append(row[2:13])
            # Output: the last column
            y.append(row[-1])

        # Extract feature names
        feature_names = np.array(X[0])

        # Remove the first row because they are feature names
        return np.array(X[1:]).astype(np.float32), np.array(y[1:]).astype(np.float32), feature_names

    In this function, we just read all the data from the CSV file. The feature names are useful when we display them on a graph. We separate the data from the output values and return them.

  3. Let's read the data and shuffle it to make it independent of the order in which the data is arranged in the file:

    X, y, feature_names = load_dataset(sys.argv[1])
    X, y = shuffle(X, y, random_state=7)  
  4. As we did earlier, we need to separate the data into training and testing. This time, let's use 90% of the data for training and the remaining 10% for testing:

    num_training = int(0.9 * len(X))
    X_train, y_train = X[:num_training], y[:num_training]
    X_test, y_test = X[num_training:], y[num_training:]
  5. Let's go ahead and train the regressor:

    rf_regressor = RandomForestRegressor(n_estimators=1000, max_depth=10, min_samples_split=1)
    rf_regressor.fit(X_train, y_train)

    Here, n_estimators refers to the number of estimators, which is the number of decision trees that we want to use in our random forest. The max_depth parameter refers to the maximum depth of each tree, and the min_samples_split parameter refers to the minimum number of data samples that are needed to split a node in the tree.

  6. Let's evaluate the performance of the random forest regressor:

    y_pred = rf_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    evs = explained_variance_score(y_test, y_pred) 
    print "\n#### Random Forest regressor performance ####"
    print "Mean squared error =", round(mse, 2)
    print "Explained variance score =", round(evs, 2)
  7. As we already have the function to plot the feature importances, let's just call it directly:

    plot_feature_importances(rf_regressor.feature_importances_, 'Random Forest regressor', feature_names)

    Once you run this code, you will see the following graph:

It looks like temperature is the most important factor controlling bicycle rentals.
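
The steps above are written for Python 2. As a rough sketch, the same flow could look as follows on Python 3. Several things here are assumptions rather than part of the recipe: the data is a synthetic 16-column CSV built in memory by the hypothetical make_fake_csv helper (with placeholder column names), load_dataset takes a file object instead of a filename, n_estimators is lowered to 100 to keep the run short, and min_samples_split is set to 2 because recent scikit-learn releases reject a value of 1.

```python
import csv
import io

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.utils import shuffle


def load_dataset(file_obj):
    """Read a 16-column CSV: features from columns 3-13, output from the last column."""
    X, y = [], []
    for row in csv.reader(file_obj, delimiter=','):
        X.append(row[2:13])
        y.append(row[-1])
    feature_names = np.array(X[0])  # the first row holds the column names
    return (np.array(X[1:]).astype(np.float32),
            np.array(y[1:]).astype(np.float32),
            feature_names)


def make_fake_csv(num_rows=300, seed=7):
    """Hypothetical helper: build a synthetic stand-in for the real file."""
    rng = np.random.RandomState(seed)
    header = (['id', 'date'] + ['f%d' % i for i in range(11)]
              + ['out_a', 'out_b', 'total'])
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    for i in range(num_rows):
        features = rng.rand(11)
        out_a = int(100 * features[0])
        out_b = int(300 * features[7] + 50 * features[3])
        writer.writerow([i + 1, '2011-01-01'] + features.round(4).tolist()
                        + [out_a, out_b, out_a + out_b])
    buffer.seek(0)
    return buffer


X, y, feature_names = load_dataset(make_fake_csv())
X, y = shuffle(X, y, random_state=7)

# 90% of the rows for training, the remaining 10% for testing
num_training = int(0.9 * len(X))
X_train, y_train = X[:num_training], y[:num_training]
X_test, y_test = X[num_training:], y[num_training:]

rf_regressor = RandomForestRegressor(n_estimators=100, max_depth=10,
                                     min_samples_split=2, random_state=7)
rf_regressor.fit(X_train, y_train)

y_pred = rf_regressor.predict(X_test)
print("Mean squared error =", round(mean_squared_error(y_test, y_pred), 2))
print("Explained variance score =", round(explained_variance_score(y_test, y_pred), 2))
```

With a real file, load_dataset could be called as load_dataset(open(filename, newline='')), which is how the csv module expects text files to be opened on Python 3.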

There's more…

Let's see what happens when you include the fourteenth and fifteenth columns in the dataset. In the feature importance graph, the importance of every feature other than these two should drop to almost zero. The reason is that the output can be obtained by simply summing the fourteenth and fifteenth columns, so the algorithm doesn't need any other features to compute the output. In the load_dataset function, widen the feature selection inside the for loop so that it also includes these two columns.
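
A hedged sketch of this change, assuming the loop picks the features with a zero-based slice such as row[2:13]. Both that baseline slice and the placeholder column names are assumptions made for illustration. Under that assumption, widening the slice to row[2:15] brings the fourteenth and fifteenth columns into the feature set:

```python
# Hypothetical 16-column header row (placeholder names, not taken from the file)
header = ['col%d' % i for i in range(1, 17)]

# Assumed baseline: features are columns 3 to 13
features_without_outputs = header[2:13]

# Widened slice: columns 3 to 15, which now includes columns 14 and 15
features_with_outputs = header[2:15]

# The output is still the last column in both cases
target = header[-1]

print(features_with_outputs[-2:], target)
```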


If you plot the feature importance graph now, you will see the following:

As expected, it says that only these two features are important. This makes sense intuitively because the final output is a simple summation of these two features. So, there is a direct relationship between these two variables and the output value. Hence, the regressor says that it doesn't need any other variable to predict the output. This is an extremely useful tool to eliminate redundant variables in your dataset.
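
The effect can be reproduced on a small synthetic example. In this sketch (random data, not the bicycle dataset), the output is exactly the sum of the last two of five features, and the random forest's importances are expected to concentrate on those two columns:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: five features, output = sum of the last two
rng = np.random.RandomState(7)
X = rng.rand(500, 5)
y = X[:, 3] + X[:, 4]

forest = RandomForestRegressor(n_estimators=50, random_state=7)
forest.fit(X, y)

importances = forest.feature_importances_
for index, value in enumerate(importances):
    print("feature %d: %.3f" % (index, value))

# Share of the total importance captured by the two summed columns
leaked_share = importances[3] + importances[4]
print("share of the two summed features: %.3f" % leaked_share)
```

The remaining features usually keep a small, non-zero importance because deep trees occasionally split on noise, so "zero" should be read as "close to zero" in practice.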

There is another file called bike_hour.csv that contains data about how the bicycles are shared hourly. For this file, we need to consider columns 3 to 14, so the column selection inside the load_dataset function has to be changed accordingly.
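
As a small illustration of that selection, columns 3 to 14 (counted from 1) correspond to the zero-based slice row[2:14]. The select_columns helper, the 17-column row width, and the placeholder values below are all assumptions made for this sketch.

```python
def select_columns(row, first, last):
    """Hypothetical helper: return the 1-indexed columns first..last (inclusive)."""
    return row[first - 1:last]


# Placeholder row; the width of the real hourly file is an assumption here
row = ['c%d' % i for i in range(1, 18)]

features = select_columns(row, 3, 14)  # equivalent to row[2:14]
target = row[-1]

print(len(features), features[0], features[-1], target)
```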


If you run this, you will see the performance of the regressor displayed, as follows:

#### Random Forest regressor performance ####
Mean squared error = 2619.87
Explained variance score = 0.92
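
For readers who want to see what these two numbers measure, here is a minimal NumPy sketch of both metrics for a single output. The function names ending in _np are hypothetical, and the sample values are made up; the formulas are intended to mirror what mean_squared_error and explained_variance_score compute in scikit-learn.

```python
import numpy as np


def mean_squared_error_np(y_true, y_pred):
    """Average of the squared prediction errors (lower is better)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def explained_variance_np(y_true, y_pred):
    """1 - Var(errors) / Var(outputs); 1.0 is the best possible score."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(1.0 - np.var(y_true - y_pred) / np.var(y_true))


# Made-up values for illustration
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error_np(y_true, y_pred)
evs = explained_variance_np(y_true, y_pred)
print("Mean squared error =", round(mse, 3))
print("Explained variance score =", round(evs, 3))
```

Note that the mean squared error is expressed in squared units of the output, which is why it can be large even when the explained variance score is close to 1.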

The feature importance graph will look like the following:

This shows that the hour of the day is the most important feature, which makes sense intuitively if you think about it! The next important feature is temperature, which is consistent with our earlier analysis.


Key benefits

  • Understand which algorithms to use in a given context with the help of this exciting recipe-based guide
  • Learn about perceptrons and see how they are used to build neural networks
  • Stuck while making sense of images, text, speech, and real estate? This guide will come to your rescue, showing you how to perform machine learning for each one of these using various techniques


Machine learning is becoming increasingly pervasive in the modern data-driven world. It is used extensively across many fields such as search engines, robotics, self-driving cars, and more. With this book, you will learn how to perform various machine learning tasks in different environments. We’ll start by exploring a range of real-life scenarios where machine learning can be used, and look at various building blocks. Throughout the book, you’ll use a wide variety of machine learning algorithms to solve real-world problems and use Python to implement these algorithms. You’ll discover how to deal with various types of data and explore the differences between machine learning paradigms such as supervised and unsupervised learning. We also cover a range of regression techniques, classification algorithms, predictive modeling, data visualization techniques, recommendation engines, and more with the help of real-world examples.

What you will learn

  • Explore classification algorithms and apply them to the income bracket estimation problem
  • Use predictive modeling and apply it to real-world problems
  • Understand how to perform market segmentation using unsupervised learning
  • Explore data visualization techniques to interact with your data in diverse ways
  • Find out how to build a recommendation engine
  • Understand how to interact with text data and build models to analyze it
  • Work with speech data and recognize spoken words using Hidden Markov Models
  • Analyze stock market data using Conditional Random Fields
  • Work with image data and build systems for image recognition and biometric face recognition
  • Grasp how to use deep neural networks to build an optical character recognition system

Product Details

Publication date : Jun 23, 2016
Length 304 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781786464477

Table of Contents

19 Chapters
Python Machine Learning Cookbook
Credits
About the Author
About the Reviewer
Preface
1. The Realm of Supervised Learning
2. Constructing a Classifier
3. Predictive Modeling
4. Clustering with Unsupervised Learning
5. Building Recommendation Engines
6. Analyzing Text Data
7. Speech Recognition
8. Dissecting Time Series and Sequential Data
9. Image Content Analysis
10. Biometric Face Recognition
11. Deep Neural Networks
12. Visualizing Data
Index
