
You're reading from Regression Analysis with Python

Product type Book

Published in Feb 2016

Publisher

ISBN-13 9781785286315

Pages 312 pages

Edition 1st Edition

Languages

Python

Concepts

Statistics

Authors (2):

Luca Massaron

Alberto Boschetti


Table of Contents (16) Chapters

Regression Analysis with Python

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

1. Regression – The Workhorse of Data Science

2. Approaching Simple Linear Regression

3. Multiple Regression in Action

4. Logistic Regression

5. Data Preparation

6. Achieving Generalization

7. Online and Batch Learning

8. Advanced Regression Methods

9. Real-world Applications for Regression Models

Index

Chapter 2. Approaching Simple Linear Regression

Having set up all your working tools (directly installing Python and IPython or using a scientific distribution), you are now ready to start using linear models to incorporate new abilities into the software you plan to build, especially predictive capabilities. Up to now, you have developed software solutions based on certain specifications you defined (or specifications that others have handed to you). Your approach has always been to tailor the response of the program to particular inputs, by writing code carefully mapping every single situation to a specific, predetermined response. Reflecting on it, by doing so you were just incorporating practices that you (or others) have learned from experience.

However, the world is complex, and sometimes your experience is not enough to make your software smart enough to make a difference in a fairly competitive business or in challenging problems with many different and mutable facets.

In this chapter...

Defining a regression problem

Thanks to machine learning algorithms, it is possible to derive knowledge from data. Machine learning has solid roots in decades of research: it has been a long journey since the end of the 1950s, when Arthur Samuel described machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed."

The data explosion (the availability of unprecedented amounts of data) has enabled the widespread use of both recent and classic machine learning techniques and has made them perform well. If nowadays you can talk to your mobile phone and expect it to answer you properly, acting as your secretary (as Siri or Google Now do), it is largely thanks to machine learning. The same holds true for every application based on machine learning, such as face recognition, search engines, spam filters, recommender systems for books/music/movies, handwriting recognition, and automatic language...

Starting from the basics

We will start by exploring the first dataset, the Boston dataset, but before delving into the numbers, we will import a series of helpful packages that will be used during the rest of the chapter:

In: import numpy as np
  import pandas as pd
  import matplotlib.pyplot as plt
  import matplotlib as mpl

If you are working in an IPython Notebook, running the following command in a cell instructs the Notebook to display any graphic output in the Notebook itself (if you are not working in IPython, just skip the command: it is an IPython magic command rather than Python syntax, so it won't work in a plain Python interpreter such as Python's IDLE):

In: %matplotlib inline
  # If you are using IPython, this will make the images available in the Notebook

To select the variables that we need right away, we load all the available data into a pandas data structure, the DataFrame.

Inspired by a similar data structure present in the R statistical language, a DataFrame renders data vectors of different types easy to handle under...
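As a minimal sketch of the idea, assuming pandas is installed, the snippet below builds a small DataFrame from made-up values and shows that each column keeps its own type. The column names and numbers are hypothetical and are not taken from the Boston dataset.

```python
import pandas as pd

# Hypothetical toy records; these are NOT values from the Boston dataset.
toy = pd.DataFrame({
    "rooms": [6.5, 5.9, 7.1],             # floating-point column
    "near_river": [False, True, False],   # boolean column
    "zone": ["A", "B", "A"],              # text column
})

# Each column keeps its own data type.
print(toy.dtypes)

# Select only the variables of interest by column name.
subset = toy[["rooms", "near_river"]]
print(subset.shape)
```

Selecting columns by name, as in the last step, is the kind of operation that makes it convenient to pick out a single predictor and a target before fitting a model.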

Extending to linear regression

Linear regression tries to fit a line through a given set of points, choosing the best fit. The best fit is the line that minimizes the sum of the squared differences between the value given by the line for each value of x and the corresponding observed y value. (It optimizes the same squared error that we met before when checking how good a mean was as a predictor.)

Since a linear regression is a line, in bi-dimensional space (x, y) it takes the form of the classical formula of a line in a Cartesian plane: y = mx + q, where m is the angular coefficient, or slope (the tangent of the angle between the line and the x axis), and q is the intercept, that is, the value of y at which the line crosses the y axis (where x = 0).
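To make the best-fit idea concrete, here is a minimal sketch in plain Python that computes m and q for a handful of made-up points using the standard closed-form least-squares formulas for a single predictor. The data and the helper names `fit_line` and `sse` are illustrative assumptions, not code from the book.

```python
# Made-up points lying close to the line y = 2x + 1 (illustrative only).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]


def fit_line(x, y):
    """Return (m, q) minimizing the sum of squared errors for y = m*x + q."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sxy / sxx              # slope
    q = mean_y - m * mean_x    # value of the line at x = 0
    return m, q


def sse(x, y, m, q):
    """Sum of squared differences between the line and the observed y."""
    return sum((m * xi + q - yi) ** 2 for xi, yi in zip(x, y))


m, q = fit_line(x, y)
print("slope:", m, "intercept:", q)

# Compare with using the mean of y as a constant predictor (a flat line).
mean_y = sum(y) / len(y)
print("SSE of fitted line:", sse(x, y, m, q))
print("SSE of the mean:   ", sse(x, y, 0.0, mean_y))
```

On these points the fitted line has a much smaller squared error than the flat line at the mean, which is the sense in which it is the better predictor.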

Formally, machine learning indicates the correct expression for a linear regression as follows:

$$y = X\beta + \beta_0$$

Here, again, X is a matrix of the predictors, β is a vector of coefficients (one per predictor), and β₀ is a constant value called the bias (it is the same as the Cartesian formulation, only the notation is different).
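The following NumPy sketch illustrates this notation with hypothetical numbers: predictions are the matrix of predictors times the coefficients, plus the bias. The second half checks that, with a single predictor, the matrix form gives the same values as y = mx + q. All values and variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical values: 3 observations (rows), 2 predictors (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
beta = np.array([0.5, -1.0])   # one coefficient per predictor
beta_0 = 2.0                   # bias (intercept)

# Matrix form: every row of X is combined with beta, then the bias is added.
y_hat = X @ beta + beta_0
print(y_hat)

# With a single predictor, the matrix form reduces to y = m*x + q.
x = np.array([0.0, 1.0, 2.0])
m, q = 2.0, 1.0
y_matrix = x.reshape(-1, 1) @ np.array([m]) + q
y_cartesian = m * x + q
print(np.allclose(y_matrix, y_cartesian))
```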

We can better...

Minimizing the cost function

At the core of linear regression, there is the search for the equation of a line that minimizes the sum of the squared differences between the line's y values and the original ones. As a reminder, let's say our regression function is called h, and its predictions h(X), as in this formulation:

$$h(X) = X\beta + \beta_0$$

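Consequently, our cost function to be minimized is as follows:

$$J(\beta, \beta_0) = \sum_{i=1}^{n} \left( h(x_i) - y_i \right)^2$$

Here, x_i is the i-th row of X, y_i is the corresponding observed value, and n is the number of observations. (Scaling the sum by a positive constant does not change the coefficients that minimize it.)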

There are several methods to minimize it, some performing better than others in the presence of large quantities of data. Among the better performers, the most important are the pseudoinverse (which you can find in books on statistics), QR factorization, and gradient descent.
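As a rough sketch of how the three approaches compare, the snippet below fits the same synthetic data with each of them using NumPy. The data, the learning rate `alpha`, the number of iterations, and the helper name `gradient_descent` are all assumptions chosen for illustration. The gradient shown corresponds to the squared-error cost scaled by 1/(2n), a common convention that does not change the minimizing coefficients.

```python
import numpy as np

# Synthetic data around y = 3x + 4 (illustrative, not from the book).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 4.0 + rng.normal(0, 1.0, size=200)

# Design matrix with a column of ones, so the bias is estimated
# together with the slope: w = [bias, slope].
A = np.column_stack([np.ones_like(x), x])

# 1. Pseudoinverse: w = pinv(A) y
w_pinv = np.linalg.pinv(A) @ y

# 2. QR factorization: A = QR, then solve R w = Q^T y
Q, R = np.linalg.qr(A)
w_qr = np.linalg.solve(R, Q.T @ y)


# 3. Batch gradient descent on the cost (1 / 2n) * sum of squared errors.
def gradient_descent(A, y, alpha=0.02, n_iter=10000):
    n = len(y)
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        gradient = A.T @ (A @ w - y) / n
        w -= alpha * gradient
    return w


w_gd = gradient_descent(A, y)

print("pseudoinverse:   ", w_pinv)
print("QR factorization:", w_qr)
print("gradient descent:", w_gd)
```

The two direct methods agree to numerical precision, while gradient descent approaches the same solution iteratively. Its step size has to be small enough for the scale of the predictors, which is why the value used here should be treated as specific to this toy data.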

Explaining the reason for using squared errors

Looking under the hood of a linear regression analysis, at first it could be puzzling to realize that we are striving to minimize the squared differences between our estimates and the data from which we are building the model. Squared differences are not as intuitively explainable...
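One general property that can make squared errors less puzzling, offered here as a side illustration with made-up numbers rather than as a restatement of the chapter's reasoning: the constant that minimizes the sum of squared differences from a set of values is their mean, whereas the constant that minimizes the sum of absolute differences is their median. The sketch below checks this by brute force over a grid of candidate constants.

```python
# Made-up values with one large outlier (illustrative only).
values = [1.0, 2.0, 3.0, 4.0, 100.0]


def sum_squared(c):
    """Sum of squared differences between each value and the constant c."""
    return sum((v - c) ** 2 for v in values)


def sum_absolute(c):
    """Sum of absolute differences between each value and the constant c."""
    return sum(abs(v - c) for v in values)


mean = sum(values) / len(values)
median = sorted(values)[len(values) // 2]   # odd number of values

# Brute-force search over candidate constants 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_for_squared = min(candidates, key=sum_squared)
best_for_absolute = min(candidates, key=sum_absolute)

print("mean:", mean, "best constant for squared errors:", best_for_squared)
print("median:", median, "best constant for absolute errors:", best_for_absolute)
```

Minimizing squared errors therefore generalizes the idea of the mean as a predictor, which is consistent with the link between the mean and regression drawn earlier in the chapter.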

Summary

In this chapter, we introduced linear regression as a supervised machine learning algorithm. We explained its functional form and its relationship with the statistical measures of mean and correlation, and we built a simple linear regression model on the Boston house prices data. After doing that, we looked at how regression works under the hood by presenting its key mathematical formulations and their translation into Python code.

In the next chapter, we will continue our discourse about linear regression, extending our predictors to multiple variables and carrying on our explanation where we left it suspended during our initial illustration with a single variable. We will also point out the most useful transformations you can apply to data to make it suitable for processing by a linear regression algorithm.