You're reading from Regression Analysis with Python

Product type Book

Published in Feb 2016

ISBN-13 9781785286315

Pages 312 pages

Edition 1st Edition

Languages

Python

Concepts

Statistics

Authors (2):

Luca Massaron

Alberto Boschetti

Table of Contents (16) Chapters

Regression Analysis with Python

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

1. Regression – The Workhorse of Data Science

2. Approaching Simple Linear Regression

3. Multiple Regression in Action

4. Logistic Regression

5. Data Preparation

6. Achieving Generalization

7. Online and Batch Learning

8. Advanced Regression Methods

9. Real-world Applications for Regression Models

Index

Chapter 9. Real-world Applications for Regression Models

We have arrived at the concluding chapter of the book. Compared with the previous chapters, this one is very practical: it consists mostly of code, with no math or other theoretical explanation. It comprises four practical examples of real-world data science problems solved using linear models. The ultimate goal is to demonstrate how to approach such problems and how to develop the reasoning behind their resolution, so that the examples can be used as blueprints for similar challenges you'll encounter.

For each problem, we will describe the question to be answered, provide a short description of the dataset, and choose the metric we strive to maximize (or the error we want to minimize). Then, alongside the code, we will provide ideas and intuitions that are key to successfully completing each one. In addition, when run, the code will produce verbose output from the modeling, in order to provide the reader with...

Downloading the datasets

In this section, we will download all the datasets used in the examples in this chapter. We chose to store them in separate subdirectories of the folder that contains the IPython Notebook. Note that some of them are quite big (100+ MB).

Tip

We would like to thank the maintainers and the creators of the UCI dataset archive. Thanks to such repositories, modeling and achieving experiment repeatability are much easier than before. The UCI archive is from Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

For each dataset, we first download it and then display its first couple of lines. This serves two purposes: it confirms that the file has been correctly downloaded, unpacked, and placed in the right location, and it shows the structure of the file itself (header, fields, and so on):

In:
try:
    import...
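
As a minimal sketch of the "present the first couple of lines" step described above (not the chapter's own code, which is elided here), a small helper can read only the head of a file that has already been downloaded. The function name `peek` and its parameters are hypothetical and introduced only for illustration.

```python
from itertools import islice


def peek(path, n_lines=2, encoding="utf-8"):
    """Return the first n_lines of a text file without reading the rest.

    Hypothetical helper: it assumes the file is plain text and already
    sits at `path` (downloaded and unpacked beforehand).
    """
    with open(path, "r", encoding=encoding) as fh:
        # islice stops after n_lines, so 100+ MB files are not fully loaded
        return [line.rstrip("\n") for line in islice(fh, n_lines)]
```

Printing the returned lines is enough to check both that the file is in the expected location and what its header and fields look like.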

A regression problem

Given some descriptors of a song, the goal of this problem is to predict the year when the song was produced. That's basically a regression problem, since the target variable to predict is a number in the range between 1922 and 2011.

For each song, in addition to the year of production, 90 attributes are provided. All of them are related to the timbre: 12 of them relate to the timbre average and 78 attributes describe the timbre's covariance; all the features are numerical (integer or floating point numbers).

The dataset is composed of more than half a million observations. For the competition behind the dataset, the authors tried to achieve the best results using the first 463,715 observations as a training set and the remaining 51,630 for testing.
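
The positional split described above can be sketched as follows. This is an illustration under stated assumptions, not the chapter's code: the helper name `positional_split` is hypothetical, and a `range` of row indices stands in for the real observations.

```python
N_TRAIN = 463_715  # training observations, as stated in the text
N_TEST = 51_630    # testing observations, as stated in the text


def positional_split(rows, n_train=N_TRAIN):
    """Split an ordered sequence into (train, test) by position, no shuffling."""
    return rows[:n_train], rows[n_train:]


# Stand-in for the real data: only the row indices, in file order
rows = range(N_TRAIN + N_TEST)
train, test = positional_split(rows)
print(len(train), len(test))
```

Splitting by position rather than at random keeps the partition identical to the one stated in the text, so results remain comparable.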

The metric used to evaluate the results is the Mean Absolute Error (MAE) between the predicted year and the real year of production for the songs in the test set. The goal is to minimize this error measure.
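
To make the metric concrete, here is a small sketch of the MAE written in plain Python. The years below are made up for illustration and do not come from the dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all observations."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example values (not taken from the dataset)
y_true = [1990, 2001, 1985, 2010]
y_pred = [1995, 2000, 1980, 2010]
mae = mean_absolute_error(y_true, y_pred)  # (5 + 1 + 5 + 0) / 4
print(mae)
```

Because the errors are averaged in absolute value, the result is expressed in years, which makes it easy to interpret.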

Note

The...

An imbalanced and multiclass classification problem

Given some descriptors of a sequence of packets flowing to or from a host connected to the Internet, the goal of this problem is to detect whether that sequence signals a malicious attack. If it does, we should also classify the type of attack. That's a multiclass classification problem, since there are more than two possible labels.

For each observation, 42 features are provided; some of them are categorical, whereas others are numerical. The dataset is composed of almost 5 million observations (but in this exercise we're using just the first million, to avoid memory constraints), and the number of possible labels is 23. One of them represents a non-malicious situation (normal); the other 22 represent different network attacks. Note that the frequencies of the response classes are imbalanced: for some attacks there are many observations, for others just a few.
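
One common way to quantify and counter the class imbalance mentioned above is to count the labels and derive per-class weights that are inversely proportional to their frequency. The sketch below is an assumption about how one might proceed, not necessarily the approach taken in the chapter; the labels `attack_a` and `attack_b` and their counts are invented. The weighting formula, n_samples / (n_classes * count), is the "balanced" heuristic.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Invented labels: one frequent class and two progressively rarer ones
labels = ["normal"] * 6 + ["attack_a"] * 3 + ["attack_b"] * 1
weights = balanced_class_weights(labels)
for label, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(label, round(weight, 3))
```

Rare classes receive larger weights, so a learner that accepts per-class weights pays more attention to misclassifying them.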

No instruction is...

A ranking problem

Given some descriptors of a car and its price, the goal of this problem is to predict the degree to which the car is riskier than its price indicates. Actuaries in the insurance business call this process symboling, and the outcome is a rank: a value of +3 indicates the car is risky; -3 indicates that it's pretty safe (although the lowest value in the dataset is -2).

The description of the car includes its specifications in terms of various characteristics (brand, fuel type, body style, length, and so on). Moreover, you get its price and normalized loss in use as compared to other cars (this represents the average loss per car per year, normalized for all cars within a certain car segment).

There are 205 cars in the dataset, and the number of features is 25; some of them are categorical, and others are numerical. In addition, the dataset description explicitly states that there are some missing values, encoded using the string "?".
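
Since missing values are encoded with the string "?", they have to be turned into a proper missing marker before modeling. The following sketch does this with the standard library only; the column names and the three-line sample are hypothetical stand-ins, not the real file.

```python
import csv
import io

MISSING = "?"  # missing-value marker stated in the text


def read_rows(text_stream, missing=MISSING):
    """Read CSV rows as dicts, replacing the missing marker with None."""
    reader = csv.DictReader(text_stream)
    return [{key: (None if value == missing else value)
             for key, value in row.items()}
            for row in reader]


# Hypothetical sample with invented column names and values
sample = io.StringIO(
    "symboling,normalized_losses,price\n"
    "3,?,13495\n"
    "1,164,?\n"
)
rows = read_rows(sample)
missing_per_column = {col: sum(r[col] is None for r in rows)
                      for col in rows[0]}
print(missing_per_column)
```

When loading the data with pandas instead, passing `na_values="?"` to `read_csv` achieves the same conversion at load time.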

Although it is not stated directly on the presentation...

A time series problem

The last problem we're going to see in this chapter is about prediction in time. The standard name for these problems is time series analysis, since the prediction is made on descriptors extracted in the past; therefore, the outcome at the current time will become a feature for the prediction of the next point in time. In this exercise, we're using the closing values for several stocks composing the Dow Jones index in 2011.

Several features compose the dataset, but in this problem (to make a short and complete exercise) we're just using the closing values of each week for each of the 30 measured stocks, ordered in time. The dataset spans six months: we're using the first half of the dataset (corresponding to the first quarter of the year under observation, with 12 weeks) to train our algorithm, and the second half (containing the second quarter of the year, with 13 weeks) to test the predictions.
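
The idea that past outcomes become features for the next point in time can be sketched with lagged values. The helper `make_lagged`, the choice of two lags, and the synthetic closing prices are all assumptions made for illustration; only the 12-week and 13-week split comes from the text.

```python
def make_lagged(series, n_lags=1):
    """Build (features, target) pairs: previous n_lags values -> next value."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])  # values observed before time t
        y.append(series[t])             # value to predict at time t
    return X, y


# Synthetic weekly closes for one stock (25 weeks, invented values)
closes = [float(100 + week) for week in range(25)]

# First 12 weeks for training, remaining 13 weeks for testing
train_closes, test_closes = closes[:12], closes[12:]

X_train, y_train = make_lagged(train_closes, n_lags=2)
X_test, y_test = make_lagged(test_closes, n_lags=2)
print(len(X_train), len(X_test))
```

In this sketch the test lags are built only from test-period values, so no training-period target is reused; carrying the last training weeks over as the first test lags is an equally reasonable design choice.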

Moreover, since we don't expect readers to have a background in economics...

Summary

In this chapter, we've explored four practical data science examples involving classifiers and regressors. We strongly encourage readers to read and understand the code, and then to try adding further steps in order to boost performance.
