Chapter 7. Online and Batch Learning
In this chapter, you will be presented with best practices for training classifiers on big data. The approach presented in the following pages is both scalable and generic, making it well suited to datasets with a huge number of observations. Moreover, it allows you to cope with streaming datasets, that is, datasets whose observations arrive on-the-fly and are not all available at the same time. Furthermore, such an approach can improve accuracy as more data is fed in during the training process.
In contrast to the classic approach seen so far in the book, batch learning, this new approach is, not surprisingly, called online learning. The core of online learning is the divide et impera (divide and conquer) principle: at each step, a mini-batch of the data serves as input to train and improve the classifier.
In this chapter, we will first focus on batch learning and its limitations, and then introduce online learning. Finally, we will put online learning to work on a dataset too big to fit comfortably in memory.
When the dataset is fully available at the beginning of a supervised task, and doesn't exceed the amount of RAM on your machine, you can train the classifier or the regressor using batch learning. As seen in previous chapters, during training the learner scans the full dataset. This also happens when stochastic gradient descent (SGD)-based methods are used (see Chapter 2, Approaching Simple Linear Regression, and Chapter 3, Multiple Regression in Action). Let's now measure how much time is needed to train a linear regressor, and relate its performance to the number of observations in the dataset (that is, the number of rows of the feature matrix X) and the number of features (that is, the number of columns of X). In this first experiment, we will use the plain vanilla LinearRegression() and SGDRegressor() classes provided by Scikit-learn, and we will record the actual time taken to fit each regressor, without any parallelization.
Let's first create a function to create fake datasets of an arbitrary number of observations and features, and a function to measure the fitting time:
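A minimal sketch of this setup, assuming synthetic data from Scikit-learn's make_regression (the helper names and sizes here are illustrative, not the book's exact code):

    import time
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, SGDRegressor

    def generate_dataset(n_samples, n_features, random_state=101):
        # Create a synthetic regression problem of the requested size.
        X, y = make_regression(n_samples=n_samples, n_features=n_features,
                               noise=1.0, random_state=random_state)
        return X, y

    def time_fit(learner, X, y):
        # Return the wall-clock seconds needed to fit the learner.
        start = time.time()
        learner.fit(X, y)
        return time.time() - start

    X, y = generate_dataset(100000, 100)
    print('LinearRegression: %.2f s' % time_fit(LinearRegression(), X, y))
    print('SGDRegressor:     %.2f s' % time_fit(SGDRegressor(), X, y))

By re-running the last three lines with different values of n_samples and n_features, you can chart how each learner's fitting time grows with the size of the problem.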
Online mini-batch learning
From the previous section, we've learned an interesting lesson: for big data, always use SGD-based learners, because they are faster and they scale.
Now, in this section, let's consider this regression dataset:
The X_train matrix is composed of 200 million elements, and may not completely fit in memory (on a machine with 4 GB of RAM); the testing set is composed of 10,000 observations.
Let's first create the datasets, and print the memory footprint of the biggest one:
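A sketch of how such datasets might be generated, assuming 2,000,000 training observations with 100 features each; this is one plausible shape that matches the 200 million elements and the 1.6 GB footprint quoted below, though the book's exact split may differ:

    from sklearn.datasets import make_regression

    # 2,000,000 training rows x 100 features = 200 million elements;
    # 10,000 extra rows are held out for testing (assumed shapes).
    X, y = make_regression(n_samples=2010000, n_features=100,
                           noise=1.0, random_state=101)
    X_train, y_train = X[:2000000], y[:2000000]
    X_test, y_test = X[2000000:], y[2000000:]

    # Memory footprint of the biggest matrix, in gigabytes.
    print('X_train size: %.1f GB' % (X_train.nbytes / 1E9))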
The X_train matrix is itself 1.6 GB of data; we can consider it a starting point for big data. Let's now try to learn it using the best model we got from the previous section, SGDRegressor(). To access the training data in mini-batches, we will read it in chunks and feed each chunk to the learner's partial_fit method.
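A minimal sketch of the mini-batch loop, assuming the X_train, y_train, X_test, and y_test arrays from the previous snippet and a chunk size of 10,000 (an illustrative choice):

    from sklearn.linear_model import SGDRegressor

    learner = SGDRegressor(random_state=101)
    batch_size = 10000

    # Feed the training data to the learner one mini-batch at a time;
    # each call to partial_fit runs one pass of SGD over that chunk only,
    # so the full X_train never has to be scanned in a single shot.
    for start in range(0, X_train.shape[0], batch_size):
        learner.partial_fit(X_train[start:start + batch_size],
                            y_train[start:start + batch_size])

    # Evaluate on the held-out observations.
    print('R2 on the test set: %.3f' % learner.score(X_test, y_test))

Note that only one chunk at a time needs to be in memory, which is exactly what makes this approach viable when the full training matrix exceeds the available RAM.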
In this chapter, we've introduced the concepts of batch and online learning, which are necessary for processing big datasets (big data) in a quick and scalable way.
In the next chapter, we will explore some advanced techniques of machine learning that will produce great results for some classes of well-known problems.