Regression Analysis with Python

Product type: Book
Published: Feb 2016
ISBN-13: 9781785286315
Pages: 312
Edition: 1st Edition
Authors (2): Luca Massaron, Alberto Boschetti

Table of Contents (16) Chapters

Regression Analysis with Python
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
1. Regression – The Workhorse of Data Science
2. Approaching Simple Linear Regression
3. Multiple Regression in Action
4. Logistic Regression
5. Data Preparation
6. Achieving Generalization
7. Online and Batch Learning
8. Advanced Regression Methods
9. Real-world Applications for Regression Models
Index

Chapter 7. Online and Batch Learning

In this chapter, we present best practices for training models on big data. The approach described in the following pages is both scalable and generic, which makes it well suited to datasets with a huge number of observations. It also lets you cope with streaming datasets, that is, datasets whose observations are transmitted on the fly and are not all available at the same time. Furthermore, because the model keeps learning as more data is fed in during training, its accuracy can improve over time.

In contrast to batch learning, the classic approach seen so far in the book, this new approach is, not surprisingly, called online learning. The core of online learning is the divide et impera (divide and conquer) principle: the data is split into mini-batches, and at each step one mini-batch serves as input to train and improve the model.

In this chapter, we will first focus on batch learning and its limitations, and then introduce online learning. Finally...

Batch learning


When the dataset is fully available at the beginning of a supervised task and fits in the RAM of your machine, you can train the classifier or the regressor using batch learning. As seen in previous chapters, during training the learner scans the full dataset. This also happens when stochastic gradient descent (SGD)-based methods are used (see Chapter 2, Approaching Simple Linear Regression, and Chapter 3, Multiple Regression in Action). Let's now measure how much time is needed to train a linear regressor and relate the training time to the number of observations in the dataset (that is, the number of rows of the feature matrix X) and the number of features (that is, the number of columns of X). In this first experiment, we will use the plain vanilla LinearRegression() and SGDRegressor() classes provided by scikit-learn, and we will record the actual time taken to fit each regressor, without any parallelization.

Let's first create a function to create fake...
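The chapter's own listing is cut off here. As a stand-in, the following is a minimal sketch of the kind of timing experiment described above; it is not the authors' code. It assumes scikit-learn's make_regression as the source of fake data, the helper names make_fake_regression_data and time_fit are hypothetical, and the dataset sizes are kept small so that it runs in a few seconds.

```python
import time

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor


def make_fake_regression_data(n_train, n_test, n_features, noise=0.1):
    """Hypothetical helper: synthetic regression data split into train/test."""
    X, y = make_regression(n_samples=n_train + n_test,
                           n_features=n_features,
                           noise=noise,
                           random_state=101)
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]


def time_fit(model, X, y):
    """Return the wall-clock seconds needed to fit `model` on (X, y)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


# Fit both learners on growing datasets and record the training time.
timings = {}
for n_obs in (1_000, 10_000):
    X_train, X_test, y_train, y_test = make_fake_regression_data(
        n_obs, 1_000, 20, noise=10.0)
    timings[n_obs] = {
        "LinearRegression": time_fit(LinearRegression(), X_train, y_train),
        "SGDRegressor": time_fit(SGDRegressor(random_state=101),
                                 X_train, y_train),
    }

for n_obs, result in timings.items():
    print(n_obs, result)
```

Absolute timings depend on the machine; what matters for the comparison is how each learner's time grows as the number of rows and columns increases.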

Online mini-batch learning


From the previous section, we've learned an interesting lesson: for big data, SGD-based learners are the better choice because they train faster and scale to larger datasets.

Now, in this section, let's consider this regression dataset:

  • Massive number of observations: 2M

  • Large number of features: 100

  • Noisy dataset

The X_train matrix is composed of 200 million elements, and may not completely fit in memory (on a machine with 4 GB RAM); the testing set is composed of 10,000 observations.

Let's first create the datasets, and print the memory footprint of the biggest one:

In:
# Let's generate a dataset with 2M training observations
X_train, X_test, y_train, y_test = generate_dataset(2000000, 10000, 100, 10.0)
print("Size of X_train is [GB]:", X_train.size * X_train[0,0].itemsize/1E9)

Out:
Size of X_train is [GB]: 1.6

The X_train matrix alone holds 1.6 GB of data; we can consider it a starting point for big data. Let's now try to fit a regression model on it using the best model we got from the previous section, SGDRegressor(). To access...
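As a rough illustration of the mini-batch idea (a sketch under stated assumptions, not the book's listing): scikit-learn's SGDRegressor exposes partial_fit(), which updates the model with one chunk of data at a time, so the learner never needs the full matrix in a single call. The example below uses a much smaller synthetic dataset than the 2M-row one discussed here, and iter_minibatches is a hypothetical helper name.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error

# Small stand-in data; the sizes are illustrative, not the chapter's 2M rows.
X, y = make_regression(n_samples=21_000, n_features=100,
                       noise=10.0, random_state=101)
X_train, y_train = X[:20_000], y[:20_000]
X_test, y_test = X[20_000:], y[20_000:]


def iter_minibatches(X, y, batch_size):
    """Hypothetical helper: yield consecutive chunks of at most batch_size rows."""
    for start in range(0, X.shape[0], batch_size):
        stop = start + batch_size
        yield X[start:stop], y[start:stop]


model = SGDRegressor(random_state=101)
n_batches = 0
for X_chunk, y_chunk in iter_minibatches(X_train, y_train, batch_size=1_000):
    # One SGD pass over this chunk only; the learned weights carry over
    # to the next call.
    model.partial_fit(X_chunk, y_chunk)
    n_batches += 1

mae = mean_absolute_error(y_test, model.predict(X_test))
print("Mini-batches seen:", n_batches)
print("Test MAE:", mae)
```

In a real streaming or out-of-core setting, the chunks would come from disk or a network source rather than from slices of an in-memory array; the training loop itself stays the same.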

Summary


In this chapter, we've introduced the concepts of batch and online learning, which are necessary to be able to process big datasets (big data) in a quick and scalable way.

In the next chapter, we will explore some advanced techniques of machine learning that will produce great results for some classes of well-known problems.
