Chapter 7. Online and Batch Learning
In this chapter, you will be presented with best practices for training classifiers on big data. The approach presented in the following pages is both scalable and generic, making it well suited to datasets with a huge number of observations. Moreover, it allows you to cope with streaming datasets, that is, datasets whose observations arrive on-the-fly and are not all available at the same time. Furthermore, such an approach can improve accuracy as more data is fed in during the training process.
In contrast to the classic approach seen so far in the book, batch learning, this new approach is, not surprisingly, called online learning. The core of online learning is the divide et impera (divide and conquer) principle: at each step, a mini-batch of the data serves as input to train and improve the classifier.
In this chapter, we will first focus on batch learning and its limitations, and then introduce online learning. Finally, we will put online learning to work on a dataset too big to fit comfortably in memory.
When the dataset is fully available at the beginning of a supervised task, and doesn't exceed the amount of RAM on your machine, you can train the classifier or the regressor using batch learning. As seen in previous chapters, during training the learner scans the full dataset. This also happens when stochastic gradient descent (SGD)-based methods are used (see Chapter 2, Approaching Simple Linear Regression, and Chapter 3, Multiple Regression in Action). Let's now measure how much time is needed to train a linear regressor, and relate its performance to the number of observations in the dataset (that is, the number of rows of the feature matrix X) and the number of features (that is, the number of columns of X). In this first experiment, we will use the plain vanilla LinearRegression() and SGDRegressor() classes provided by Scikit-learn, and we will record the actual time taken to fit each regressor, without any parallelization.
Let's first create a function to create fake datasets of an arbitrary number of observations and features, and a function to measure the fitting time:
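A minimal sketch of this setup, assuming synthetic data from Scikit-learn's make_regression (the helper names and sizes here are illustrative, not the book's exact code):

    import time
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, SGDRegressor

    def generate_dataset(n_samples, n_features, random_state=101):
        # Create a synthetic regression problem of the requested size.
        X, y = make_regression(n_samples=n_samples, n_features=n_features,
                               noise=1.0, random_state=random_state)
        return X, y

    def time_fit(learner, X, y):
        # Return the wall-clock seconds needed to fit the learner.
        start = time.time()
        learner.fit(X, y)
        return time.time() - start

    X, y = generate_dataset(100000, 100)
    print('LinearRegression: %.2f s' % time_fit(LinearRegression(), X, y))
    print('SGDRegressor:     %.2f s' % time_fit(SGDRegressor(), X, y))

By re-running the last three lines with different values of n_samples and n_features, you can chart how each learner's fitting time grows with the size of the problem.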
Online mini-batch learning
From the previous section, we've learned an interesting lesson: for big data, always use SGD-based learners, because they are faster and they scale.
Now, in this section, let's consider this regression dataset:
The X_train matrix is composed of 200 million elements, and may not completely fit in memory (on a machine with 4 GB of RAM); the testing set is composed of 10,000 observations.
Let's first create the datasets, and print the memory footprint of the biggest one:
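A sketch of how such datasets might be generated, assuming 2,000,000 training observations with 100 features each; this is one plausible shape that matches the 200 million elements and the 1.6 GB footprint quoted below, though the book's exact split may differ:

    from sklearn.datasets import make_regression

    # 2,000,000 training rows x 100 features = 200 million elements;
    # 10,000 extra rows are held out for testing (assumed shapes).
    X, y = make_regression(n_samples=2010000, n_features=100,
                           noise=1.0, random_state=101)
    X_train, y_train = X[:2000000], y[:2000000]
    X_test, y_test = X[2000000:], y[2000000:]

    # Memory footprint of the biggest matrix, in gigabytes.
    print('X_train size: %.1f GB' % (X_train.nbytes / 1E9))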
The X_train matrix is itself 1.6 GB of data; we can consider it a starting point for big data. Let's now try to learn it using the best model we got from the previous section, SGDRegressor(). To access the training data in mini-batches, we will read it in chunks and feed each chunk to the learner's partial_fit method.
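A minimal sketch of the mini-batch loop, assuming the X_train, y_train, X_test, and y_test arrays from the previous snippet and a chunk size of 10,000 (an illustrative choice):

    from sklearn.linear_model import SGDRegressor

    learner = SGDRegressor(random_state=101)
    batch_size = 10000

    # Feed the training data to the learner one mini-batch at a time;
    # each call to partial_fit runs one pass of SGD over that chunk only,
    # so the full X_train never has to be scanned in a single shot.
    for start in range(0, X_train.shape[0], batch_size):
        learner.partial_fit(X_train[start:start + batch_size],
                            y_train[start:start + batch_size])

    # Evaluate on the held-out observations.
    print('R2 on the test set: %.3f' % learner.score(X_test, y_test))

Note that only one chunk at a time needs to be in memory, which is exactly what makes this approach viable when the full training matrix exceeds the available RAM.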
In this chapter, we've introduced the concepts of batch and online learning, which are necessary for processing big datasets (big data) in a quick and scalable way.
In the next chapter, we will explore some advanced techniques of machine learning that will produce great results for some classes of well-known problems.