| Module | Description | Example classes |
| --- | --- | --- |
| cluster | Unsupervised clustering | KMeans and Ward |
| decomposition | Dimensionality reduction | PCA and NMF |
| ensemble | Ensemble-based methods | AdaBoostClassifier, AdaBoostRegressor, RandomForestClassifier, RandomForestRegressor |
| lda | Linear discriminant analysis | LDA |
| linear_model | Generalized linear models | LinearRegression, LogisticRegression, Lasso, and Perceptron |
| mixture | Mixture models | GMM and VBGMM |
| naive_bayes | Supervised learning based on Bayes' theorem | BaseNB, BernoulliNB, and GaussianNB |
| neighbors | k-nearest neighbors | KNeighborsClassifier, KNeighborsRegressor, ... |
Data representation in scikit-learn
In contrast to the heterogeneous domains and applications of machine learning, the data representation in scikit-learn is less diverse, and the basic format that many algorithms expect is straightforward—a matrix of samples and features.
The underlying data structure is a NumPy ndarray: each row in the matrix corresponds to one sample, and each column to the value of one feature.
There is something like a Hello World in the world of machine learning datasets as well; for example, the Iris dataset, whose origins date back to 1936. With the standard installation of scikit-learn, you already have access to a couple of datasets, including Iris, which consists of 150 samples, each with four measurements, taken from three different Iris flower species:
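A minimal sketch of loading the bundled dataset, using scikit-learn's load_iris helper:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): 150 samples, 4 features each
print(iris.target_names)  # the three species names
```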
The dataset is packaged as a bunch, which is only a thin wrapper around a dictionary:
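As a quick sketch of that wrapper, assuming the iris object loaded above:

```python
# A Bunch is a dict subclass that also allows attribute access.
print(iris.keys())                # 'data', 'target', 'target_names', ...
print(iris['data'] is iris.data)  # True: item and attribute access agree
```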
Supervised learning – classification and regression
In this section, we will show short examples for both classification and regression.
Classification problems are pervasive: document categorization, fraud detection, market segmentation in business intelligence, and protein function prediction in bioinformatics.
While it might be possible to hand-craft rules that assign a category or label to new data, it is faster to use algorithms that learn and generalize from the existing data.
We will continue with the Iris dataset. Before we apply a learning algorithm, we want to get an intuition of the data by looking at some values and plots.
All measurements share the same unit (centimeters), which makes it easy to compare the variance of the four features side by side in a boxplot:
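A minimal sketch of such a boxplot, assuming the iris Bunch from above:

```python
import matplotlib.pyplot as plt

# One box per column (feature); all four share the centimeter scale.
plt.boxplot(iris.data)
plt.xticks(range(1, 5), iris.feature_names, rotation=20)
plt.ylabel('cm')
plt.show()
```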
We see that the petal length (the third feature) exhibits the biggest variance, which could indicate the importance of this feature during classification. It is also insightful to plot the data points in two dimensions, using one feature for each axis. And, indeed...
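A sketch of such a two-feature plot; the choice of petal length and petal width (columns 2 and 3) is an illustrative assumption:

```python
import matplotlib.pyplot as plt

# Color each point by its species label to see how well the classes separate.
plt.scatter(iris.data[:, 2], iris.data[:, 3], c=iris.target)
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.show()
```

Since the chapter summary names the Support Vector Machine among the algorithms covered, a minimal classification sketch on the labeled data might look as follows:

```python
from sklearn.svm import SVC

clf = SVC()
clf.fit(iris.data, iris.target)    # learn from the labeled samples
print(clf.predict(iris.data[:3]))  # predict labels for the first three rows
```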
Unsupervised learning – clustering and dimensionality reduction
A lot of existing data is not labeled. It is still possible to learn from data without labels with unsupervised models. A typical task during exploratory data analysis is to find related items or clusters. We can imagine the Iris dataset, but without the labels:
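A sketch of that unlabeled view, reusing the same two petal features as before (again an illustrative choice), but without the species color-coding:

```python
import matplotlib.pyplot as plt

plt.scatter(iris.data[:, 2], iris.data[:, 3])
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.show()
```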
While the task seems much harder without labels, one group of measurements (in the lower-left) seems to stand apart. The goal of clustering algorithms is to identify these groups.
We will use K-Means clustering on the Iris dataset (without the labels). This algorithm expects the number of clusters to be specified in advance, which can be a disadvantage. K-Means will try to partition the dataset into groups by minimizing the within-cluster sum of squares. For example, we instantiate the KMeans model with n_clusters equal to 3:
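A minimal sketch of that instantiation:

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3)
```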
Similar to supervised algorithms, we can use the fit method...
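A minimal sketch of that fitting step on the unlabeled measurements:

```python
# No labels are passed; the algorithm sees only the measurements.
km.fit(iris.data)
print(km.labels_[:10])  # cluster index assigned to the first ten samples
```

The heading of this section also promises dimensionality reduction. Since the chapter summary names Principal Component Analysis, here is a sketch of projecting the four Iris features down to two with PCA:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(iris.data)   # shape (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```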
Measuring prediction performance
We have already seen that the machine learning process consists of the following steps:
Model selection: We first select a suitable model for our data. Do we have labels? How many samples are available? Is the data separable? How many dimensions do we have? As this step is nontrivial, the choice will depend on the actual problem. As of Fall 2015, the scikit-learn documentation contains a much-appreciated flowchart called Choosing the right estimator. It is short, but very informative and worth taking a closer look at.
Training: We have to bring the model and data together, and this usually happens in the fit methods of the models in scikit-learn.
Application: Once we have trained our model, we are able to make predictions about the unseen data.
So far, we omitted an important step that takes place between the training and application: the model testing and validation. In this step, we want to evaluate how well our model has learned.
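A sketch of what such an evaluation can look like: hold out part of the data, train on the rest, and score on the held-out part. The use of KNeighborsClassifier here is an illustrative choice; note that in releases from the book's era, train_test_split lived in sklearn.cross_validation rather than sklearn.model_selection.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

clf = KNeighborsClassifier().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out quarter
```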
One goal of learning, and...
In this chapter, we took a whirlwind tour through one of the most popular Python machine learning libraries: scikit-learn. We saw what kind of data this library expects. Real-world data will seldom be ready to be fed into an estimator right away. With powerful libraries such as NumPy and, especially, Pandas, you already saw how data can be retrieved, combined, and brought into shape. Visualization libraries, such as matplotlib, help along the way to get an intuition of the datasets, problems, and solutions.
During this chapter, we looked at a canonical dataset, the Iris dataset, from various angles: as a problem in supervised and unsupervised learning and as an example of model validation.
In total, we have looked at four different algorithms: the Support Vector Machine, Linear Regression, K-Means clustering, and Principal Component Analysis. Each of these alone is worth exploring, and we barely scratched the surface, although we were able to implement all...