Chapter 8. Machine Learning Models with scikit-learn

In the previous chapter, we saw how to perform data munging, data aggregation, and grouping. In this chapter, we will briefly look at the scikit-learn modules for different models, examine data representation in scikit-learn, work through supervised and unsupervised learning examples, and measure prediction performance.

An overview of machine learning models


Machine learning is a subfield of artificial intelligence that explores how machines can learn from data to analyze structures, help with decisions, and make predictions. In 1959, Arthur Samuel defined machine learning as the "field of study that gives computers the ability to learn without being explicitly programmed."

A wide range of applications employ machine learning methods, such as spam filtering, optical character recognition, computer vision, speech recognition, credit approval, search engines, and recommendation systems.

One important driver for machine learning is the fact that data is generated at an increasing pace across all sectors, be it web traffic, text, images, sensor data, or scientific datasets. Larger amounts of data give rise to many new challenges in storage and processing systems. On the other hand, many learning algorithms yield better results with more data to learn from. The field has received a lot of attention...

The scikit-learn modules for different models


The scikit-learn library is organized into submodules. Each submodule contains algorithms and helper methods for a certain class of machine learning models and approaches.

Here is a sample of those submodules, including some example models:

Submodule     | Description                                  | Example models
cluster       | Unsupervised clustering                      | KMeans, Ward
decomposition | Dimensionality reduction                     | PCA, NMF
ensemble      | Ensemble-based methods                       | AdaBoostClassifier, AdaBoostRegressor, RandomForestClassifier, RandomForestRegressor
lda           | Linear discriminant analysis                 | LDA
linear_model  | Generalized linear models                    | LinearRegression, LogisticRegression, Lasso, Perceptron
mixture       | Mixture models                               | GMM, VBGMM
naive_bayes   | Supervised learning based on Bayes' theorem  | BaseNB, BernoulliNB, GaussianNB
neighbors     | k-nearest neighbors                          | KNeighborsClassifier, KNeighborsRegressor...

Data representation in scikit-learn


In contrast to the heterogeneous domains and applications of machine learning, the data representation in scikit-learn is less diverse, and the basic format that many algorithms expect is straightforward—a matrix of samples and features.

The underlying data structure is a NumPy ndarray. Each row in the matrix corresponds to one sample, and each column to the value of one feature.

There is something like Hello World in the world of machine learning datasets as well: the Iris dataset, whose origins date back to 1936. With the standard installation of scikit-learn, you already have access to a couple of datasets, including Iris, which consists of 150 samples from three different Iris flower species, each sample consisting of four measurements:

>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()

The dataset is packaged as a Bunch, which is only a thin wrapper around a dictionary:

...
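The elided output aside, a quick look at the Bunch object illustrates its dictionary-like nature. The attributes below are those that scikit-learn's load_iris actually provides:

>>> iris.data.shape          # 150 samples with 4 features each
(150, 4)
>>> iris.feature_names       # the four measurements
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> iris.target[:5]          # integer class labels, one per sample
array([0, 0, 0, 0, 0])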

Supervised learning – classification and regression


In this section, we will show short examples for both classification and regression.

Classification problems are pervasive: document categorization, fraud detection, market segmentation in business intelligence, and protein function prediction in bioinformatics.

While it might be possible to hand-craft rules that assign a category or label to new data, it is faster to use algorithms to learn and generalize from the existing data.

We will continue with the Iris dataset. Before we apply a learning algorithm, we want to get an intuition of the data by looking at some values and plots.

All measurements share the same unit (centimeters), which makes it easy to compare their variance in a set of boxplots:
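The original figure is not reproduced here; a minimal sketch that recreates such boxplots, assuming matplotlib is installed, might look like this:

>>> import matplotlib.pyplot as plt
>>> plt.boxplot(iris.data)    # one boxplot per feature column
>>> plt.xticks([1, 2, 3, 4], iris.feature_names, rotation=20)
>>> plt.show()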

We see that the petal length (the third feature) exhibits the biggest variance, which could indicate the importance of this feature during classification. It is also insightful to plot the data points in two dimensions, using one feature for each axis. Also, indeed...
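As a sketch of such a two-dimensional view, the following plots sepal length against petal length, colored by species; the feature choice here is illustrative, not necessarily the book's:

>>> plt.scatter(iris.data[:, 0], iris.data[:, 2], c=iris.target)
>>> plt.xlabel(iris.feature_names[0])   # sepal length (cm)
>>> plt.ylabel(iris.feature_names[2])   # petal length (cm)
>>> plt.show()

Fitting a classifier then follows scikit-learn's uniform estimator interface. The Summary names the Support Vector Machine as one of the chapter's models, so a hedged sketch, rather than the book's exact code, could be:

>>> from sklearn.svm import SVC
>>> clf = SVC()
>>> clf.fit(iris.data, iris.target)     # learn from samples and labels
>>> clf.predict(iris.data[:3])          # predict labels for the first three samples
array([0, 0, 0])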

Unsupervised learning – clustering and dimensionality reduction


A lot of existing data is not labeled, but unsupervised models can still learn from it. A typical task during exploratory data analysis is to find related items or clusters. We can imagine the Iris dataset, but without the labels:

While the task seems much harder without labels, one group of measurements (in the lower-left) seems to stand apart. The goal of clustering algorithms is to identify these groups.

We will use K-Means clustering on the Iris dataset (without the labels). This algorithm expects the number of clusters to be specified in advance, which can be a disadvantage. K-Means will try to partition the dataset into groups, by minimizing the within-cluster sum of squares.
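For reference, the within-cluster sum of squares that K-Means minimizes can be written as follows, with clusters $C_1, \dots, C_k$ and cluster centroids $\mu_i$:

$$\sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$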

For example, we instantiate the KMeans model with n_clusters equal to 3:

>>> from sklearn.cluster import KMeans
>>> km = KMeans(n_clusters=3)

Similar to supervised algorithms, we can use the fit methods...
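The original continues past this point; a minimal sketch of the fit step might look like this:

>>> km.fit(iris.data)          # cluster the measurements, ignoring iris.target
>>> km.labels_[:10]            # cluster assignments for the first ten samples

For the dimensionality-reduction half of this section's title (the Summary names Principal Component Analysis), a similarly hedged sketch projects the four features down to two:

>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2)
>>> reduced = pca.fit_transform(iris.data)
>>> reduced.shape
(150, 2)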

Measuring prediction performance


We have already seen that the machine learning process consists of the following steps:

  • Model selection: We first select a suitable model for our data. Do we have labels? How many samples are available? Is the data separable? How many dimensions do we have? As this step is nontrivial, the choice will depend on the actual problem. As of Fall 2015, the scikit-learn documentation contains a much-appreciated flowchart called "choosing the right estimator". It is short, but very informative and worth taking a closer look at.

  • Training: We have to bring the model and data together, and this usually happens in the fit methods of the models in scikit-learn.

  • Application: Once we have trained our model, we are able to make predictions on unseen data.

So far, we have omitted an important step that takes place between training and application: model testing and validation. In this step, we want to evaluate how well our model has learned.
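As a hedged sketch of such an evaluation: recent scikit-learn versions provide train_test_split in sklearn.model_selection (releases from the book's era kept it in sklearn.cross_validation), and every classifier exposes a score method returning the mean accuracy:

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
...     iris.data, iris.target, test_size=0.25, random_state=0)
>>> clf = SVC().fit(X_train, y_train)   # train only on the training split
>>> clf.score(X_test, y_test)           # accuracy on the held-out split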

One goal of learning, and...

Summary


In this chapter, we took a whirlwind tour through one of the most popular Python machine learning libraries: scikit-learn. We saw what kind of data this library expects. Real-world data will seldom be ready to be fed into an estimator right away. With powerful libraries, such as NumPy and, especially, pandas, you already saw how data can be retrieved, combined, and brought into shape. Visualization libraries, such as matplotlib, help along the way to gain an intuition of the datasets, problems, and solutions.

During this chapter, we worked with a canonical dataset, the Iris dataset, and looked at it from various angles: as a problem in supervised and unsupervised learning and as an example for model verification.

In total, we have looked at four different algorithms: the Support Vector Machine, Linear Regression, K-Means clustering, and Principal Component Analysis. Each of these alone is worth exploring, and we barely scratched the surface, although we were able to implement all...

