Reader small image

You're reading from  Designing Machine Learning Systems with Python

Product typeBook
Published inApr 2016
Reading LevelBeginner
Publisher
ISBN-139781785882951
Edition1st Edition
Languages
Right arrow
Author (1)
David Julian
David Julian
author image
David Julian

David Julian is a freelance technology consultant and educator. He has worked as a consultant for government, private, and community organizations on a variety of projects, including using machine learning to detect insect outbreaks in controlled agricultural environments (Urban Ecological Systems Ltd., Bluesmart Farms), designing and implementing event management data systems (Sustainable Industry Expo, Lismore City Council), and designing multimedia interactive installations (Adelaide University). He has also written Designing Machine Learning Systems With Python for Packt Publishing and was a technical reviewer for Python Machine Learning and Hands-On Data Structures and Algorithms with Python - Second Edition, published by Packt.
Read more about David Julian

Right arrow

Chapter 2. Tools and Techniques

Python comes equipped with a large library of packages for machine learning tasks.

The packages we will look at in this chapter are as follows:

  • The IPython console

  • NumPy, which is an extension that adds support for multi-dimensional arrays, matrices, and high-level mathematical functions

  • SciPy, which is a library of scientific formulae, constants, and mathematical functions

  • Matplotlib, which is for creating plots

  • Scikit-learn, which is a library for machine learning tasks such as classification, regression, and clustering

There is only enough space to give you a flavor of these huge libraries, and an important skill is being able to find and understand the reference material for the various packages. It is impossible to present all the different functionality in a tutorial style documentation, and it is important to be able to find your way around the sometimes dense API references. A thing to remember is that the majority of these packages are put together by the...

Python for machine learning


Python is a versatile general purpose programming language. It is an interpreted language and can run interactively from a console. It does not require a compiler like C++ or Java, so the development time tends to be shorter. It is available for free download and can be installed on many different operating systems including UNIX, Windows, and Macintosh. It is especially popular for scientific and mathematical applications. Python is relatively easy to learn compared to languages such as C++ and Java, with similar tasks using fewer lines of code.

Python is not the only platform for machine learning, but it is certainly one of the most used. One of its major alternatives is R. Like Python, it is open source, and while it is popular for applied machine learning, it lacks the large development community of Python. R is a specialized tool for machine learning and statistical analysis. Python is a general-purpose, widely-used programming language that also has excellent...

IPython console


The Ipython package has had some significant changes with the release of version 4. A former monolithic package structure, it has been split into sub-packages. Several IPython projects have split into their own separate project. Most of the repositories have been moved to the Jupyter project (jupyter.org).

At the core of IPython is the IPython console: a powerful interactive interpreter that allows you to test your ideas in a very fast and intuitive way. Instead of having to create, save, and run a file every time you want to test a code snippet, you can simply type it into a console. A powerful feature of IPython is that it decouples the traditional read-evaluate-print loop that most computing platforms are based on. IPython puts the evaluate phase into its own process: a kernel (not to be confused with the kernel function used in machine learning algorithms). Importantly, more than one client can access the kernel. This means you can run code in a number of files and access...

Installing the SciPy stack


The SciPy stack consists of Python along with the most commonly used scientific, mathematical, and ML libraries. (visit: scipy.org). These include NumPy, Matplotlib, the SciPy library itself, and IPython. The packages can be installed individually on top of an existing Python installation, or as a complete distribution (distro). The easiest way to get started is using a distro, if you have not got Python installed on your computer. The major Python distributions are available for most platforms, and they contain everything you need in one package. Installing all the packages and their dependencies separately does take some time, but it may be an option if you already have a configured Python installation on your machine.

Most distributions give you all the tools you need, and many come with powerful developer environments. Two of the best are Anaconda (www.continuum.io/downloads) and Canopy (http://www.enthought.com/products/canopy/). Both have free and commercial...

NumPY


We should know that there is a hierarchy of types for representing data in Python. At the root are immutable objects such as integers, floats, and Boolean. Built on this, we have sequence types. These are ordered sets of objects indexed by non-negative integers. They are iterative objects that include strings, lists, and tuples. Sequence types have a common set of operations such as returning an element (s[i]) or a slice (s[i:j]), and finding the length (len(s)) or the sum (sum(s)). Finally, we have mapping types. These are collections of objects indexed by another collection of key objects. Mapping objects are unordered and are indexed by numbers, strings, or other objects. The built-in Python mapping type is the dictionary.

NumPy builds on these data objects by providing two further objects: an N-dimensional array object (ndarray) and a universal function object (ufunc). The ufunc object provides element-by-element operations on ndarray objects, allowing typecasting and array broadcasting...

Matplotlib


Matplotlib, or more importantly, its sub-package PyPlot, is an essential tool for visualizing two-dimensional data in Python. I will only mention it briefly here because its use should become apparent as we work through the examples. It is built to work like Matlab with command style functions. Each PyPlot function makes some change to a PyPlot instance. At the core of PyPlot is the plot method. The simplest implementation is to pass plot a list or a 1D array. If only one argument is passed to plot, it assumes it is a sequence of y values, and it will automatically generate the x values. More commonly, we pass plot two 1D arrays or lists for the co-ordinates x and y. The plot method can also accept an argument to indicate line properties such as line width, color, and style. Here is an example:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0., 5., 0.2)
plt.plot(x, x**4, 'r', x, x*90, 'bs', x, x**3, 'g^')
plt.show()

This code prints three lines in different...

Pandas


The Pandas library builds on NumPy by introducing several useful data structures and functionalities to read and process data. Pandas is a great tool for general data munging. It easily handles common tasks such as dealing with missing data, manipulating shapes and sizes, converting between data formats and structures, and importing data from different sources.

The main data structures introduced by Pandas are:

  • Series

  • The DataFrame

  • Panel

The DataFrame is probably the most widely used. It is a two-dimensional structure that is effectively a table created from either a NumPy array, lists, dicts, or series. You can also create a DataFrame by reading from a file.

Probably the best way to get a feel for Pandas is to go through a typical use case. Let's say that we are given the task of discovering how the daily maximum temperature has changed over time. For this example, we will be working with historical weather observations from the Hobart weather station in Tasmania. Download the following...

SciPy


SciPy (pronounced sigh pi) adds a layer to NumPy that wraps common scientific and statistical applications on top of the more purely mathematical constructs of NumPy. SciPy provides higher-level functions for manipulating and visualizing data, and it is especially useful when using Python interactively. SciPy is organized into sub-packages covering different scientific computing applications. A list of the packages most relevant to ML and their functions appear as follows:

Scikit-learn


This includes algorithms for the most common machine learning tasks, such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

Scikit-learn comes with several real-world data sets for us to practice with. Let's take a look at one of these—the Iris data set:

from sklearn import datasets
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
iris_X.shape
(150, 4)

The data set contains 150 samples of three types of irises (Setosa, Versicolor, and Virginica), each with four features. We can get a description on the dataset:

iris.DESCR

We can see that the four attributes, or features, are sepal width, sepal length, petal length, and petal width in centimeters. Each sample is associated with one of three classes. Setosa, Versicolor, and Virginica. These are represented by 0, 1, and 2 respectively.

Let's look at a simple classification problem using this data. We want to predict the type of iris based on its features: the...

Summary


We have seen a basic kit of machine learning tools and a few indications of their uses on simple datasets. What you may be beginning to wonder is how these tools can be applied to real-world problems. There is considerable overlap between each of the libraries we have discussed. Many perform the same task, but add or perform the same function in a different way. Choosing which library to use for each problem is not necessarily a definitive decision. There is no best library; there is only the preferred library, and this varies from person to person, and of course, to the specifics of the application.

In the next chapter, we will look at one of the most important, and often overlooked, aspects of machine learning, that is, data.

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Designing Machine Learning Systems with Python
Published in: Apr 2016Publisher: ISBN-13: 9781785882951
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
David Julian

David Julian is a freelance technology consultant and educator. He has worked as a consultant for government, private, and community organizations on a variety of projects, including using machine learning to detect insect outbreaks in controlled agricultural environments (Urban Ecological Systems Ltd., Bluesmart Farms), designing and implementing event management data systems (Sustainable Industry Expo, Lismore City Council), and designing multimedia interactive installations (Adelaide University). He has also written Designing Machine Learning Systems With Python for Packt Publishing and was a technical reviewer for Python Machine Learning and Hands-On Data Structures and Algorithms with Python - Second Edition, published by Packt.
Read more about David Julian

Package

Description

cluster

This contains two sub-packages:

cluster.vq for K-means clustering and vector quantization.

cluster.hierachy for hierarchical and agglomerative clustering, which is useful for distance matrices, calculating statistics on clusters, as well as visualizing clusters with dendrograms.

constants

These are physical and mathematical constants such as pi and e.

integrate

These are differential equation solvers

interpolate

These are interpolation functions for creating new data...