Python: End-to-end Data Analysis

By Ivan Idris, Luiz Felipe Martins, Martin Czygan, and others
About this book
Data analysis is the process of applying logical and analytical reasoning to study each component of the data present in a system. Python is a general-purpose, high-level programming language that offers a range of tools and libraries suitable for many purposes, and it has steadily evolved into one of the primary languages for data science. Have you ever imagined becoming an expert at effectively approaching data analysis problems, solving them, and extracting all of the available information from your data? If so, look no further: this is the course you need. In this course, we will get you started with Python data analysis by introducing the basics of data analysis and supporting Python libraries such as matplotlib, NumPy, and pandas. You will create visualizations by choosing color maps, shapes, sizes, and palettes, then delve into statistical data analysis using distributions and correlations. You'll then find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining. You'll be able to quickly and accurately perform hands-on sorting, reduction, and subsequent analysis, and fully appreciate how data analysis methods can support business decision-making. Finally, you will delve into advanced techniques such as performing regression, quantifying cause and effect using Bayesian methods, and discovering how to use Python's tools for supervised machine learning. The course provides highly practical content on data analysis with Python, drawn from the following Packt books: 1. Getting Started with Python Data Analysis. 2. Python Data Analysis Cookbook. 3. Mastering Python Data Analysis. By the end of this course, you will have all the knowledge you need to analyze data of varying complexity and turn it into actionable insights.
Publication date:
May 2017
Publisher
Packt
ISBN
9781788394697

   

Data is raw information that can exist in any form, usable or not. We can easily find data everywhere in our lives; for example, the price of gold on the day of writing was $1,158 per ounce. On its own, this has no meaning beyond describing the price of gold. This also shows that data is useful only in context.

By connecting and relating pieces of data, information emerges and allows us to expand our knowledge beyond the range of our senses. When we possess gold price data gathered over time, one piece of information we might have is that the price has continuously risen from $1,152 to $1,158 over three days. This could be used by someone who tracks gold prices.

Knowledge helps people to create value in their lives and work. This value is based on information that is organized, synthesized, or summarized to enhance comprehension, awareness, or understanding. It represents a state or potential for action and decisions. When the price of gold continuously increases for three days, it will likely decrease on the next day; this is useful knowledge.

The following figure illustrates the steps from data to knowledge; we call this the data analysis process, and we will introduce it in the next section:

Introducing Data Analysis and Libraries

In this chapter, we will cover the following topics:

Data is getting bigger and more diverse every day. Therefore, analyzing and processing data to advance human knowledge or to create value is a big challenge. To tackle these challenges, you will need domain knowledge and a variety of skills, drawing from areas such as computer science, artificial intelligence (AI) and machine learning (ML), and statistics and mathematics, as shown in the following figure:

Data analysis and processing

Let's go through data analysis and its domain knowledge:

Data analysis is a process composed of the following steps:

There are numerous data analysis libraries that help us process and analyze data. They are written in different programming languages, and have different advantages and disadvantages in solving various data analysis problems. Now, we will introduce some common libraries that may be useful for you. They should give you an overview of the libraries in the field; however, the rest of this book focuses on Python-based libraries.

Some of the libraries that use the Java language for data analysis are as follows:

Here are a few libraries that are implemented in C++:

Other libraries for data processing and analysis are as follows:

We cannot list every data analysis library here. However, the libraries above are enough to keep you busy learning and building data analysis applications. I hope you will enjoy them after reading this book.

Python is a multi-platform, general-purpose programming language that can run on Windows, Linux/Unix, and Mac OS X, and it has been ported to Java and .NET virtual machines as well. It has a powerful standard library. In addition, it has many libraries for data analysis: Pylearn2, Hebel, Pybrain, Pattern, MontePython, and MILK. In this book, we will cover some common Python data analysis libraries such as NumPy, Pandas, Matplotlib, PyMongo, and scikit-learn. Now, to help you get started, I will briefly present an overview of each library for those who are less familiar with the scientific Python stack.

NumPy


NumPy is the fundamental package supported for presenting and computing data with high performance in Python. It provides some interesting features as follows:

NumPy is a good starting package for getting familiar with arrays and array-oriented computing in data analysis. It is also the basic step in learning other, more effective tools such as Pandas, which we will see in the next chapter. We will be using NumPy version 1.9.1.

An array can be used to contain the values of a data object in an experiment or simulation step, the pixels of an image, or a signal recorded by a measurement device. For example, the latitude of the Eiffel Tower in Paris is 48.858598 and the longitude is 2.294495. These values can be presented in a NumPy array object as p:
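A minimal sketch of that construction:

```python
import numpy as np

# Latitude and longitude of the Eiffel Tower as a one-dimensional array
p = np.array([48.858598, 2.294495])
```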

This is a manual construction of an array using the np.array function. The standard convention is to import NumPy under the np alias, as in import numpy as np.

You can, of course, put from numpy import * in your code to avoid having to write np. However, you should be careful with this habit because of the potential code conflicts (further information on code conventions can be found in the Python Style Guide, also known as PEP8, at https://www.python.org/dev/peps/pep-0008/).

There are two requirements of a NumPy array: a fixed size at creation and a uniform, fixed data type, with a fixed size in memory. The following functions help you to get information on the p matrix:

>>> p.ndim    # getting dimension of array p
1
>>> p.shape   # getting size of each array dimension
(2,)
>>> len(p)    # getting dimension length of array p
2
>>> p.dtype    # getting data type of array p
dtype('float64')

There are five basic numerical types including Booleans (bool), integers (int), unsigned integers (uint), floating point (float), and complex. They indicate how many bits are needed to represent elements of an array in memory. Besides that, NumPy also has some types, such as intc and intp, that have different bit sizes depending on the platform.

See the following table for a listing of NumPy's supported data types:

| Type | Type code | Description | Range of values |
|---|---|---|---|
| bool | | Boolean stored as a byte | True/False |
| intc | | Similar to C int (int32 or int64) | |
| intp | | Integer used for indexing (same as C size_t) | |
| int8, uint8 | i1, u1 | Signed and unsigned 8-bit integer types | int8: -128 to 127; uint8: 0 to 255 |
| int16, uint16 | i2, u2 | Signed and unsigned 16-bit integer types | int16: -32768 to 32767; uint16: 0 to 65535 |
| int32, uint32 | i4, u4 | Signed and unsigned 32-bit integer types | int32: -2147483648 to 2147483647; uint32: 0 to 4294967295 |
| int64, uint64 | i8, u8 | Signed and unsigned 64-bit integer types | int64: -9223372036854775808 to 9223372036854775807; uint64: 0 to 18446744073709551615 |
| float16 | f2 | Half precision float: sign bit, 5-bit exponent, and 10-bit mantissa | |
| float32 | f4 / f | Single precision float: sign bit, 8-bit exponent, and 23-bit mantissa | |
| float64 | f8 / d | Double precision float: sign bit, 11-bit exponent, and 52-bit mantissa | |
| complex64, complex128, complex256 | c8, c16, c32 | Complex numbers represented by two 32-bit, 64-bit, or 128-bit floats, respectively | |
| object | O | Python object type | |
| string_ | S | Fixed-length string type | Declare a string dtype of length 10 with 'S10' |
| unicode_ | U | Fixed-length Unicode type | Similar to string_; for example, 'U10' |

We can easily convert or cast an array from one dtype to another using the astype method:
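For instance, a brief sketch of the cast:

```python
import numpy as np

a = np.array([1, 2, 3, 4], dtype=np.int64)
b = a.astype(np.float64)                   # casting creates a new array

c = np.array([1.9, 2.1]).astype(np.int64)  # float -> int truncates toward zero
```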

There are various functions provided to create an array object. They are very useful for us to create and store data in a multidimensional array in different situations.

Now, in the following table we will summarize some of NumPy's common functions and their use by examples for array creation:

Function

Description

Example

empty, empty_like

Create a new array of the given shape and type, without initializing elements

>>> np.empty([3,2], dtype=np.float64)
array([[0., 0.], [0., 0.], [0., 0.]])
>>> a = np.array([[1, 2], [4, 3]])
>>> np.empty_like(a)
array([[0, 0], [0, 0]])

eye, identity

Create an NxN identity matrix with ones on the diagonal and zeros elsewhere

>>> np.eye(2, dtype=np.int)
array([[1, 0], [0, 1]])

ones, ones_like

Create a new array with the given shape and type, filled with 1s for all elements

>>> np.ones(5)
array([1., 1., 1., 1., 1.])
>>> np.ones(4, dtype=np.int)
array([1, 1, 1, 1])
>>> x = np.array([[0,1,2], [3,4,5]])
>>> np.ones_like(x)
array([[1, 1, 1],[1, 1, 1]])

zeros, zeros_like

This is similar to ones, ones_like, but initializing elements with 0s instead

>>> np.zeros(5)
array([0., 0., 0., 0., 0.])
>>> np.zeros(4, dtype=np.int)
array([0, 0, 0, 0])
>>> x = np.array([[0, 1, 2], [3, 4, 5]])
>>> np.zeros_like(x)
array([[0, 0, 0],[0, 0, 0]])

arange

Create an array with evenly spaced values in a given interval

>>> np.arange(2, 5)
array([2, 3, 4])
>>> np.arange(4, 12, 5)
array([4, 9])

full, full_like

Create a new array with the given shape and type, filled with a selected value

>>> np.full((2,2), 3, dtype=np.int)
array([[3, 3], [3, 3]])
>>> x = np.ones(3)
>>> np.full_like(x, 2)
array([2., 2., 2.])

array

Create an array from the existing data

>>> np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])
array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])

asarray

Convert the input to an array

>>> a = [3.14, 2.46]
>>> np.asarray(a)
array([3.14, 2.46])

copy

Return an array copy of the given object

>>> a = np.array([[1, 2], [3, 4]])
>>> np.copy(a)
array([[1, 2], [3, 4]])

fromstring

Create 1-D array from a string or text

>>> np.fromstring('3.14 2.17', dtype=np.float, sep=' ')
array([3.14, 2.17])

Many helpful array functions are supported in NumPy for analyzing data. We will list some of those in common use. First, transposing is another kind of reshaping that returns a view on the original data array without copying anything:

In general, we have the swapaxes method that takes a pair of axis numbers and returns a view on the data, without making a copy:

Transposing is often used in matrix computations; for example, computing the inner matrix product XT.X using np.dot:
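A short sketch of the transposing, swapaxes, and inner-product operations described above:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)

# Transposing returns a view on the data, not a copy
xt = x.T                  # shape (3, 2)

# swapaxes takes a pair of axis numbers and also returns a view
y = x.swapaxes(0, 1)      # same result as x.T for a 2-D array

# Inner matrix product X^T . X via np.dot
inner = np.dot(x.T, x)    # shape (3, 3)
```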

Sorting data in an array is also an important demand in processing data. Let's take a look at some sorting functions and their use:
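For instance, a sketch of np.sort, the in-place ndarray.sort, and np.argsort (the sample values are our own):

```python
import numpy as np

a = np.array([[6, 34, 1, 6], [0, 5, 2, -1]])

srt = np.sort(a)        # returns a sorted copy along the last axis
a.sort(axis=0)          # sorts a in place along the given axis

b = np.array([3, 1, 2])
order = np.argsort(b)   # indices that would sort b
```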

See the following table for a listing of array functions:

Function

Description

Example

sin, cos, tan, cosh, sinh, tanh, arccos, arctan, deg2rad

Trigonometric and hyperbolic functions

>>> a = np.array([0.,30., 45.])
>>> np.sin(a * np.pi / 180)
array([0., 0.5, 0.70710678])

around, round, rint, fix, floor, ceil, trunc

Rounding elements of an array to the given or nearest number

>>> a = np.array([0.34, 1.65])
>>> np.round(a)
array([0., 2.])

sqrt, square, exp, expm1, exp2, log, log10, log1p, logaddexp

Computing the exponents and logarithms of an array

>>> np.exp(np.array([2.25, 3.16]))
array([9.4877, 23.5705])

add, negative, multiply, divide, power, subtract, mod, modf, remainder

Set of arithmetic functions on arrays

>>> a = np.arange(6)
>>> x1 = a.reshape(2,3)
>>> x2 = np.arange(3)
>>> np.multiply(x1, x2)
array([[0,1,4],[0,4,10]])

greater, greater_equal, less, less_equal, equal, not_equal

Perform elementwise comparison: >, >=, <, <=, ==, !=

>>> np.greater(x1, x2)
array([[False, False, False], [True, True, True]], dtype = bool)

With the NumPy package, we can easily solve many kinds of data processing task without writing complex loops. This helps us keep our code simple while also improving the program's performance. In this part, we introduce some mathematical and statistical functions.

See the following table for a listing of mathematical and statistical functions:

Function

Description

Example

sum

Calculate the sum of all the elements in an array or along the axis

>>> a = np.array([[2,4], [3,5]])
>>> np.sum(a, axis=0)
array([5, 9])

prod

Compute the product of array elements over the given axis

>>> np.prod(a, axis=1)
array([8, 15])

diff

Calculate the discrete difference along the given axis

>>> np.diff(a, axis=0)
array([[1,1]])

gradient

Return the gradient of an array

>>> np.gradient(a)
[array([[1., 1.], [1., 1.]]), array([[2., 2.], [2., 2.]])]

cross

Return the cross product of two arrays

>>> b = np.array([[1,2], [3,4]])
>>> np.cross(a,b)
array([0, -3])

std, var

Return standard deviation and variance of arrays

>>> np.std(a)
1.1180339
>>> np.var(a)
1.25

mean

Calculate arithmetic mean of an array

>>> np.mean(a)
3.5

where

Return elements chosen from x or y depending on a condition

>>> np.where([[True, True], [False, True]], [[1,2],[3,4]], [[5,6],[7,8]])
array([[1,2], [7, 4]])

unique

Return the sorted unique values in an array

>>> id = np.array(['a', 'b', 'c', 'c', 'd'])
>>> np.unique(id)
array(['a', 'b', 'c', 'd'], dtype='|S1')

intersect1d

Compute the sorted and common elements in two arrays

>>> a = np.array(['a', 'b', 'a', 'c', 'd', 'c'])
>>> b = np.array(['a', 'xyz', 'klm', 'd'])
>>> np.intersect1d(a,b)
array(['a', 'd'], dtype='|S3')

Loading and saving data

We can also save and load data to and from disk, in either text or binary format, using the functions supported in the NumPy package.
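A minimal sketch of saving and loading an array in both formats (the file names here are our own):

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 5]])

# Binary .npy format
np.save('array.npy', a)
b = np.load('array.npy')

# Plain-text format
np.savetxt('array.txt', a, delimiter=',')
c = np.loadtxt('array.txt', delimiter=',')  # loaded back as float64
```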

Linear algebra is a branch of mathematics concerned with vector spaces and the mappings between those spaces. NumPy has a package called linalg that supports powerful linear algebra functions. We can use these functions to find eigenvalues and eigenvectors or to perform singular value decomposition:

The function is implemented using the LAPACK _geev routine, which computes the eigenvalues and eigenvectors of general square matrices.

Another common problem is solving linear systems such as Ax = b with A as a matrix and x and b as vectors. The problem can be solved easily using the numpy.linalg.solve function:
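A sketch of both operations on a small matrix of our own:

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])

# Eigenvalues and eigenvectors of a general square matrix
w, v = np.linalg.eig(A)    # A @ v[:, i] == w[i] * v[:, i]

# Solving the linear system Ax = b
b = np.array([5., 11.])
x = np.linalg.solve(A, b)  # x is [1., 2.] since A @ [1, 2] = [5, 11]
```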

The following table summarizes some commonly used functions in the numpy.linalg package:

Function

Description

Example

dot

Calculate the dot product of two arrays

>>> a = np.array([[1, 0],[0, 1]])
>>> b = np.array( [[4, 1],[2, 2]])
>>> np.dot(a,b)
array([[4, 1],[2, 2]])

inner, outer

Calculate the inner and outer product of two arrays

>>> a = np.array([1, 1, 1])
>>> b = np.array([3, 5, 1])
>>> np.inner(a,b)
9

linalg.norm

Find a matrix or vector norm

>>> a = np.arange(3)
>>> np.linalg.norm(a)
2.23606

linalg.det

Compute the determinant of an array

>>> a = np.array([[1,2],[3,4]])
>>> np.linalg.det(a)
-2.0

linalg.inv

Compute the inverse of a matrix

>>> a = np.array([[1,2],[3,4]])
>>> np.linalg.inv(a)
array([[-2., 1.],[1.5, -0.5]])

linalg.qr

Calculate the QR decomposition

>>> a = np.array([[1,2],[3,4]])
>>> np.linalg.qr(a)
(array([[0.316, 0.948], [0.948, 0.316]]), array([[ 3.162, 4.427], [ 0., 0.632]]))

linalg.cond

Compute the condition number of a matrix

>>> a = np.array([[1,3],[2,4]])
>>> np.linalg.cond(a)
14.933034

trace

Compute the sum of the diagonal elements

>>> np.trace(np.arange(6).reshape(2,3))
4

An important part of any simulation is the ability to generate random numbers. For this purpose, NumPy provides various routines in the submodule random. It uses a particular algorithm, called the Mersenne Twister, to generate pseudorandom numbers.

First, we need to define a seed that makes the random numbers predictable. When the value is reset, the same numbers will appear every time. If we do not assign the seed, NumPy automatically selects a random seed value based on the system's random number generator device or on the clock:

An array of random numbers in the half-open interval [0.0, 1.0) can be generated as follows:

If we want to generate random integers in the half-open interval [min, max), we can use the randint(min, max, length) function:
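A sketch of the three steps (the seed value here is arbitrary):

```python
import numpy as np

np.random.seed(20)               # fixed seed: the same numbers appear every run

r = np.random.rand(2, 3)         # uniform floats in [0.0, 1.0)
i = np.random.randint(0, 10, 5)  # integers in the half-open interval [0, 10)
```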

NumPy also provides many other distributions, including the beta, binomial, chi-square, Dirichlet, exponential, F, gamma, geometric, and Gumbel distributions.

The following table will list some distribution functions and give examples for generating random numbers:

Function

Description

Example

binomial

Draw samples from a binomial distribution (n: number of trials, p: probability)

>>> n, p = 100, 0.2
>>> np.random.binomial(n, p, 3)
array([17, 14, 23])

dirichlet

Draw samples using a Dirichlet distribution

>>> np.random.dirichlet(alpha=(2,3), size=3)
array([[0.519, 0.480], [0.639, 0.36],
 [0.838, 0.161]])

poisson

Draw samples from a Poisson distribution

>>> np.random.poisson(lam=2, size= 2)
array([4,1])

normal

Draw samples using a normal Gaussian distribution

>>> np.random.normal(loc=2.5, scale=0.3, size=3)
array([2.4436, 2.849, 2.741])

uniform

Draw samples using a uniform distribution

>>> np.random.uniform(
low=0.5, high=2.5, size=3)
array([1.38, 1.04, 2.19])

We can also use the random number generator to shuffle items in a list. This is useful when we want to arrange a list in a random order:
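For example, a brief sketch:

```python
import numpy as np

a = np.arange(10)
np.random.shuffle(a)   # rearranges the array in place
```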

The following figure shows two distributions, binomial and Poisson, side by side with various parameters (the visualization was created with matplotlib, which will be covered in Chapter 4, Data Visualization):

NumPy random numbers

In this chapter, we covered a lot of information related to the NumPy package, especially commonly used functions that are very helpful to process and analyze data in ndarray. Firstly, we learned the properties and data type of ndarray in the NumPy package. Secondly, we focused on how to create and manipulate an ndarray in different ways, such as conversion from other structures, reading an array from disk, or just generating a new array with given values. Thirdly, we studied how to access and control the value of each element in ndarray by using indexing and slicing.

Then, we became familiar with some common functions and operations on ndarray.

Finally, we looked at some advanced functions related to statistics, linear algebra, and sampling data. These functions play an important role in data analysis.

While NumPy by itself does not provide much high-level data analytical functionality, understanding it will help you use tools such as Pandas much more effectively. Pandas will be discussed in the next chapter.

Practice exercises

Exercise 1: Using an array creation function, let's try to create array variables in the following situations:

Exercise 2: What is the difference between np.dot(a, b) and (a*b)?

Exercise 3: Consider the vector [1, 2, 3, 4, 5]. Build a new vector with four consecutive zeros interleaved between each value.

Exercise 4: Take the example data file chapter2-data.txt, which includes information on a system log, and solve the following tasks:

 

Let's first get acquainted with two of Pandas' primary data structures: the Series and the DataFrame. They can handle the majority of use cases in finance, statistic, social science, and many areas of engineering.

Series

A Series is a one-dimensional object, similar to an array, a list, or a column in a table. Each item in a Series is assigned to an entry in an index:

By default, if no index is passed, it will be created to have values ranging from 0 to N-1, where N is the length of the Series:

We can access the value of a Series by using the index:

This accessing method is similar to a Python dictionary. Therefore, Pandas also allows us to initialize a Series object directly from a Python dictionary:
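A minimal sketch of these ways of creating and accessing a Series (the sample values are our own):

```python
import pandas as pd

# A Series with an explicit index
s1 = pd.Series([0.1, 0.2, 0.3], index=['a', 'b', 'c'])

# With no index passed, a default integer index 0..N-1 is created
s2 = pd.Series([1, 2, 3])

# Accessing a value by its index label, like a Python dictionary
value = s1['b']          # 0.2

# Initializing a Series directly from a Python dictionary
s3 = pd.Series({'x': 5, 'y': 2})
```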

Sometimes, we want to filter or rename the index of a Series created from a Python dictionary. In that case, we can pass the selected index list directly to the initializing function, as in the example above. Only elements whose labels exist in the index list will be kept in the Series object. Conversely, labels that are missing from the dictionary are initialized to the default NaN value by Pandas:

The library also supports functions that detect missing data:

Similarly, we can also initialize a Series from a scalar value:

A Series object can be initialized with NumPy objects as well, such as ndarray. Moreover, Pandas can automatically align data indexed in different ways in arithmetic operations:
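These behaviors can be sketched as follows (again with values of our own):

```python
import numpy as np
import pandas as pd

d = {'a': 1.0, 'b': 2.0}

# Labels missing from the dictionary become NaN
s = pd.Series(d, index=['b', 'c'])     # b: 2.0, c: NaN
missing = s.isnull()                   # b: False, c: True

# A Series from a scalar value repeats it for every index label
s4 = pd.Series(3, index=['a', 'b'])

# Arithmetic automatically aligns data on index labels
x = pd.Series([1, 2], index=['a', 'b'])
y = pd.Series([10, 20], index=['b', 'c'])
z = x + y                              # a: NaN, b: 12.0, c: NaN
```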

The DataFrame

The DataFrame is a tabular data structure comprising a set of ordered columns and rows. It can be thought of as a group of Series objects that share an index (the column names). There are a number of ways to initialize a DataFrame object. Firstly, let's take a look at the common example of creating a DataFrame from a dictionary of lists:

By default, the DataFrame constructor orders the columns alphabetically. We can change that order by passing the columns attribute to the initializing function:

We can provide the index labels of a DataFrame similar to a Series:

We can construct a DataFrame out of nested lists as well:
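A sketch of the construction variants above (the column names are our own; note that recent Pandas versions preserve the dictionary's insertion order rather than sorting column names alphabetically):

```python
import pandas as pd

data = {'Year': [2000, 2005, 2010], 'Median_Age': [24.2, 26.4, 28.5]}

# From a dictionary of lists
df1 = pd.DataFrame(data)

# The columns argument fixes the column order explicitly
df2 = pd.DataFrame(data, columns=['Year', 'Median_Age'])

# Index labels can be supplied, just as for a Series
df3 = pd.DataFrame(data, index=['a', 'b', 'c'])

# From nested lists, naming the columns explicitly
df4 = pd.DataFrame([[2000, 24.2], [2005, 26.4]],
                   columns=['Year', 'Median_Age'])
```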

Columns can be accessed by column name, just as in a Series: either by dictionary-like notation or as an attribute, if the column name is a syntactically valid attribute name:

To modify or append a new column to the created DataFrame, we specify the column name and the value we want to assign:

Using a couple of methods, rows can be retrieved by position or name:
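A sketch of column and row access (we use the .loc/.iloc accessors available in current Pandas, with sample data of our own):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2000, 2005], 'Median_Age': [24.2, 26.4]})

# Column access: dictionary-like notation or attribute access
ages = df['Median_Age']
years = df.Year

# Modifying or appending a column by assignment
df['After_2004'] = df['Year'] > 2004

# Row access by integer position or by index label
row_by_position = df.iloc[0]
row_by_label = df.loc[0]   # the default integer index doubles as the label here
```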

A DataFrame object can also be created from different data structures such as a list of dictionaries, a dictionary of Series, or a record array. The method to initialize a DataFrame object is similar to the examples above.

Another common case is to provide a DataFrame with data from a location such as a text file. In this situation, we use the read_csv function that expects the column separator to be a comma, by default. However, we can change that by using the sep parameter:
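A sketch of reading delimited text with a non-default separator (the sample data is our own, read from an in-memory buffer instead of a file):

```python
import io
import pandas as pd

text = "Year;Median_Age\n2000;24.2\n2005;26.4\n"

# sep overrides the default comma separator
df = pd.read_csv(io.StringIO(text), sep=';')
```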

While reading a data file, we sometimes want to skip a line or an invalid value. As of Pandas 0.16.2, read_csv supports over 50 parameters for controlling the loading process. Some commonly useful parameters are as follows:

Moreover, Pandas also has support for reading and writing a DataFrame directly from or to a database such as the read_frame or write_frame function within the Pandas module. We will come back to these methods later in this chapter.

Series

A

Series is a one-dimensional object similar to an array, list, or column in table. Each item in a Series is assigned to an entry in an index:

By default, if no index is passed, it will be created to have values ranging from 0 to N-1, where N is the length of the Series:

We can access the value of a Series by using the index:

This accessing method is similar to a Python dictionary. Therefore, Pandas also allows us to initialize a Series object directly from a Python dictionary:

Sometimes, we want to filter or rename the index of a Series created from a Python dictionary. At such times, we can pass the selected index list directly to the initial function, similarly to the process in the above example. Only elements that exist in the index list will be in the Series object. Conversely, indexes that are missing in the dictionary are initialized to default NaN values by Pandas:

The library also supports functions that detect missing data:

Similarly, we can also initialize a Series from a scalar value:

A Series object can be initialized with NumPy objects as well, such as ndarray. Moreover, Pandas can automatically align data indexed in different ways in arithmetic operations:

The DataFrame is a tabular data structure comprising a set of ordered columns and rows. It can be thought of as a group of Series objects that share an index (the column names). There are a number of ways to initialize a DataFrame object. Firstly, let's take a look at the common example of creating DataFrame from a dictionary of lists:

By default, the DataFrame constructor will order the column alphabetically. We can edit the default order by passing the column's attribute to the initializing function:

We can provide the index labels of a DataFrame similar to a Series:

We can construct a DataFrame out of nested lists as well:

Columns can be accessed by column name as a Series can, either by dictionary-like notation or as an attribute, if the column name is a syntactically valid attribute name:

To modify or append a new column to the created DataFrame, we specify the column name and the value we want to assign:

Using a couple of methods, rows can be retrieved by position or name:
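A sketch of column access, column assignment, and row retrieval. The book's pandas generation used .ix and .irow for rows; in current pandas releases, .loc (by label) and .iloc (by position) are the equivalents, which is what we use here:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2013, 2014],
                   'Median_Age': [24.2, 26.4]},
                  index=['e0', 'e1'])

col = df['Year']        # dictionary-like column access
col_attr = df.Year      # attribute-style access

df['Rank'] = [1, 2]     # append a new column by assignment

row_by_name = df.loc['e1']   # row by label
row_by_pos = df.iloc[0]      # row by integer position
print(row_by_pos['Year'])    # 2013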

A DataFrame object can also be created from different data structures such as a list of dictionaries, a dictionary of Series, or a record array. The method to initialize a DataFrame object is similar to the examples above.

Another common case is to provide a DataFrame with data from a location such as a text file. In this situation, we use the read_csv function, which expects the column separator to be a comma by default. However, we can change that by using the sep parameter:
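A sketch of read_csv with a non-default separator; an in-memory buffer stands in for a file on disk here, but the same call works with a filename:

```python
import io
import pandas as pd

# A semicolon-separated "file".
text = "name;age\nann;22\nbob;31\n"

# sep overrides the default comma separator.
df = pd.read_csv(io.StringIO(text), sep=';')
print(df.shape)  # (2, 2)
```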

While reading a data file, we sometimes want to skip a line or an invalid value. As of Pandas 0.16.2, read_csv supports over 50 parameters for controlling the loading process. Some common useful parameters are as follows:

Moreover, Pandas also supports reading and writing a DataFrame directly from or to a database, using functions such as read_frame or write_frame within the Pandas module. We will come back to these methods later in this chapter.


Pandas supports many essential functionalities for manipulating its data structures. In this book, we will focus on the most important features regarding exploration and analysis.

Functional statistics

The statistics methods supported by a library are really important in data analysis. To get insight into a big data object, we need to know some summarized information such as the mean, sum, or quantiles. Pandas supports a large number of methods to compute them. Let's consider a simple example of calculating the sum of df5, which is a DataFrame object:

When we do not specify which axis we want to sum over, the function will, by default, calculate along the index axis, which is axis 0:

We also have the skipna parameter, which lets us decide whether to exclude missing data. By default, it is set to True:
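Since the book's df5 is not reproduced here, a stand-in DataFrame sketches the axis and skipna behavior:

```python
import numpy as np
import pandas as pd

# A hypothetical stand-in for the book's df5, with one NaN.
df5 = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})

col_sums = df5.sum()              # default axis=0: one sum per column
row_sums = df5.sum(axis=1)        # axis=1: one sum per row
keep_nan = df5.sum(skipna=False)  # NaN propagates when skipna=False

print(col_sums['a'], col_sums['b'])  # 1.0 7.0
```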

Another function that we want to consider is describe(). It conveniently summarizes most of the statistical information of a data structure, such as a Series or DataFrame:

We can specify percentiles to include or exclude in the output by using the percentiles parameter; for example, consider the following:
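A short sketch with made-up data; describe() always includes the median (50%) alongside the percentiles we request:

```python
import pandas as pd

s = pd.Series(range(1, 11))

# Request the 10th and 90th percentiles in the summary.
summary = s.describe(percentiles=[0.1, 0.9])
print(summary)
```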

Here, we have a summary table for common supported statistics functions in Pandas:

Function

Description

idxmin(axis), idxmax(axis)

These compute the index labels with the minimum or maximum corresponding values.

value_counts()

This computes the frequency of unique values.

count()

This returns the number of non-null values in a data object.

mean(), median(), min(), max()

These return the mean, median, minimum, and maximum values along an axis of a data object.

std(), var(), sem()

These return the standard deviation, variance, and standard error of the mean.

abs()

This returns the absolute values of a data object.

Sorting

There are two kinds of sorting methods that we are interested in: sorting by row or column index and sorting by data value.

Firstly, we will consider methods for sorting by row and column index. In this case, we have the sort_index() function. The axis parameter sets whether the function should sort by rows or columns. The ascending option, with a True or False value, allows us to sort data in ascending or descending order. The default setting for this option is True:

Series has an order method that sorts by value. For NaN values in the object, we can also request special treatment via the na_position option:

Besides that, Series also has the sort() function, which sorts data by value. However, this function sorts in place and does not return a copy of the sorted data:

If we want to apply a sort function to a DataFrame object, we need to specify which columns or rows will be sorted:

If we do not want to automatically save the sorting result to the current data object, we can set the inplace parameter to False.
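A sketch of both kinds of sorting. Note that Series.order() and the in-place Series.sort() from this pandas generation were later removed; in current releases, sort_values covers both, and it is what we use here:

```python
import pandas as pd

s = pd.Series([3.0, 1.0, None, 2.0], index=['c', 'a', 'd', 'b'])

by_index = s.sort_index()  # sort by index labels, ascending by default

# sort_values replaces the older order()/sort(); na_position
# controls where NaN values land.
by_value = s.sort_values(na_position='first')

print(by_value.index.tolist())  # ['d', 'a', 'b', 'c']
```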


In this section, we will focus on how to get, set, or slice subsets of Pandas data structure objects. As we learned in previous sections, Series or DataFrame objects have axis labeling information. This information can be used to identify items that we want to select or assign a new value to in the object:

If the data object is a DataFrame structure, we can also proceed in a similar way:

For label indexing on the rows of DataFrame, we use the ix function that enables us to select a set of rows and columns in the object. There are two parameters that we need to specify: the row and column labels that we want to get. By default, if we do not specify the selected column names, the function will return selected rows with all columns in the object:
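The ix indexer described here was deprecated and later removed from pandas; .loc is the current equivalent for label indexing, with the same row-then-column call shape, so we sketch with it (data is our own example):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2013, 2014, 2015],
                   'Median_Age': [24.2, 26.4, 28.5]},
                  index=['e0', 'e1', 'e2'])

# Rows first, then columns; in the book's pandas this was df.ix[...].
subset = df.loc[['e0', 'e2'], ['Year']]

# Omitting the column labels returns all columns for the rows.
all_cols = df.loc['e1']
print(subset.shape)  # (2, 1)
```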

Moreover, we have many ways to select and edit data contained in a Pandas object. We summarize these functions in the following table:

Method

Description

icol, irow

These select a single row or column by integer location.

get_value, set_value

These select or set a single value of a data object by row and column label.

xs

This selects a single column or row as a Series by label.

Let's start with correlation and covariance computation between two data objects. Both the Series and DataFrame have a cov method. On a DataFrame object, this method will compute the covariance between the Series inside the object:

Usage of the correlation method is similar to the covariance method. It computes the correlation between Series inside a data object in case the data object is a DataFrame. However, we need to specify which method will be used to compute the correlations. The available methods are pearson, kendall, and spearman. By default, the function applies the pearson method:

We also have the corrwith function that supports calculating correlations between Series that have the same label contained in different DataFrame objects:
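A sketch of all three calls on made-up data; corrwith matches columns by label, so a column absent from the other frame yields NaN:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [2.0, 4.0, 6.0, 8.0]})
df2 = pd.DataFrame({'a': [4.0, 3.0, 2.0, 1.0]})

cov_matrix = df1.cov()                    # pairwise covariance of columns
corr_matrix = df1.corr(method='pearson')  # pearson is the default method

# Correlate same-labelled columns across two DataFrames.
cross = df1.corrwith(df2)
print(corr_matrix.loc['a', 'b'])
```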

In this section, we will discuss missing, NaN, or null values in Pandas data structures. It is very common to encounter missing data in an object. One such case that creates missing data is reindexing:

To manipulate missing values, we can use the isnull() or notnull() functions to detect the missing values in a Series object, as well as in a DataFrame object:

On a Series, we can drop all null data and index values by using the dropna function:

With a DataFrame object, it is a little more complex than with a Series. We can tell which rows or columns we want to drop, and also whether all entries must be null or a single null value is enough. By default, the function will drop any row containing a missing value:

Another way to control missing values is to use the supported parameters of functions that we introduced in the previous section. They are also very useful to solve this problem. In our experience, we should assign a fixed value in missing cases when we create data objects. This will make our objects cleaner in later processing steps. For example, consider the following:

We can also use the fillna function to fill missing values with a custom value:
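A compact sketch of dropna and fillna with our own sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0]})

any_dropped = df.dropna()           # drop rows with any NaN
all_dropped = df.dropna(how='all')  # drop rows where every value is NaN
filled = df.fillna(0.0)             # replace NaN with a fixed value

print(len(any_dropped), len(all_dropped))  # 1 2
```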

In this section, we will consider some advanced Pandas use cases.

Hierarchical indexing

Hierarchical indexing provides us with a way to work with higher-dimensional data in a lower dimension by structuring the data object into multiple index levels on an axis:

In the preceding example, we have a Series object that has two index levels. The object can be rearranged into a DataFrame using the unstack function. In an inverse situation, the stack function can be used:
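A sketch of a two-level Series and the unstack/stack round trip (the labels and values are our own):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2)])
s = pd.Series([10, 20, 30, 40], index=idx)

df = s.unstack()   # inner index level becomes the columns
back = df.stack()  # stack inverts unstack

print(df.shape)  # (2, 2)
```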

We can also create a DataFrame to have a hierarchical index in both axes:

The methods for getting or setting values or subsets of the data objects with multiple index levels are similar to those of the nonhierarchical case:

After grouping data into multiple index levels, we can also use most of the descriptive and statistics functions that have a level option, which can be used to specify the level we want to process:

The Panel data

The Panel is another data structure for three-dimensional data in Pandas. However, it is less frequently used than the Series or the DataFrame. You can think of a Panel as a table of DataFrame objects. We can create a Panel object from a 3D ndarray or a dictionary of DataFrame objects:

Each item in a Panel is a DataFrame. We can select an item by its name:

Alternatively, if we want to select data via an axis or data position, we can use the ix method, like on Series or DataFrame:


We have finished covering the basics of the Pandas data analysis library. Whenever you learn about a library for data analysis, you should consider the three parts that we explained in this chapter:

  • Data structures: we have two common data object types in the Pandas library, the Series and the DataFrame.
  • Methods to access and manipulate data objects: Pandas supports many ways to select, set, or slice subsets of a data object. However, the general mechanism is to use index labels or the positions of items to identify values.
  • Functions and utilities: they are the most important part of a powerful library. In this chapter, we covered all the common supported functions of Pandas, which allow us to compute statistics on data easily.

The library also has many other useful functions and utilities that we could not explain in this chapter. We encourage you to start your own research if you want to expand your experience with Pandas. It helps us to process large data in an optimized way. You will see more of Pandas in action later in this book.

Until now, we learned about two popular Python libraries: NumPy and Pandas. Pandas is built on NumPy, and as a result it allows for a bit more convenient interaction with data. However, in some situations, we can flexibly combine both of them to accomplish our goals.

Practice exercises

The link https://www.census.gov/2010census/csv/pop_change.csv contains a US census dataset. It has 23 columns and one row for each US state, as well as a few rows for macro regions such as North, South, and West.

  • Get this dataset into a Pandas DataFrame. Hint: just skip those rows that do not seem helpful, such as comments or description.
  • While the dataset contains change metrics for each decade, we are interested in the population change during the second half of the twentieth century, that is, between 1950 and 2000. Which region has seen the biggest and the smallest population growth in this time span? Also, which US state?

Advanced open-ended exercise:

  • Find more census data on the internet; not just on the US but on the world's countries. Try to find GDP data for the same time as well. Try to align this data to explore patterns. How are GDP and population growth related? Are there any special cases, such as countries with high GDP but low population growth, or countries with the opposite history?
 

Data visualization is concerned with the presentation of data in a pictorial or graphical form. It is one of the most important tasks in data analysis, since it enables us to see analytical results, detect outliers, and make decisions for model building. There are many Python libraries for visualization, of which matplotlib, seaborn, bokeh, and ggplot are among the most popular. However, in this chapter, we mainly focus on the matplotlib library that is used by many people in many different contexts.

Matplotlib produces publication-quality figures in a variety of formats, and interactive environments across Python platforms. Another advantage is that Pandas comes equipped with useful wrappers around several matplotlib plotting routines, allowing for quick and handy plotting of Series and DataFrame objects.

The IPython package started as an alternative to the standard interactive Python shell, but has since evolved into an indispensable tool for data exploration, visualization, and rapid prototyping. It is possible to use the graphical capabilities offered by matplotlib from IPython through various options, of which the simplest to get started with is the pylab flag:

This flag will preload matplotlib and numpy for interactive use with the default matplotlib backend. IPython can run in various environments: in a terminal, as a Qt application, or inside a browser. These options are worth exploring, since IPython has enjoyed adoption for many use cases, such as prototyping, interactive slides for more engaging conference talks or lectures, and as a tool for sharing research.

The easiest way to get started with plotting using matplotlib is often by using the MATLAB-like API that is supported by the package:

The output for the preceding command is as follows:

The matplotlib API primer

However, star imports should not be used unless there is a good reason for doing so. In the case of matplotlib, we can use the canonical import:

The preceding example could then be written as follows:
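A minimal sketch of the canonical-import version; the data values are our own, and the Agg backend is an assumption so the example runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend assumption (no GUI needed)
import matplotlib.pyplot as plt

# Everything lives under the plt namespace instead of a star import.
fig = plt.figure()
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel('x')
plt.ylabel('y')

print(len(fig.axes))  # 1
```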

The output for the preceding command is as follows:

The matplotlib API primer

If we only provide a single argument to the plot function, it will automatically use it as the y values and generate the x values from 0 to N-1, where N is equal to the number of values:

The output for the preceding command is as follows:

The matplotlib API primer

By default, the range of the axes is constrained by the range of the input x and y data. If we want to specify the viewport of the axes, we can use the axis() method to set custom ranges. For example, in the previous visualization, we could increase the range of the x axis from [0, 5] to [0, 6], and that of the y axis from [0, 9] to [0, 10], by writing the following command:

By default, all plotting commands apply to the current figure and axes. In some situations, we want to visualize data in multiple figures and axes to compare different plots or to use the space on a page more efficiently. There are two steps required before we can plot the data. Firstly, we have to define which figure we want to plot. Secondly, we need to figure out the position of our subplot in the figure:
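The two steps above can be sketched as follows; the figure name and titles are our own examples, and the Agg backend is assumed for headless execution:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend assumption
import matplotlib.pyplot as plt

fig = plt.figure('a')        # step 1: select (or create) a figure

ax1 = plt.subplot(2, 2, 1)   # step 2: position within a 2x2 grid
ax1.plot([0, 1], [0, 1])

ax2 = plt.subplot(2, 2, 2)
ax2.set_title('subplot 2')   # modify the currently selected subplot

print(len(fig.axes))  # 2
```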

The output for the preceding command is as follows:

Figures and subplots

In this case, we currently have the figure a. If we want to modify any subplot in figure a, we first call the command to select the figure and subplot, and then execute the function to modify the subplot. Here, for example, we change the title of the second plot of our four-plot figure:

The output for the preceding command is as follows:

Figures and subplots

There is a convenience method, plt.subplots(), for creating a figure that contains a given number of subplots. As in the previous example, we can use the plt.subplots(2,2) command to create a 2x2 figure that consists of four subplots.

We can also create the axes manually, instead of on a rectangular grid, by using the plt.axes([left, bottom, width, height]) command, where all input parameters are in fractional [0, 1] coordinates:

The output for the preceding command is as follows:

Figures and subplots

However, when you manually create axes, it takes more time to balance coordinates and sizes between subplots to arrive at a well-proportioned figure.

We have looked at how to create simple line plots so far. The matplotlib library supports many more plot types that are useful for data visualization. However, our goal is to provide the basic knowledge that will help you to understand and use the library for visualizing data in the most common situations. Therefore, we will only focus on four kinds of plot types: scatter plots, bar plots, contour plots, and histograms.

Scatter plots

A
Bar plots

A bar plot presents data with rectangular bars whose lengths are proportional to the values that they represent.
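A minimal bar-plot sketch; the categories and values are hypothetical stand-ins:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

labels = ['spring', 'summer', 'fall', 'winter']   # hypothetical categories
values = [10, 24, 17, 8]

fig, ax = plt.subplots()
bars = ax.bar(range(len(values)), values)          # one rectangle per category
ax.set_xticks(range(len(values)))
ax.set_xticklabels(labels)
```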
Contour plots
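A minimal contour-plot sketch, using a simple paraboloid as the stand-in surface:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

x = np.linspace(-1, 1, 50)
y = np.linspace(-1, 1, 50)
X, Y = np.meshgrid(x, y)              # grid of (x, y) coordinates
Z = X ** 2 + Y ** 2                   # the third variable, as a height over the grid

fig, ax = plt.subplots()
cs = ax.contour(X, Y, Z)              # lines of constant Z
ax.clabel(cs, inline=True)            # write the level value on each line
```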

We use contour plots to present the relationship between three numeric variables in two dimensions.

Histogram plots

A histogram represents the distribution of numerical data by grouping the values into bins and counting the number of observations that fall into each bin.
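A minimal histogram sketch over normally distributed stand-in data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.RandomState(1)
data = rng.randn(1000)                                 # 1000 samples from a normal distribution

fig, ax = plt.subplots()
counts, bin_edges, patches = ax.hist(data, bins=20)    # 20 equal-width bins
```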

Legends are an important element that is used to identify the plot elements in a figure. The easiest way to show a legend inside a figure is to use the label argument of the plot function, and show the labels by calling the plt.legend() method:
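A sketch of this pattern with two stand-in curves; each plot call carries a label, and plt.legend() (here called on the axes) displays them:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 20)
fig, ax = plt.subplots()
ax.plot(x, x ** 2, label='y = x^2')     # the label argument names each line...
ax.plot(x, x ** 3, label='y = x^3')
legend = ax.legend(loc='upper left')    # ...and legend() shows them in a box
```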

The output for the preceding command is as follows:

Legends and annotations

The loc argument in the legend command specifies the position of the label box. There are several valid location options: lower left, right, upper left, lower center, upper right, center, lower right, center right, best, upper center, and center left. The default position setting is upper right. However, if we set an invalid location option that does not exist in this list, the function automatically falls back to the best option.

If we want to split the legend into multiple boxes in a figure, we can manually set the labels for the plot lines, as shown in the following example:

Legends and annotations

The output for the preceding command is as follows:

The other element in a figure that we want to introduce is annotations, which can consist of text, arrows, or other shapes to explain parts of the figure in detail, or to emphasize special data points. There are different methods for showing annotations, such as the text, arrow, and annotate functions.

Here is a simple example to illustrate the annotate and text functions:
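A minimal sketch of both functions on a stand-in parabola: text() places free-floating text, while annotate() draws an arrow from a label to a data point:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

x = np.linspace(-2, 2, 100)
y = x ** 2
fig, ax = plt.subplots()
ax.plot(x, y)
ax.text(-1.5, 3.0, 'y = x^2')                          # free-floating text
ax.annotate('minimum', xy=(0, 0), xytext=(1, 2),
            arrowprops=dict(arrowstyle='->'))          # arrow pointing at the data point
```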

The output for the preceding command is as follows:

Legends and annotations

We have covered most of the important components of a plot figure using matplotlib. In this section, we will introduce another powerful plotting method for creating standard visualizations directly from Pandas data objects, which are often used to manipulate data.

For Series or DataFrame objects in Pandas, most plotting types are supported, such as line, bar, box, histogram, and scatter plots, and pie charts. To select a plot type, we use the kind argument of the plot function. With no kind specified, the plot function generates a line-style visualization by default, as in the following example:

The output for the preceding command is as follows:

Plotting functions with Pandas

Another example will visualize the data of a DataFrame object consisting of multiple columns:
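A sketch with a small stand-in DataFrame; subplots=True (from the table below) gives each column its own subplot:

```python
import matplotlib
matplotlib.use("Agg")
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 2, 5], 'b': [2, 2, 4, 4]})   # hypothetical two-column data
axes = df.plot(subplots=True)                               # one subplot per column
```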

The output for the preceding command is as follows:

Plotting functions with Pandas

The plot method of the DataFrame has a number of options that allow us to handle the plotting of the columns. For example, in the above DataFrame visualization, we chose to plot the columns in separate subplots. The following table lists more options:

Argument        Value        Description

subplots        True/False   Plots each data column in a separate subplot

logy            True/False   Uses a logarithmic scale on the y axis

secondary_y     True/False   Plots data on a secondary y axis

sharex, sharey  True/False   Shares the same x or y axis, linking ticks and limits

Besides matplotlib, there are other powerful data visualization toolkits based on Python. While we cannot dive deeper into these libraries, we would like to at least briefly introduce them in this section.

Bokeh

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. It aims to provide elegant, concise graphics with high-performance interactivity over large or streaming datasets.
MayaVi

MayaVi is a library for interactive scientific data visualization and 3D plotting in Python, built on top of the VTK toolkit.

We have finished covering most of the basics of data visualization, such as functions, arguments, and properties, based on the matplotlib library. We hope that, through the examples, you will be able to understand and apply them to your own problems. In general, to visualize data, we need to consider five steps. The first is getting data into suitable Python or Pandas data structures, such as lists, dictionaries, Series, or DataFrames; we explained how to accomplish this step in the previous chapters. The second step is defining plots and subplots for the data object in question, as discussed in the figures and subplots section. The third step is selecting a plot style and its attributes to show in the subplots, such as line, bar, histogram, or scatter plot, along with the line style and color. The fourth step is adding extra components to the subplots, such as legends, annotations, and text. The fifth step is displaying or saving the results.

By now, you can do quite a few things with a dataset, such as manipulation, cleaning, exploration, and visualization, based on Python libraries such as NumPy, Pandas, and matplotlib. You can now combine this knowledge and practice with these libraries to become more and more familiar with Python data analysis.

Practice exercises:

 

Time series typically consist of a sequence of data points coming from measurements taken over time. This kind of data is very common and occurs in a multitude of fields.

A business executive is interested in stock prices, prices of goods and services or monthly sales figures. A meteorologist takes temperature measurements several times a day and also keeps records of precipitation, humidity, wind direction and force. A neurologist can use electroencephalography to measure electrical activity of the brain along the scalp. A sociologist can use campaign contribution data to learn about political parties and their supporters and use these insights as an argumentation aid. More examples for time series data can be enumerated almost endlessly.

Python supports date and time handling in the datetime and time modules from the standard library:

Sometimes, dates are given or expected as strings, so a conversion from or to strings is necessary, which is realized by two functions: strptime and strftime, respectively:
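A minimal sketch of both directions of the conversion; the date and the format string are examples, not the book's listing:

```python
from datetime import datetime

d = datetime(2015, 7, 8, 12, 30)
as_string = d.strftime('%d/%m/%Y %H:%M')                            # datetime -> string
parsed = datetime.strptime('08/07/2015 12:30', '%d/%m/%Y %H:%M')    # string -> datetime
```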

Real-world data usually comes in all kinds of shapes, and it would be great if we did not need to remember the exact date format specifiers for parsing. Thankfully, Pandas abstracts away a lot of the friction when dealing with strings representing dates or times. One of these helper functions is to_datetime:

The last date string can refer to August 7th or July 8th, depending on the region. To disambiguate this case, to_datetime can be passed the keyword argument dayfirst:
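A sketch of the ambiguity and its resolution with dayfirst:

```python
import pandas as pd

iso = pd.to_datetime('2015-08-07')                       # ISO-style strings parse unambiguously
ambiguous = pd.to_datetime('08/07/2015')                 # month first by default: August 7th
day_first = pd.to_datetime('08/07/2015', dayfirst=True)  # read as July 8th instead
```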

Timestamp objects can be seen as Pandas' version of datetime objects and indeed, the Timestamp class is a subclass of datetime:

This means they can be used interchangeably in many cases:

Timestamp objects are an important part of time series capabilities of Pandas, since timestamps are the building block of DateTimeIndex objects:

There are a few things to note here: we create a list of timestamp objects and pass it to the series constructor as the index. This list of timestamps gets converted into a DatetimeIndex on the fly. If we had passed only the date strings, we would not get a DatetimeIndex, just a plain Index:

However, the to_datetime function is flexible enough to help if all we have is a list of date strings:
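A sketch of all three cases with stand-in dates and values:

```python
import pandas as pd

stamps = [pd.Timestamp('2015-01-01'), pd.Timestamp('2015-01-02'), pd.Timestamp('2015-01-03')]
ts = pd.Series([1, 2, 3], index=stamps)            # timestamps become a DatetimeIndex

plain = pd.Series([1, 2, 3],
                  index=['2015-01-01', '2015-01-02', '2015-01-03'])   # strings stay a plain Index

converted = pd.Series([1, 2, 3],
                      index=pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03']))
```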

Another thing to note is that while we have a DatetimeIndex, the freq and tz attributes are both None. We will learn about the utility of both attributes later in this chapter.

With to_datetime, we are able to convert a variety of strings, and even lists of strings, into timestamp or DatetimeIndex objects. Sometimes we are not explicitly given all the information about a series, and we have to generate sequences of timestamps at fixed intervals ourselves.

Pandas offers another great utility function for this task: date_range.

The date_range function helps to generate a fixed frequency datetime index between start and end dates. It is also possible to specify either the start or end date and the number of timestamps to generate.

The frequency can be specified by the freq parameter, which supports a number of offsets. You can use typical time intervals like hours, minutes, and seconds:

The freq attribute allows us to specify a multitude of options. Pandas has been used successfully in finance and economics, not least because it is really simple to work with business dates as well. As an example, to get an index with the first three business days of the millennium, the B offset alias can be used:
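A sketch of both uses of freq; note that January 1, 2000 was a Saturday, so the business-day range starts on Monday the 3rd:

```python
import pandas as pd

business = pd.date_range('2000-01-01', periods=3, freq='B')   # first three business days of 2000
hourly = pd.date_range('2015-01-01', periods=4, freq='h')     # fixed one-hour intervals
```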

The following table shows the available offset aliases, which can also be looked up in the Pandas documentation on time series at http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases:

Alias   Description

B       Business day frequency
C       Custom business day frequency
D       Calendar day frequency
W       Weekly frequency
M       Month end frequency
BM      Business month end frequency
CBM     Custom business month end frequency
MS      Month start frequency
BMS     Business month start frequency
CBMS    Custom business month start frequency
Q       Quarter end frequency
BQ      Business quarter end frequency
QS      Quarter start frequency
BQS     Business quarter start frequency
A       Year end frequency
BA      Business year end frequency
AS      Year start frequency
BAS     Business year start frequency
BH      Business hour frequency
H       Hourly frequency
T       Minutely frequency
S       Secondly frequency
L       Milliseconds
U       Microseconds
N       Nanoseconds

Moreover, the offset aliases can be used in combination as well. Here, we are generating a datetime index with five elements, each one day, one hour, one minute and one second apart:
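A sketch of such a combined alias (written here with the lowercase forms of the hour, minute, and second aliases):

```python
import pandas as pd

idx = pd.date_range('2015-01-01', periods=5, freq='1D1h1min1s')   # 1 day + 1 hour + 1 minute + 1 second
step = idx[1] - idx[0]
```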

If we want to index data every 12 hours of our business time, which by default starts at 9 AM and ends at 5 PM, we would simply prefix the BH alias:

A custom definition of what a business hour means is also possible:

We can use this custom business hour to build indexes as well:

Some frequencies allow us to specify an anchoring suffix, which allows us to express intervals, such as every Friday or every second Tuesday of the month:
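A sketch of two anchored frequencies; W-FRI anchors the weekly alias on Fridays, and WOM-2TUE selects the second Tuesday of each month:

```python
import pandas as pd

fridays = pd.date_range('2015-01-01', periods=3, freq='W-FRI')             # every Friday
second_tuesdays = pd.date_range('2015-01-01', periods=3, freq='WOM-2TUE')  # second Tuesday monthly
```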

Finally, we can merge various indexes of different frequencies. The possibilities are endless. We only show one example, where we combine two indexes – each over a decade – one pointing to every first business day of a year and one to the last day of February:

We see that 2000 and 2005 did not start on a weekday, and that 2000, 2004, and 2008 were leap years.

We have seen two powerful functions so far, to_datetime and date_range. Now we want to dive into time series by first showing how you can create and plot time series data with only a few lines. In the rest of this section, we will show various ways to access and slice time series data.

It is easy to get started with time series data in Pandas. A random walk can be created and plotted in a few lines:
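A sketch of such a random walk (seeded here for reproducibility; the book's figure will differ):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")                                   # headless backend for scripted use

rng = np.random.RandomState(42)
index = pd.date_range('2000-01-01', periods=1000)       # defaults to daily frequency
walk = pd.Series(rng.standard_normal(1000).cumsum(),    # cumulative sum of noise = random walk
                 index=index)
ax = walk.plot()                                        # Pandas plots straight from the Series
```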

A possible output of this plot is shown in the following figure:

Working with date and time objects

Just as with usual series objects, you can select parts and slice the index:

We can use date strings as keys, even though our series has a DatetimeIndex:

Even though the DatetimeIndex is made of timestamp objects, we can use datetime objects as keys as well:

Access is similar to lookup in dictionaries or lists, but more powerful. We can, for example, slice with strings or even mixed objects:

It is even possible to use partial strings to select groups of entries. If we are only interested in February, we could simply write:

To see all entries from March until May, inclusive:
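A sketch of the lookups described above, over a stand-in daily series for 2015:

```python
import numpy as np
import pandas as pd
from datetime import datetime

ts = pd.Series(np.arange(365), index=pd.date_range('2015-01-01', periods=365))

february = ts['2015-02']                 # partial string: every entry in February
spring = ts['2015-03':'2015-05']         # string slice: March until May, inclusive
by_datetime = ts[datetime(2015, 1, 2)]   # datetime objects work as keys too
```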

Time series can be shifted forward or backward in time. The index stays in place, the values move:

To shift backwards in time, we simply use negative values:
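A sketch of shifting in both directions; the index stays put while the values move, leaving NaN at the vacated positions:

```python
import pandas as pd

ts = pd.Series([1, 2, 3, 4], index=pd.date_range('2015-01-01', periods=4))
forward = ts.shift(1)     # values move one step later; the first slot becomes NaN
backward = ts.shift(-1)   # values move one step earlier; the last slot becomes NaN
```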

Downsampling reduces the number of samples in the data. During this reduction, we are able to apply aggregations over data points. Let's imagine a busy airport with thousands of people passing through every hour. The airport administration has installed a visitor counter in the main area, to get an impression of exactly how busy their airport is.

They are receiving data from the counter device every minute. Here are the hypothetical measurements for a day, beginning at 08:00, ending 600 minutes later at 18:00:

To get a better picture of the day, we can downsample this time series to larger intervals, for example, 10 minutes. We can choose an aggregation function as well. The default aggregation is to take all the values and calculate the mean:

In our airport example, we are also interested in the sum of the values, that is, the combined number of visitors for a given time frame. We can choose the aggregation function by passing a function or a function name to the how parameter:
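A sketch of the airport scenario with simulated counter readings; in current Pandas the aggregation is chosen by calling a method on the resampler, while the book's older API passed it via the how parameter (shown in a comment):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
index = pd.date_range('2015-01-01 08:00', periods=600, freq='min')   # one reading per minute
visitors = pd.Series(rng.randint(0, 100, 600), index=index)          # simulated counter data

mean_10min = visitors.resample('10min').mean()   # the default-style aggregation: the mean
sum_10min = visitors.resample('10min').sum()     # combined visitors per 10 minutes
hourly_sum = visitors.resample('h').sum()        # or per hour
# older Pandas versions: visitors.resample('10min', how='sum')
```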

Or we can reduce the sampling interval even more by resampling to an hourly interval:

We can ask for other things as well. For example, what was the maximum number of people that passed through our airport within one hour:

Or we can define a custom function if we are interested in more unusual metrics. For example, we could be interested in selecting a random sample for each hour:

If you specify a function by string, Pandas uses highly optimized versions.

The built-in functions that can be used as arguments to how are: sum, mean, std, sem, max, min, median, first, last, and ohlc. The ohlc metric is popular in finance. It stands for open-high-low-close. An OHLC chart is a typical way to illustrate movements in the price of a financial instrument over time.

While in our airport this metric might not be that valuable, we can compute it nonetheless:
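A sketch over the same kind of simulated counter data; each hourly bucket yields its opening, highest, lowest, and closing reading:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
visitors = pd.Series(rng.randint(0, 100, 600),
                     index=pd.date_range('2015-01-01 08:00', periods=600, freq='min'))

ohlc = visitors.resample('h').ohlc()      # open, high, low, close per hour
```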

In upsampling, the frequency of the time series is increased. As a result, we have more sample points than data points. One of the main questions is how to account for the entries in the series where we have no measurement.

Let's start with hourly data for a single day:

If we upsample to data points taken every 15 minutes, our time series will be extended with NaN values:

There are various ways to deal with missing values, which can be controlled by the fill_method keyword argument to resample. Values can be filled either forward or backward:

With the limit parameter, it is possible to control the number of missing values to be filled:
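A sketch of upsampling hourly data to 15-minute slots; current Pandas expresses the fill choice as a method on the resampler rather than the fill_method keyword:

```python
import pandas as pd

hourly = pd.Series(range(24), index=pd.date_range('2015-01-01', periods=24, freq='h'))

upsampled = hourly.resample('15min').asfreq()       # new slots appear as NaN
filled = hourly.resample('15min').ffill()           # carry the last known value forward
limited = hourly.resample('15min').ffill(limit=2)   # fill at most two slots per gap
```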

If you want to adjust the labels during resampling, you can use the loffset keyword argument:

There is another way to fill in missing values. We could employ an algorithm to construct new data points that would somehow fit the existing points, for some definition of somehow. This process is called interpolation.

We can ask Pandas to interpolate a time series for us:
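A minimal sketch: two missing hourly values between 0.0 and 3.0 are filled in along a straight line:

```python
import numpy as np
import pandas as pd

ts = pd.Series([0.0, np.nan, np.nan, 3.0],
               index=pd.date_range('2015-01-01', periods=4, freq='h'))
linear = ts.interpolate()      # default: assume a straight line between known points
```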

We saw the default interpolate method – a linear interpolation – in action. Pandas assumes a linear relationship between two existing points.

Pandas supports over a dozen interpolation functions, some of which require the scipy library to be installed. We will not cover interpolation methods in this chapter, but we encourage you to explore the various methods yourself. The right interpolation method will depend on the requirements of your application.

While, by default, Pandas objects are time zone unaware, many real-world applications will make use of time zones. As with working with time in general, time zones are no trivial matter: do you know which countries have daylight saving time and do you know when the time zone is switched in those countries? Thankfully, Pandas builds on the time zone capabilities of two popular and proven utility libraries for time and date handling: pytz and dateutil:

To supply time zone information, you can use the tz keyword argument:

This works for ranges as well:

Time zone objects can also be constructed beforehand:

Sometimes, you will already have a time zone unaware time series object that you would like to make time zone aware. The tz_localize function helps to switch between time zone aware and time zone unaware objects:

To move a time zone aware object to other time zones, you can use the tz_convert method:

Finally, to detach any time zone information from an object, it is possible to pass None to either tz_convert or tz_localize:
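A sketch of the whole round trip described in this section, with stand-in dates and zones:

```python
import pandas as pd

naive = pd.Timestamp('2015-06-01 12:00')
utc = naive.tz_localize('UTC')             # attach a zone to a naive timestamp
eastern = utc.tz_convert('US/Eastern')     # same instant, different wall-clock time
detached = eastern.tz_localize(None)       # drop the zone information again

rng = pd.date_range('2015-06-01', periods=3, tz='Europe/Berlin')   # tz keyword on ranges
```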

Along with the powerful Timestamp object, which acts as a building block for the DatetimeIndex, there is another useful data structure that was introduced in Pandas 0.15 – the Timedelta. The Timedelta can serve as a basis for indices as well, in this case a TimedeltaIndex.

Timedeltas are differences in times, expressed in different units. The Timedelta class in Pandas is a subclass of datetime.timedelta from the Python standard library. As with other Pandas data structures, the Timedelta can be constructed from a variety of inputs:

As you would expect, Timedeltas allow basic arithmetic:

Similar to to_datetime, there is a to_timedelta function that can parse strings or lists of strings into Timedelta structures or TimedeltaIndices:
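A sketch of construction, arithmetic, the subclass relationship, and to_timedelta:

```python
import datetime
import pandas as pd

td = pd.Timedelta('1 days 2 hours')            # constructed from a string
total = td + pd.Timedelta(minutes=30)          # basic arithmetic works as expected
tdi = pd.to_timedelta(['1 days', '2 days'])    # a list of strings -> TimedeltaIndex
is_stdlib_subclass = isinstance(td, datetime.timedelta)   # Timedelta subclasses the stdlib type
```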

Instead of absolute dates, we could create an index of timedeltas. Imagine measurements from a volcano, for example. We might want to take measurements but index it from a given date, for example the date of the last eruption. We could create a timedelta index that has the last seven days as entries:

We could then work with time series data, indexed from the last eruption. If we had measurements for many eruptions (from possibly multiple volcanoes), we would have an index that would make comparisons and analysis of this data easier. For example, we could ask whether there is a typical pattern that occurs between the third day and the fifth day after an eruption. This question would not be impossible to answer with a DatetimeIndex, but a TimedeltaIndex makes this kind of exploration much more convenient.

Pandas comes with great support for plotting, and this holds true for time series data as well.

As a first example, let's take some monthly data and plot it:

Since matplotlib is used under the hood, we can pass a familiar parameter to plot, such as c for color, or title for the chart title:

The following figure shows an example time series plot:

Time series plotting

We can overlay aggregate plots, resampled over 2-year and 5-year intervals:

The following figure shows the resampled 2-year plot:

Time series plotting

The following figure shows the resampled 5-year plot:

Time series plotting

We can pass the kind of chart to the plot method as well. The return value of the plot method is an AxesSubplot, which allows us to customize many aspects of the plot. Here we are setting the label values on the X axis to the year values from our time series:

Time series plotting

Let's imagine we have four time series that we would like to plot simultaneously. We generate a matrix of 1000 × 4 random values and treat each column as a separate time series:

Time series plotting

In this chapter we showed how you can work with time series in Pandas. We introduced two index types, the DatetimeIndex and the TimedeltaIndex and explored their building blocks in depth. Pandas comes with versatile helper functions that take much of the pain out of parsing dates of various formats or generating fixed frequency sequences. Resampling data can help get a more condensed picture of the data, or it can help align various datasets of different frequencies to one another. One of the explicit goals of Pandas is to make it easy to work with missing data, which is also relevant in the context of upsampling.

Finally, we showed how time series can be visualized. Since matplotlib and Pandas are natural companions, we discovered that we can reuse our previous knowledge about matplotlib for time series data as well.

In the next chapter, we will explore ways to load and store data in text files and databases.

Practice exercises

Exercise 1: Find one or two real-world examples for data sets, which could – in a sensible way – be assigned to the following groups:

Create various fixed frequency ranges:

 

Data analysis starts with data. It is therefore beneficial to work with data storage systems that are simple to set up, operate and where the data access does not become a problem in itself. In short, we would like to have database systems that are easy to embed into our data analysis processes and workflows. In this book, we focus mostly on the Python side of the database interaction, and we will learn how to get data into and out of Pandas data structures.

There are numerous ways to store data. In this chapter, we are going to learn to interact with three main categories: text formats, binary formats and databases. We will focus on two storage solutions, MongoDB and Redis. MongoDB is a document-oriented database, which is easy to start with, since we can store JSON documents and do not need to define a schema upfront. Redis is a popular in-memory data structure store on top of which many applications can be built. It is possible to use Redis as a fast key-value store, but Redis supports lists, sets, hashes, bit arrays and even advanced data structures such as HyperLogLog out of the box as well.

Text is a great medium and it's a simple way to exchange information. The following statement is taken from a quote attributed to Doug McIlroy: Write programs to handle text streams, because that is the universal interface.

In this section we will start reading and writing data from and to text files.

Normally, the raw data logs of a system are stored in multiple text files, which can accumulate a large amount of information over time. Thankfully, it is simple to interact with these kinds of files in Python.

Pandas supports a number of functions for reading data from a text file into a DataFrame object. The simplest one is the read_csv() function. Let's start with a small example file:

In the above example file, each column is separated by a comma and the first row is a header row containing the column names. To read the data file into the DataFrame object, we type the following command:

We see that the read_csv function uses a comma as the default delimiter between columns in the text file, and that the first row is automatically used as a header for the columns. If we want to change this behavior, we can use the sep parameter to change the separator symbol, and set header=None in case the example file does not have a header row.

See the below example:
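A sketch of both the default behavior and the sep/header overrides; the inline strings below are stand-ins for the book's example_data files, which are not reproduced here:

```python
import io
import pandas as pd

comma_text = "name,age,city\nalice,30,Hanoi\nbob,25,Hue\n"        # hypothetical file contents
df = pd.read_csv(io.StringIO(comma_text))                          # comma delimiter, first row = header

semi_text = "alice;30;Hanoi\nbob;25;Hue\n"                         # no header, semicolon-delimited
df2 = pd.read_csv(io.StringIO(semi_text), sep=';', header=None)    # custom separator, no header row
```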

We can also set a specific row as the header row by setting header to the index of the selected row. Similarly, when we want to use a column in the data file as the column index of the DataFrame, we set index_col to the name or index of that column. We again use the second data file example_data/ex_06-02.txt to illustrate this:

Apart from these parameters, there are many more useful ones that can help us load data files into Pandas objects more effectively. The following table shows some common parameters:

Parameter        Value                                  Description

dtype            Type name or dict of column types      Sets the data type for data or columns; by default, the parser tries to infer the most appropriate type.

skiprows         List-like or integer                   Line numbers to skip (0-indexed), or the number of lines to skip at the start of the file.

na_values        List-like or dict, default None        Additional values to recognize as NA/NaN; if a dict is passed, this can be set on a per-column basis.

true_values      List                                   A list of values to be converted to Boolean True as well.

false_values     List                                   A list of values to be converted to Boolean False as well.

keep_default_na  Bool, default True                     If the na_values parameter is present and keep_default_na is False, the default NaN values are ignored; otherwise, they are appended to.

thousands        Str, default None                      The thousands separator.

nrows            Int, default None                      Limits the number of rows to read from the file.

error_bad_lines  Boolean, default True                  By default, lines with too many fields raise an error; if set to False, bad lines are dropped and a DataFrame is still returned.

Besides the read_csv() function, we also have some other parsing functions in Pandas:

Function         Description

read_table       Reads a general delimited file into a DataFrame

read_fwf         Reads a table of fixed-width formatted lines into a DataFrame

read_clipboard   Reads text from the clipboard and passes it to read_table; useful for converting tables from web pages

In some situations, we cannot automatically parse data files from the disk using these functions. In that case, we can also open files and iterate through the reader, supported by the CSV module in the standard library:
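A sketch of iterating through the reader; an in-memory buffer stands in for a file opened with open():

```python
import csv
import io

raw = "name,age\nalice,30\nbob,25\n"      # stand-in for a file on disk
rows = []
reader = csv.reader(io.StringIO(raw))     # the reader yields one list of fields per line
for row in reader:
    rows.append(row)
```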

We can read and write binary serialization of Python objects with the pickle module, which can be found in the standard library. Object serialization can be useful, if you work with objects that take a long time to create, like some machine learning models. By pickling such objects, subsequent access to this model can be made faster. It also allows you to distribute Python objects in a standardized way.

Pandas includes support for pickling out of the box. The relevant methods are the read_pickle() and to_pickle() functions to read and write data from and to files easily. Those methods will write data to disk in the pickle format, which is a convenient short-term storage format:
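A sketch of the round trip; the temporary path and the small DataFrame are stand-ins:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
path = os.path.join(tempfile.mkdtemp(), 'frame.pkl')   # hypothetical storage location
df.to_pickle(path)                # serialize the DataFrame to disk
restored = pd.read_pickle(path)   # ...and load it back unchanged
```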

HDF5 is not a database, but a data model and file format. It is suited for write-once, read-many datasets. An HDF5 file includes two kinds of objects: data sets, which are array-like collections of data, and groups, which are folder-like containers that hold data sets and other groups. There are some interfaces for interacting with the HDF5 format in Python, such as h5py, which uses familiar NumPy and Python constructs, such as dictionaries and NumPy array syntax. With h5py, we have a high-level interface to the HDF5 API which helps us to get started. However, in this book, we will introduce another library for this kind of format, called PyTables, which works well with Pandas objects:

We created an empty HDF5 file, named hdf5_store.h5. Now, we can write data to the file just like adding key-value pairs to a dict:

Objects stored in the HDF5 file can be retrieved by specifying the object keys:

Once we have finished interacting with the HDF5 file, we close it to release the file handle:

There are other supported functions that are useful for working with the HDF5 format. You should explore, in more detail, the two libraries – pytables and h5py – if you need to work with huge quantities of data.


Many applications require more robust storage systems than text files, which is why many applications use databases to store data. There are many kinds of databases, but there are two broad categories: relational databases, which support a standard declarative language called SQL, and so-called NoSQL databases, which are often able to work without a predefined schema and where a data instance is more properly described as a document rather than as a row.

MongoDB is a kind of NoSQL database that stores data as documents, which are grouped together in collections. Documents are expressed as JSON objects. It is fast and scalable in storing data, and also flexible in querying. To use MongoDB in Python, we need to import the pymongo package and open a connection to the database by passing a hostname and port. We suppose that we have a MongoDB instance, running on the default host (localhost) and port (27017):

If we do not put any parameters into the pymongo.MongoClient() function, it will automatically use the default host and port.

In the next step, we will interact with databases inside the MongoDB instance. We can list all databases that are available in the instance:

The above snippet says that our MongoDB instance only has one database, named 'local'. If the databases and collections we point to do not exist, MongoDB will create them as necessary:

Each database contains groups of documents, called collections. We can understand them as tables in a relational database. To list all existing collections in a database, we use the collection_names() function:

Our db database does not have any collections yet. Let's create a collection, named person, and insert data from a DataFrame object to it:

The df_ex2 DataFrame is transposed and converted to a JSON string before being loaded into a dictionary. The insert() function receives the dictionary created from df_ex2 and saves it to the collection.

If we want to list all data inside the collection, we can execute the following commands:

If we want to query data from the created collection with some conditions, we can use the find() function and pass in a dictionary describing the documents we want to retrieve. The returned result is a cursor type, which supports the iterator protocol:

Sometimes, we want to delete data in MongoDB. All we need to do is pass a query to the remove() method on the collection:

We learned step by step how to insert, query and delete data in a collection. Now, we will show how to update existing data in a collection in MongoDB:

The following table shows update operators that provide shortcuts to manipulate documents in MongoDB:

Update Operator   Description

$inc              Increments a numeric field

$set              Sets certain fields to new values

$unset            Removes a field from the document

$push             Appends a value onto an array in the document

$pushAll          Appends several values onto an array in the document

$addToSet         Adds a value to an array, only if it does not exist

$pop              Removes the last value of an array

$pull             Removes all occurrences of a value from an array

$pullAll          Removes all occurrences of any set of values from an array

$rename           Renames a field

$bit              Updates a value by a bitwise operation

Redis is an advanced kind of key-value store where the values can be of different types: string, list, set, sorted set, or hash. Like memcached, Redis stores data in memory, but unlike memcached, it can also persist data to disk. Redis supports fast reads and writes, on the order of 100,000 set or get operations per second.

To interact with Redis, we need to install the redis-py module, which is available on PyPI and can be installed with pip:

Now, we can connect to Redis via the host and port of the DB server. We assume that we have already installed a Redis server, which is running with the default host (localhost) and port (6379) parameters:

As a first step to storing data in Redis, we need to define which kind of data structure is suitable for our requirements. In this section, we will introduce four commonly used data structures in Redis: simple value, list, set, and ordered set. Though data can be stored in Redis in many different data structures, each value must be associated with a key.

The ordered set data structure takes an extra attribute, called the score, when we add data to the set. An ordered set uses the score to determine the order of the elements in the set:

By using the zrange(name, start, end) function, we can get a range of values from the sorted set between the start and end indices, sorted in ascending order by default. If we want to change the sorting order, we can set the desc parameter to True. The withscores parameter is used if we want to get the scores along with the return values. The return type is then a list of (value, score) pairs, as you can see in the above example.

See the below table for more functions available on ordered sets:

Function                                                                Description

zcard(name)                                                             Returns the number of elements in the sorted set with key name

zincrby(name, value, amount=1)                                          Increments the score of value in the sorted set with key name by amount

zrangebyscore(name, min, max, withscores=False, start=None, num=None)   Returns a range of values from the sorted set with key name, with a score between min and max; if withscores is true, the scores are returned along with the values; if start and num are given, a slice of the range is returned

zrank(name, value)                                                      Returns a 0-based value indicating the rank of value in the sorted set with key name

zrem(name, values)                                                      Removes member value(s) from the sorted set with key name

The simple value

This is the most basic kind of value in Redis: a single string stored under a key.


List

We have a few methods for interacting with list values in Redis, such as pushing elements onto either end of a list and reading back a range of elements:


Set

This data structure stores an unordered collection of unique values under a single key:


Ordered set

The ordered set data structure takes an extra attribute, called a score, when we add data to the set. An ordered set uses the score to determine the order of its elements:

By using the zrange(name, start, end) function, we can get a range of values from the sorted set between the start and end positions, sorted in ascending score order by default. If we want to reverse the sort order, we can set the desc parameter to True. The withscores parameter is used when we want to get the scores along with the return values; in that case, the return type is a list of (value, score) pairs.

See the below table for more functions available on ordered sets:

Function

Description

zcard(name)

Return the number of elements in the sorted set with key name

zincrby(name, value, amount=1)

Increment the score of value in the sorted set with key name by amount

zrangebyscore(name, min, max, withscores=False, start=None, num=None)

Return a range of values from the sorted set with key name with a score between min and max.

If withscores is true, return the scores along with the values.

If start and num are given, return a slice of the range

zrank(name, value)

Return a 0-based value indicating the rank of value in the sorted set with key name

zrem(name, values)

Remove member value(s) from the sorted set with key name

We finished covering the basics of interacting with data in different commonly used storage mechanisms, from simple ones, such as text files, through more structured ones, such as HDF5, to more sophisticated data storage systems, such as MongoDB and Redis. The most suitable type of storage will depend on your use case. The choice of the data storage layer technology plays an important role in the overall design of data processing systems. Sometimes, we need to combine various database systems to store our data, depending on factors such as the complexity of the data, the performance of the system, or the computation requirements.

Practice exercises

 

In this chapter, we want to get you acquainted with typical data preparation tasks and analysis techniques, because being fluent in preparing, grouping, and reshaping data is an important building block for successful data analysis.

While preparing data seems like a mundane task – and often it is – it is a step we cannot skip, although we can strive to simplify it by using tools such as Pandas.

Why is preparation necessary at all? Because most useful data comes from the real world and will have deficiencies, contain errors, or be fragmentary.

There are more reasons why data preparation is useful: it gets you in close contact with the raw material. Knowing your input helps you to spot potential errors early and build confidence in your results.

Here are a few data preparation scenarios:

The arsenal of tools for data munging is huge, and while we will focus on Python we want to mention some useful tools as well. If they are available on your system and you expect to work a lot with data, they are worth learning.

One group of tools belongs to the UNIX tradition, which emphasizes text processing and as a consequence has, over the last four decades, developed many high-performance and battle-tested tools for dealing with text. Some common tools are: sed, grep, awk, sort, uniq, tr, cut, tail, and head. They do very elementary things, such as filtering out lines (grep) or columns (cut) from files, replacing text (sed, tr) or displaying only parts of files (head, tail).

We want to demonstrate the power of these tools with a single example only.

Imagine you are handed the log files of a web server and you are interested in the distribution of the IP addresses.

Each line of the log file contains an entry in the common log server format (you can download this data set from http://ita.ee.lbl.gov/html/contrib/EPA-HTTP.html):

$ cat epa-http.txt
wpbfl2-45.gate.net [29:23:56:12] "GET /Access/ HTTP/1.0" 200 2376
ebaca.icsi.net [30:00:22:20] "GET /Info.html HTTP/1.0" 200 884

For instance, we want to know how often certain users have visited our site.

We are interested in the first column only, since this is where the IP address or hostname can be found. After that, we need to count the number of occurrences of each host and finally display the results in a friendly way.

The sort | uniq -c stanza is our workhorse here: it sorts the data first and uniq -c will save the number of occurrences along with the value. The sort -nr | head -15 is our formatting part; we sort numerically (-n) and in reverse (-r), and keep only the top 15 entries.

Putting it all together with pipes:

$ cut -d ' ' -f 1 epa-http.txt | sort | uniq -c | sort -nr | head -15
294 sandy.rtptok1.epa.gov
292 e659229.boeing.com
266 wicdgserv.wic.epa.gov
263 keyhole.es.dupont.com
248 dwilson.pr.mcs.net
176 oea4.r8stw56.epa.gov
174 macip26.nacion.co.cr
172 dcimsd23.dcimsd.epa.gov
167 www-b1.proxy.aol.com
158 piweba3y.prodigy.com
152 wictrn13.dcwictrn.epa.gov
151 nntp1.reach.com
151 inetg1.arco.com
149 canto04.nmsu.edu
146 weisman.metrokc.gov

With one command, we can convert a sequential server log into an ordered list of the most frequent hosts that visited our site. We also see that there do not seem to be large differences in the number of visits among our top users.

There are more little helpful tools of which the following are just a tiny selection:

Where the UNIX command line ends, lightweight languages take over. You might be able to get an impression from text only, but your colleagues might appreciate visual representations, such as charts or pretty graphs, generated by matplotlib, much more.

Python and its data tools ecosystem are much more versatile than the command line, but for first explorations and simple operations the effectiveness of the command line is often unbeatable.

Most real-world data will have some defects and therefore will need to go through a cleaning step first. We start with a small file. Although this file contains only four rows, it will allow us to demonstrate the process up to a cleaned data set:

Note that this file has a few issues. The lines that contain values are all comma-separated, but we have missing (NA) and probably unclean (5.3*) values. We can load this file into a data frame, nevertheless:

Pandas used the first row as header, but this is not what we want:

This is better, but instead of numeric values, we would like to supply our own column names:

The age column looks good, since Pandas already inferred the intended type, but the height cannot be parsed into numeric values yet:

If we try to coerce the height column into float values, Pandas will report an exception:

We could use whatever value is parseable as a float and throw away the rest with the convert_objects method:

If we know in advance the undesirable characters in our data set, we can augment the read_csv method with a custom converter function:
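A sketch of such a converter, using illustrative stand-in data with the same defects described above (missing NA values and a stray trailing '*'); the column names and values are made up:

```python
import io
import pandas as pd

# Stand-in for the small file from the text.
raw = io.StringIO(
    "Anna,29,1.68\n"
    "Bob,34,NA\n"
    "Charlie,41,5.3*\n"
    "Dora,25,1.71\n")

def clean_height(value):
    # Drop a trailing '*' and turn 'NA' into a missing value.
    value = value.rstrip("*")
    return float(value) if value != "NA" else float("nan")

df = pd.read_csv(raw, header=None, names=["name", "age", "height"],
                 converters={"height": clean_height})
print(df)
```

The converter runs on the raw string of each field before type inference, so the height column comes out numeric in one pass.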

Now we can finally make the height column a bit more useful. We can assign it the updated version, which has the favored type:

If we wanted to only keep the complete entries, we could drop any row that contains undefined values:

We could use a default height, maybe a fixed value:

On the other hand, we could also use the average of the existing values:
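The three strategies (drop, fixed default, column mean) can be sketched as follows; the frame below is an illustrative stand-in for the cleaned data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Anna", "Bob", "Charlie", "Dora"],
                   "age": [29, 34, 41, 25],
                   "height": [1.68, np.nan, 5.3, 1.71]})

complete = df.dropna()                                  # keep complete rows only
fixed = df.fillna({"height": 1.70})                     # a fixed default height
averaged = df.fillna({"height": df["height"].mean()})   # the mean of known values

print(len(complete), fixed.loc[1, "height"], averaged.loc[1, "height"])
```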

The last three data frames are complete and correct, depending on your definition of correct when dealing with missing values. In particular, the columns have the requested types and are ready for further analysis. Which of the data frames is best suited will depend on the task at hand.

Even if we have clean and probably correct data, we might want to use only parts of it or we might want to check for outliers. An outlier is an observation point that is distant from other observations because of variability or measurement errors. In both cases, we want to reduce the number of elements in our data set to make it more relevant for further processing.

In this example, we will try to find potential outliers. We will use the Europe Brent Crude Oil Spot Price as recorded by the U.S. Energy Information Administration. The raw Excel data is available from http://www.eia.gov/dnav/pet/hist_xls/rbrted.xls (it can be found in the second worksheet). We cleaned the data slightly (the cleaning process is part of an exercise at the end of this chapter) and will work with the following data frame, containing 7160 entries, ranging from 1987 to 2015:

>>> df.head()
        date  price
0 1987-05-20  18.63
1 1987-05-21  18.45
2 1987-05-22  18.55
3 1987-05-25  18.60
4 1987-05-26  18.63
>>> df.tail()
           date  price
7155 2015-08-04  49.08
7156 2015-08-05  49.04
7157 2015-08-06  47.80
7158 2015-08-07  47.54
7159 2015-08-10  48.30

While many people know about oil prices – be it from the news or the filling station – let us forget anything we know about it for a minute. We could first ask for the extremes:

>>> df[df.price==df.price.min()]
           date  price
2937 1998-12-10    9.1
>>> df[df.price==df.price.max()]
           date   price
5373 2008-07-03  143.95

Another way to find potential outliers would be to ask for values that deviate most from the mean. We can use the np.abs function to calculate the deviation from the mean first:

>>> np.abs(df.price - df.price.mean())
0       26.17137
1       26.35137
...
7157     2.99863
7158     2.73863  
7159     3.49863

We can now compare this deviation against a multiple (we choose 2.5) of the standard deviation:

>>> import numpy as np
>>> df[np.abs(df.price - df.price.mean()) > 2.5 * df.price.std()]
       date   price
5354 2008-06-06  132.81
5355 2008-06-09  134.43
5356 2008-06-10  135.24
5357 2008-06-11  134.52
5358 2008-06-12  132.11
5359 2008-06-13  134.29
5360 2008-06-16  133.90
5361 2008-06-17  131.27
5363 2008-06-19  131.84
5364 2008-06-20  134.28
5365 2008-06-23  134.54
5366 2008-06-24  135.37
5367 2008-06-25  131.59
5368 2008-06-26  136.82
5369 2008-06-27  139.38
5370 2008-06-30  138.40
5371 2008-07-01  140.67
5372 2008-07-02  141.24
5373 2008-07-03  143.95
5374 2008-07-07  139.62
5375 2008-07-08  134.15
5376 2008-07-09  133.91
5377 2008-07-10  135.81
5378 2008-07-11  143.68
5379 2008-07-14  142.43
5380 2008-07-15  136.02
5381 2008-07-16  133.31
5382 2008-07-17  134.16

We see that those few days in summer 2008 must have been special. Sure enough, it is not difficult to find articles and essays with titles like Causes and Consequences of the Oil Shock of 2007–08. We have discovered a trace to these events solely by looking at the data.

We could ask the above question for each decade separately. We first make our data frame look more like a time series:

We could filter out the eighties:

We observe that within the data available (1987–1989), the fall of 1988 exhibits a slight spike in the oil prices. Similarly, during the nineties, we see that we have a larger deviation, in the fall of 1990:
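The date filtering can be sketched like this; the tiny frame below stands in for the full oil price data, and the 1988 and 1990 prices are illustrative (the 1998 minimum and 2008 maximum are taken from the output above):

```python
import pandas as pd

# Stand-in for the oil price frame.
df = pd.DataFrame({
    "date": pd.to_datetime(["1988-10-03", "1990-10-01",
                            "1998-12-10", "2008-07-03"]),
    "price": [13.1, 40.9, 9.1, 143.95]})

# Make the frame look like a time series ...
ts = df.set_index("date")["price"]

# ... then filter out the eighties with partial string slicing.
eighties = ts.loc["1987":"1989"]
print(eighties)
```

Partial string slicing on a DatetimeIndex makes per-decade (or per-year) filtering a one-liner.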

There are many more use cases for filtering data. Space and time are typical units: you might want to filter census data by state or city, or economical data by quarter. The possibilities are endless and will be driven by your project.

The situation is common: you have multiple data sources, but in order to make statements about the content, you would rather combine them. Fortunately, Pandas' concatenation and merge functions abstract away most of the pain, when combining, joining, or aligning data. It does so in a highly optimized manner as well.

In a case where two data frames have a similar shape, it might be useful to just append one after the other. Maybe A and B are products and one data frame contains the number of items sold per product in a store:

Sometimes, we won't care about the indices of the originating data frames:

A more flexible way to combine objects is offered by the pd.concat function, which takes an arbitrary number of series, data frames, or panels as input. The default behavior resembles an append:

The default concat operation appends both frames along the rows – or index, which corresponds to axis 0. To concatenate along the columns, we can pass in the axis keyword argument:

We can add keys to create a hierarchical index.

This can be useful if you want to refer back to parts of the data frame later. We use the ix indexer:
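A sketch of appending, concatenating with keys, and referring back to a part of the result; the data is made up, and we use the loc indexer here, which is the modern replacement for the deprecated ix:

```python
import pandas as pd

# Items sold per product (A, B) in two stores.
store_a = pd.DataFrame({"A": [10, 12], "B": [7, 9]})
store_b = pd.DataFrame({"A": [4, 5], "B": [11, 3]})

# Append-like concat along the rows; ignore_index renumbers the result.
combined = pd.concat([store_a, store_b], ignore_index=True)

# keys build a hierarchical index, so the parts stay addressable later.
keyed = pd.concat([store_a, store_b], keys=["a", "b"])
print(keyed.loc["a"])  # the rows that came from store_a
```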

Data frames resemble database tables. It is therefore not surprising that Pandas implements SQL-like join operations on them. What is positively surprising is that these operations are highly optimized and extremely fast:

If we merge on key, we get an inner join. This creates a new data frame by combining the column values of the original data frames based upon the join predicate; here, the key attribute is used:

A left, right, or full join can be specified by the how parameter:

The following table shows the merge methods in comparison with SQL:

Merge Method

SQL Join Name

Description

left

LEFT OUTER JOIN

Use keys from the left frame only.

right

RIGHT OUTER JOIN

Use keys from the right frame only.

outer

FULL OUTER JOIN

Use a union of keys from both frames.

inner

INNER JOIN

Use an intersection of keys from both frames.
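The join variants in the table can be sketched with two small made-up frames:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [4, 5, 6]})

inner = pd.merge(left, right, on="key")               # intersection of keys: b, c
outer = pd.merge(left, right, on="key", how="outer")  # union of keys: a, b, c, d
lefty = pd.merge(left, right, on="key", how="left")   # keys from the left frame

print(inner)
print(len(outer), len(lefty))
```

Keys without a partner in the other frame produce NaN in the columns that come from that frame.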

We saw how to combine data frames, but sometimes we have all the right data in a single data structure, yet the format is impractical for certain tasks. We start again with some artificial weather data:

If we want to calculate the maximum temperature per city, we could just group the data by city and then take the max function:

However, if we have to bring our data into this form every time, we can be a little more efficient by first creating a reshaped data frame, with the dates as the index and the cities as the columns.

We can create such a data frame with the pivot function. The arguments are the index (we use date), the columns (we use the cities), and the values (which are stored in the value column of the original data frame):

We can use the max function on this new data frame directly:

With a more suitable shape, other operations become easier as well. For example, to find the maximum temperature per day, we can simply provide an additional axis argument:
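A sketch of the whole pivot-then-aggregate step, with made-up temperature readings:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2015-06-01", "2015-06-01", "2015-06-02", "2015-06-02"],
    "city": ["Berlin", "Hamburg", "Berlin", "Hamburg"],
    "value": [19, 17, 21, 16]})

# Dates become the index, cities become the columns.
pivoted = df.pivot(index="date", columns="city", values="value")

print(pivoted.max())         # maximum temperature per city
print(pivoted.max(axis=1))   # maximum temperature per day
```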

As a final topic, we will look at ways to get a condensed view of data with aggregations. Pandas comes with a lot of aggregation functions built-in. We already saw the describe function in Chapter 3, Data Analysis with Pandas. This works on parts of the data as well. We start with some artificial data again, containing measurements about the number of sunshine hours per city and date:

>>> df.head()
   country     city        date  hours
0  Germany  Hamburg  2015-06-01      8
1  Germany  Hamburg  2015-06-02     10
2  Germany  Hamburg  2015-06-03      9
3  Germany  Hamburg  2015-06-04      7
4  Germany  Hamburg  2015-06-05      3

To view a summary per city, we use the describe function on the grouped data set:

>>> df.groupby("city").describe()
                      hours
city
Berlin     count  10.000000
           mean    6.000000
           std     3.741657
           min     0.000000
           25%     4.000000
           50%     6.000000
           75%     9.750000
           max    10.000000
Birmingham count  10.000000
           mean    5.100000
           std     2.078995
           min     2.000000
           25%     4.000000
           50%     5.500000
           75%     6.750000
           max     8.000000

On certain data sets, it can be useful to group by more than one attribute. We can get an overview about the sunny hours per country and date by passing in two column names:

>>> df.groupby(["country", "date"]).describe()
                       hours
country date
France  2015-06-01 count  5.000000
                   mean   6.200000
                   std    1.095445
                   min    5.000000
                   25%    5.000000
                   50%    7.000000
                   75%    7.000000
                   max    7.000000
        2015-06-02 count  5.000000
                   mean   3.600000
                   std    3.577709
                   min    0.000000
                   25%    0.000000
                   50%    4.000000
                   75%    6.000000
                   max    8.000000
...
UK      2015-06-07 std    3.872983
                   min    0.000000
                   25%    2.000000
                   50%    6.000000
                   75%    8.000000
                   max    9.000000

We can compute single statistics as well:

Finally, we can define any function to be applied on the groups with the agg method. The above could have been written in terms of agg like this:

But arbitrary functions are possible. As a last example, we define a custom function, which takes a series object as input and computes the difference between the smallest and the largest element:
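Such a custom aggregation can be sketched as follows; the data is made up, and the function name is ours:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Berlin", "Hamburg", "Hamburg"],
                   "hours": [2, 10, 8, 3]})

def span(series):
    # Difference between the largest and the smallest element.
    return series.max() - series.min()

spans = df.groupby("city")["hours"].agg(span)
print(spans)
```

agg accepts any callable that maps a series to a scalar, so built-ins like "mean" and custom functions can be mixed freely.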

One typical workflow during data exploration looks as follows:

In this chapter we have looked at ways to manipulate data frames, from cleaning and filtering, to grouping, aggregation, and reshaping. Pandas makes a lot of the common operations very easy and more complex operations, such as pivoting or grouping by multiple attributes, can often be expressed as one-liners as well. Cleaning and preparing data is an essential part of data exploration and analysis.

The next chapter gives a brief overview of machine learning algorithms, which apply data analysis results to make decisions or build helpful products.

Practice exercises

Exercise 1: Cleaning: In the section about filtering, we used the Europe Brent Crude Oil Spot Price, which can be found as an Excel document on the internet. Take this Excel spreadsheet and try to convert it into a CSV document that is ready to be imported with Pandas.

Hint: There are many ways to do this. We used a small tool called xls2csv.py and we were able to load the resulting CSV file with a helper method:

Take a data set that is important for your work – or if you do not have any at hand, a data set that interests you and that is available online. Ask one or two questions about the data in advance. Then use cleaning, filtering, grouping, and plotting techniques to answer your question.

 

In the previous chapter, we saw how to perform data munging, data aggregation, and grouping. In this chapter, we will briefly look at the different scikit-learn modules and their models, examine data representation in scikit-learn, understand supervised and unsupervised learning using examples, and measure prediction performance.

Machine learning is a subfield of artificial intelligence that explores how machines can learn from data to analyze structures, help with decisions, and make predictions. In 1959, Arthur Samuel defined machine learning as the "field of study that gives computers the ability to learn without being explicitly programmed."

A wide range of applications employ machine learning methods, such as spam filtering, optical character recognition, computer vision, speech recognition, credit approval, search engines, and recommendation systems.

One important driver for machine learning is the fact that data is generated at an increasing pace across all sectors, be it web traffic, text, images, sensor data, or scientific datasets. The larger amounts of data give rise to many new challenges in storage and processing systems. On the other hand, many learning algorithms yield better results with more data to learn from. The field has received a lot of attention in recent years due to significant performance increases in various hard tasks, such as speech recognition or object detection in images. Understanding large amounts of data without the help of intelligent algorithms seems unpromising.

A learning problem typically uses a set of samples (usually denoted with an N or n) to build a model, which is then validated and used to predict the properties of unseen data.

Each sample might consist of single or multiple values. In the context of machine learning, the properties of data are called features.

Machine learning problems can be categorized by the nature of the input data:

In supervised learning, the input data (typically denoted with x) is associated with a target label (y), whereas in unsupervised learning, we only have unlabeled input data.

Supervised learning can be further broken down into the following problems:

Classification problems have a fixed set of target labels, classes, or categories, while regression problems have one or more continuous output variables. Classifying e-mail messages as spam or not spam is a classification task with two target labels. Predicting house prices—given the data about houses, such as size, age, and nitric oxides concentration—is a regression task, since the price is continuous.

Unsupervised learning deals with datasets that do not carry labels. A typical case is clustering or automatic classification. The goal is to group similar items together. What similarity means will depend on the context, and there are many similarity metrics that can be employed in such a task.

The scikit-learn library is organized into submodules. Each submodule contains algorithms and helper methods for a certain class of machine learning models and approaches.

Here is a sample of those submodules, including some example models:

Submodule

Description

Example models

cluster

This is the unsupervised clustering

KMeans and Ward

decomposition

This is the dimensionality reduction

PCA and NMF

ensemble

This involves ensemble-based methods

AdaBoostClassifier,

AdaBoostRegressor,

RandomForestClassifier,

RandomForestRegressor

lda

This stands for linear discriminant analysis

LDA

linear_model

This is the generalized linear model

LinearRegression, LogisticRegression,

Lasso and Perceptron

mixture

This is the mixture model

GMM and VBGMM

naive_bayes

This involves supervised learning based on Bayes' theorem

BaseNB and BernoulliNB, GaussianNB

neighbors

These are k-nearest neighbors

KNeighborsClassifier, KNeighborsRegressor,

LSHForest

neural_network

This involves models based on neural networks

BernoulliRBM

tree

These are decision trees

DecisionTreeClassifier, DecisionTreeRegressor

While these approaches are diverse, the scikit-learn library abstracts away a lot of the differences by exposing a regular interface to most of these algorithms. All of the example algorithms listed in the table implement a fit method, and most of them implement predict as well. These methods represent two phases in machine learning: first, the model is trained on the existing data with the fit method; once trained, the model can be used to predict the class or value of unseen data with predict. We will see both methods at work in the next sections.

The scikit-learn library is part of the PyData ecosystem. Its codebase has seen steady growth over the past six years, and with over a hundred contributors, it is one of the most active and popular of the scikit toolkits.

In contrast to the heterogeneous domains and applications of machine learning, the data representation in scikit-learn is less diverse, and the basic format that many algorithms expect is straightforward—a matrix of samples and features.

The underlying data structure is a NumPy ndarray. Each row in the matrix corresponds to one sample and each column to the value of one feature.

There is something like a Hello World in the world of machine learning datasets as well: the Iris dataset, whose origins date back to 1936. With the standard installation of scikit-learn, you already have access to a couple of datasets, including Iris, which consists of 150 samples, each with four measurements, taken from three different Iris flower species:

The dataset is packaged as a bunch, which is only a thin wrapper around a dictionary:

Under the data key, we can find the matrix of samples and features, and can confirm its shape:

Each entry in the data matrix has been labeled, and these labels can be looked up in the target attribute:

The target names are encoded. We can look up the corresponding names in the target_names attribute:

This is the basic anatomy of many datasets: example data, target values, and target names.

What are the features of a single entry in this dataset?

The four features are the measurements taken of real flowers: their sepal length and width, and petal length and width. Three different species have been examined: the Iris-Setosa, Iris-Versicolour, and Iris-Virginica.
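A short sketch of loading the dataset and inspecting its anatomy:

```python
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)      # 150 samples, 4 features each
print(iris.target_names)    # the three species
print(iris.feature_names)   # sepal/petal length and width
print(iris.data[0], iris.target[0])  # one sample and its encoded label
```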

Machine learning tries to answer the following question: can we predict the species of the flower, given only the measurements of its sepal and petal length?

In the next section, we will see how to answer this question with scikit-learn.

Besides the data about flowers, there are a few other datasets included in the scikit-learn distribution, as follows:

A few datasets are not included, but they can easily be fetched on demand (as these are usually a bit bigger). Among these datasets, you can find a real estate dataset and a news corpus:

These datasets are a great way to get started with the scikit-learn library, and they will also help you to test your own algorithms. Finally, scikit-learn includes functions (prefixed with datasets.make_) to create artificial datasets as well.

If you work with your own datasets, you will have to bring them into a shape that scikit-learn expects, which can be a task of its own. Tools such as Pandas make this task much easier, and Pandas DataFrames can be exported to numpy.ndarray easily with the as_matrix() method on DataFrame.

In this section, we will show short examples for both classification and regression.

Classification problems are pervasive: document categorization, fraud detection, market segmentation in business intelligence, and protein function prediction in bioinformatics.

While it might be possible to hand-craft rules that assign a category or label to new data, it is faster to use algorithms to learn and generalize from the existing data.

We will continue with the Iris dataset. Before we apply a learning algorithm, we want to get an intuition of the data by looking at some values and plots.

All measurements share the same dimension, which helps to visualize the variance in various boxplots:


We see that the petal length (the third feature) exhibits the biggest variance, which could indicate the importance of this feature during classification. It is also insightful to plot the data points in two dimensions, using one feature for each axis. Indeed, this reinforces our previous observation that the petal length might be a good indicator for telling apart the various species. The Iris setosa also seems to be more easily separable than the other two species:


From the visualizations, we get an intuition of the solution to our problem. We will use a supervised method called a Support Vector Machine (SVM) to learn about a classifier for the Iris data. The API separates models and data, therefore, the first step is to instantiate the model. In this case, we pass an optional keyword parameter to be able to query the model for probabilities later:

The next step is to fit the model according to our training data:

With this one line, we have trained our first machine learning model on a dataset. This model can now be used to predict the species of unknown data. If given some measurement that we have never seen before, we can use the predict method on the model:

We see that the classifier has given the versicolor label to the measurement. If we visualize the unknown point in our plots, we see that this seems like a sensible prediction:

[Figure: scatter plot with the unknown point marked among the labeled species]

In fact, the classifier is relatively sure about this label, which we can check with the predict_proba method on the classifier:
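A sketch of this query, reusing the same made-up measurement as before:

```python
from sklearn import svm
from sklearn.datasets import load_iris

iris = load_iris()
clf = svm.SVC(probability=True).fit(iris.data, iris.target)

unseen = [[5.9, 3.0, 4.2, 1.5]]  # made-up measurement
proba = clf.predict_proba(unseen)
print(proba)  # one probability per class; each row sums to 1
```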

Our example consisted of four features, but many problems deal with higher-dimensional datasets and many algorithms work fine on these datasets as well.

We want to show another algorithm for supervised learning problems: linear regression. In linear regression, we try to predict one or more continuous output variables, called regressands, given a D-dimensional input vector. Regression means that the output is continuous. It is called linear because the output is modeled as a linear function of the parameters.

We first create a sample dataset as follows:
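The book's exact dataset is not reproduced here; a comparable artificial dataset can be generated like this (the slope of 2 and intercept of 1 are chosen arbitrarily):

```python
import numpy as np

rng = np.random.RandomState(42)  # fixed seed for reproducibility
x = rng.uniform(0, 10, size=50)
X = x.reshape(-1, 1)             # scikit-learn expects a 2-D input
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)  # noisy line
```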

Given this data, we want to learn a linear function that approximates the data and minimizes the prediction error, which is defined as the sum of squares between the observed and predicted responses:

Many models will learn parameters during training. These parameters are marked with a single underscore at the end of the attribute name. In this model, the coef_ attribute will hold the estimated coefficients for the linear regression problem:
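On the artificial dataset sketched above, fitting and inspecting the coefficients might look like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
x = rng.uniform(0, 10, size=50)
X, y = x.reshape(-1, 1), 2.0 * x + 1.0 + rng.normal(size=50)

model = LinearRegression()
model.fit(X, y)
# The estimates should be close to the true slope 2 and intercept 1
print(model.coef_, model.intercept_)
```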

We can plot the prediction over our data as well:
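A sketch of such a plot, again on the artificial dataset from above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
x = rng.uniform(0, 10, size=50)
X, y = x.reshape(-1, 1), 2.0 * x + 1.0 + rng.normal(size=50)
model = LinearRegression().fit(X, y)

# Evaluate the learned line on a regular grid and draw it over the data
grid = np.linspace(0, 10, 100).reshape(-1, 1)
plt.scatter(X, y, label="data")
plt.plot(grid, model.predict(grid), color="red", label="prediction")
plt.legend()
plt.savefig("regression.png")
```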

The output of the plot is as follows:

[Figure: linear regression line plotted over the sample data]

The above graph is a simple example with artificial data, but linear regression has a wide range of applications. Given the characteristics of real estate objects, we can learn to predict prices. Given the features of galaxies, such as size, color, or brightness, it is possible to predict their distance. Given data about household income and the education level of parents, we can say something about the grades of their children.

Linear regression has numerous applications wherever one or more independent variables might be connected to one or more dependent variables.

A lot of existing data is not labeled. With unsupervised models, it is still possible to learn from data without labels. A typical task during exploratory data analysis is to find related items or clusters. We can imagine the Iris dataset, but without the labels:

[Figure: scatter plot of the Iris measurements without species labels]

While the task seems much harder without labels, one group of measurements (in the lower-left) seems to stand apart. The goal of clustering algorithms is to identify these groups.

We will use K-Means clustering on the Iris dataset (without the labels). This algorithm expects the number of clusters to be specified in advance, which can be a disadvantage. K-Means will try to partition the dataset into groups, by minimizing the within-cluster sum of squares.

For example, we instantiate the KMeans model with n_clusters equal to 3:
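A sketch of this step (n_init and random_state are set explicitly here for reproducible results; they are not required):

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=0)
```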

Similar to supervised algorithms, we can use the fit methods to train the model, but we only pass the data and not target labels:

We already saw attributes ending with an underscore. In this case, the algorithm assigned a label to the training data, which can be inspected with the labels_ attribute:
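Fitting on the data alone and inspecting the assigned labels might look like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(iris.data)  # only the data; no target labels are passed

print(km.labels_[:10])  # a cluster index (0, 1, or 2) for each sample
```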

We can now compare the result of the algorithm with our known target labels:

We quickly relabel the result to simplify the prediction error calculation:
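One way to do the relabeling is to map each cluster to the most common true species within it; this is a sketch of that idea, and the book's exact mapping may differ:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# Map each cluster index to the majority true label inside that cluster
mapping = {c: np.bincount(iris.target[km.labels_ == c]).argmax()
           for c in range(3)}
relabeled = np.array([mapping[c] for c in km.labels_])
accuracy = (relabeled == iris.target).mean()
print(accuracy)
```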

From 150 samples, K-Means assigned the correct label to 134 samples, which is an accuracy of about 90 percent. The following plot shows the points the algorithm predicted correctly in grey and the mislabeled points in red:

[Figure: K-Means result with correctly clustered points in grey and mislabeled points in red]

As another example of an unsupervised algorithm, we will take a look at Principal Component Analysis (PCA). PCA aims to find the directions of maximum variance in high-dimensional data. One goal is to reduce the number of dimensions by projecting a higher-dimensional space onto a lower-dimensional subspace while keeping most of the information.

The problem appears in various fields. You have collected many samples and each sample consists of hundreds or thousands of features. Not all the properties of the phenomenon at hand will be equally important. In our Iris dataset, we saw that the petal length alone seemed to be a good discriminator of the various species. PCA aims to find principal components that explain most of the variation in the data. If we sort our components accordingly (technically, we sort the eigenvectors of the covariance matrix by eigenvalue), we can keep the ones that explain most of the data and ignore the remaining ones, thereby reducing the dimensionality of the data.

It is simple to run PCA with scikit-learn. We will not go into the implementation details, but instead try to give you an intuition of PCA by running it on the Iris dataset, in order to give you yet another angle.

The process is similar to the ones we have implemented so far. First, we instantiate our model; this time, the PCA from the decomposition submodule. We also import a standardization method, called StandardScaler, which will remove the mean from our data and scale it to unit variance. This step is a common requirement for many machine learning algorithms:

We instantiate the model with a parameter that specifies the number of dimensions to reduce to, standardize our input, and run the fit_transform function, which takes care of the mechanics of PCA:
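A sketch of the whole procedure on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)  # zero mean, unit variance
pca = PCA(n_components=2)                          # keep two components
X_2d = pca.fit_transform(X_std)
print(X_2d.shape)  # (150, 2)
```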

The result is a dimensionality reduction in the Iris dataset from four (sepal and petal width and length) to two dimensions. It is important to note that this projection is not onto the two existing dimensions, so our new dataset does not consist of, for example, only petal length and width. Instead, the two new dimensions will represent a mixture of the existing features.

The following scatter plot shows the transformed dataset; from a glance at the plot, it looks like we still kept the essence of our dataset, even though we halved the number of dimensions:

[Figure: Iris dataset projected onto the first two principal components]

Dimensionality reduction is just one way to deal with high-dimensional datasets, which are sometimes affected by the so-called curse of dimensionality.

We have already seen that the machine learning process consists of the following steps:

  • Bringing the data into a shape the estimator expects
  • Instantiating a model
  • Training the model on the data
  • Applying the trained model to unseen data

So far, we have omitted an important step that takes place between training and application: model testing and validation. In this step, we want to evaluate how well our model has learned.

One goal of learning, and machine learning in particular, is generalization. The question of whether a limited set of observations is enough to make statements about any possible observation is a deeper theoretical question, which is answered in dedicated resources on machine learning.

Whether or not a model generalizes well can also be tested. However, it is important that the training and the test input are separate. The situation where a model performs well on a training input but fails on an unseen test input is called overfitting, and this is not uncommon.

The basic approach is to split the available data into a training and test set, and scikit-learn helps to create this split with the train_test_split function.

We go back to the Iris dataset and train an SVC again. This time, we will evaluate the performance of the algorithm on a held-out test set. We set aside 40 percent of the data for testing:
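In current scikit-learn versions, train_test_split lives in sklearn.model_selection (older releases had it in sklearn.cross_validation); a sketch:

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

clf = svm.SVC().fit(X_train, y_train)  # train on the remaining 60 percent
```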

The score function returns the mean accuracy on the given data and labels. We pass the test set for evaluation:
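Continuing the split-and-train sketch from above:

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
clf = svm.SVC().fit(X_train, y_train)

score = clf.score(X_test, y_test)  # mean accuracy on the held-out data
print(score)
```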

The model seems to perform well, with about 94 percent accuracy on unseen data. We can now start to tweak the model parameters (also called hyperparameters) to increase prediction performance. However, tuning against the test set brings back the problem of overfitting. One solution is to split the input data into three sets: one each for training, validation, and testing. The iterative tuning of hyperparameters would take place between the training and validation sets, while the final evaluation would be done on the test set. Splitting the dataset into three, however, reduces the number of samples we can learn from.

Cross-validation (CV) is a technique that does not need a validation set, but still counteracts overfitting. The dataset is split into k parts (called folds). For each fold, the model is trained on the other k-1 folds and tested on the remaining fold. The accuracy is taken as the average over the folds.

We will show a five-fold cross-validation on the Iris dataset, using SVC again:
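A sketch using cross_val_score from sklearn.model_selection (older scikit-learn releases exposed it under sklearn.cross_validation):

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

iris = load_iris()
scores = cross_val_score(svm.SVC(), iris.data, iris.target, cv=5)
print(scores, scores.mean())  # one accuracy per fold, plus the average
```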

There are various strategies implemented by different classes to split the dataset for cross-validation: KFold, StratifiedKFold, LeaveOneOut, LeavePOut, LeaveOneLabelOut, LeavePLabelOut, ShuffleSplit, StratifiedShuffleSplit, and PredefinedSplit.

Model verification is an important step and it is necessary for the development of robust machine learning solutions.

In this chapter, we took a whirlwind tour through one of the most popular Python machine learning libraries: scikit-learn. We saw what kind of data this library expects. Real-world data will seldom be ready to be fed into an estimator right away. With powerful libraries, such as NumPy and, especially, Pandas, you have already seen how data can be retrieved, combined, and brought into shape. Visualization libraries, such as matplotlib, help along the way to build an intuition of the datasets, problems, and solutions.

During this chapter, we looked at a canonical dataset, the Iris dataset. We also looked at it from various angles: as a problem in supervised and unsupervised learning and as an example for model verification.

In total, we have looked at four different algorithms: the Support Vector Machine, Linear Regression, K-Means clustering, and Principal Component Analysis. Each of these alone is worth exploring, and we barely scratched the surface, although we were able to apply all of the algorithms with only a few lines of Python.

There are numerous ways in which you can take your knowledge of the data analysis process further. Hundreds of books have been published on machine learning, so we only want to highlight a few here: Building Machine Learning Systems with Python by Richert and Coelho goes much deeper into scikit-learn than we could in this chapter. Learning from Data by Abu-Mostafa, Magdon-Ismail, and Lin is a great resource for a solid theoretical foundation of learning problems in general.

The most interesting applications will be found in your own field. However, if you would like to get some inspiration, we recommend that you look at the www.kaggle.com website that runs predictive modeling and analytics competitions, which are both fun and insightful.

Practice exercises

Are the following problems supervised or unsupervised? Regression or classification problems?

  • Recognizing coins inside a vending machine
  • Recognizing handwritten digits
  • Given a number of facts about people and the economy, estimating consumer spending
  • Given data about geography, politics, and historical events, predicting when and where a human rights violation will eventually take place
  • Given the sounds of whales and their species, labeling yet unlabeled whale sound recordings

Look up one of the first machine learning models and algorithms: the perceptron. Try the perceptron on the Iris dataset and estimate the accuracy of the model. How does the perceptron compare to the SVC from this chapter?

About the Authors
  • Ivan Idris

    Ivan Idris has an MSc in experimental physics. His graduation thesis had a strong emphasis on applied computer science. After graduating, he worked for several companies as a Java developer, data warehouse developer, and QA analyst. His main professional interests are business intelligence, big data, and cloud computing. He enjoys writing clean, testable code and interesting technical articles. He is the author of NumPy 1.5 Beginner's Guide and NumPy Cookbook, published by Packt Publishing.

  • Luiz Felipe Martins

    Luiz Felipe Martins holds a PhD in applied mathematics from Brown University and has worked as a researcher and educator for more than 20 years. His research is mainly in the field of applied probability. He has been involved in developing code for the open source homework system, WeBWorK, where he wrote a library for the visualization of systems of differential equations. He was supported by an NSF grant for this project. Currently, he is an Associate Professor in the Department of Mathematics at Cleveland State University, Cleveland, Ohio, where he has developed several courses in applied mathematics and scientific computing. His current duties include coordinating all first-year calculus sessions.

  • Martin Czygan

    Martin Czygan studied German literature and computer science in Leipzig, Germany. He has been working as a software engineer for more than 10 years. For the past eight years, he has been diving into Python, and is still enjoying it. In recent years, he has been helping clients to build data processing pipelines and search and analytics systems.

  • Phuong Vo.T.H

    Phuong Vo.T.H has an MSc degree in computer science, with a focus on machine learning. After graduating, she worked as a data scientist at several companies. She has experience in analyzing user behavior and building recommendation systems based on users' web histories. She loves reading books on machine learning and mathematical algorithms, as well as articles on data analysis.

  • Magnus Vilhelm Persson

    Magnus Vilhelm Persson is a scientist with a passion for Python and open source software usage and development. He obtained his PhD in Physics/Astronomy from Copenhagen University's Centre for Star and Planet Formation (StarPlan) in 2013. Since then, he has continued his research in Astronomy at various academic institutes across Europe. In his research, he uses various types of data and analysis to gain insights into how stars are formed. He has participated in radio shows about Astronomy and also organized workshops and intensive courses about the use of Python for data analysis. You can check out his web page at http://vilhelm.nu.
