NumPy Cookbook


Product type Book
Published in Oct 2012
Publisher Packt
ISBN-13 9781849518925
Pages 226
Edition 1st Edition

Table of Contents (17 Chapters)

NumPy Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
1. Winding Along with IPython
2. Advanced Indexing and Array Concepts
3. Get to Grips with Commonly Used Functions
4. Connecting NumPy with the Rest of the World
5. Audio and Image Processing
6. Special Arrays and Universal Functions
7. Profiling and Debugging
8. Quality Assurance
9. Speed Up Code with Cython
10. Fun with Scikits
Index

Chapter 10. Fun with Scikits

In this chapter, we will cover the following topics:

  • Installing scikits-learn

  • Loading an example dataset

  • Clustering Dow Jones stocks with scikits-learn

  • Installing scikits-statsmodels

  • Performing a normality test with scikits-statsmodels

  • Installing scikits-image

  • Detecting corners

  • Detecting edges

  • Installing pandas

  • Estimating stock returns correlation with Pandas

  • Loading data as pandas objects from statsmodels

  • Resampling time series data

Introduction


Scikits are small projects related to SciPy in some way, but not part of SciPy itself. They are developed independently, yet operate under a common umbrella, as a consortium of sorts. In this chapter, we will discuss several Scikits projects, such as the following:

  • scikits-learn, a machine learning package

  • scikits-statsmodels, a statistics package

  • scikits-image, an image processing package

  • pandas, a data analysis package

Installing scikits-learn


The scikits-learn project aims to provide an API for machine learning. What I like most about it is the amazing documentation. We can install scikits-learn with the package manager of our operating system. This option may or may not be available, depending on the operating system, but should be the most convenient route.

Windows users can just download an installer from the project website. On Debian and Ubuntu, the project is named python-sklearn. On MacPorts, the ports are named py26-scikits-learn and py27-scikits-learn. We can also install from source, or using easy_install. There are third-party distributions from Python(x, y), Enthought, and NetBSD.

Getting ready

You need to have SciPy and NumPy installed. Go back to Chapter 1, Winding Along with IPython, for instructions, if necessary.

How to do it...

Let us now see how we can install the scikits-learn project.

  • Installing with easy_install: We can install by typing any one of the following commands, at the...

Loading an example dataset


The scikits-learn project comes with a number of datasets and sample images with which we can experiment. In this recipe, we will load an example dataset that is included with the scikits-learn distribution. The datasets hold data as a two-dimensional NumPy array, along with metadata linked to the data.

How to do it...

We will load a sample dataset of Boston house prices. It is a tiny dataset, so if you are looking for a house in Boston, don't get too excited. More datasets are described at http://scikit-learn.org/dev/modules/classes.html#module-sklearn.datasets.

We will look at the shape of the raw data, and its maximum and minimum values. The shape is a tuple, representing the dimensions of the NumPy array. We will do the same for the target array, which contains the values that are the learning objectives. The following code accomplishes our goals:

from sklearn import datasets

boston_prices = datasets.load_boston()
print "Data shape", boston_prices.data.shape...
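The listing above is cut off. As a complete sketch of the same inspection, here is a version written for current Python and scikit-learn, in which load_boston has since been removed; the load_diabetes dataset has the same structure (a data array, a target array, and metadata):

```python
from sklearn import datasets

# load_boston was removed from recent scikit-learn releases;
# load_diabetes exposes the same layout: .data, .target, and metadata
diabetes = datasets.load_diabetes()

print("Data shape", diabetes.data.shape)      # (442, 10)
print("Data max", diabetes.data.max())
print("Data min", diabetes.data.min())
print("Target shape", diabetes.target.shape)  # (442,)
print("Target max", diabetes.target.max())
print("Target min", diabetes.target.min())
```

The data and target attributes are plain NumPy arrays, so all the usual NumPy functions apply to them directly.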

Clustering Dow Jones stocks with scikits-learn


Clustering is a type of machine learning algorithm that aims to group items based on similarity. In this example, we will cluster the stocks of the Dow Jones Industrial Average based on the log returns of their prices. Most of the steps in this recipe have already appeared in previous chapters.

How to do it...

First, we will download the EOD price data for those stocks from Yahoo Finance. Second, we will calculate a square affinity matrix. Finally, we will cluster the stocks with the AffinityPropagation class.

  1. Downloading the price data.

    We will download price data for 2011 using the stock symbols of the DJI Index. In this example, we are only interested in the close price:

    # 2011 to 2012
    start = datetime.datetime(2011, 01, 01)
    end = datetime.datetime(2012, 01, 01)
    
    #Dow Jones symbols
    symbols = ["AA", "AXP", "BA", "BAC", "CAT", "CSCO", "CVX", "DD", "DIS", "GE", "HD", "HPQ", "IBM", "INTC", "JNJ", "JPM", "KFT", "KO", "MCD", "MMM", "MRK", "MSFT", "PFE",...
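The listing above is truncated, but the heart of the recipe, converting close prices to log returns and building the square affinity matrix, can be sketched with plain NumPy. The prices below are synthetic stand-ins, not real Dow Jones data:

```python
import numpy as np

rng = np.random.default_rng(42)

# synthetic close prices: 5 "stocks", 250 trading days each
close = np.cumprod(1 + 0.01 * rng.standard_normal((5, 250)), axis=1) * 100

# log returns, one row per stock
logreturns = np.diff(np.log(close), axis=1)

# square affinity matrix: negative squared Euclidean distance
# between every pair of return series (a similarity measure
# suitable for affinity propagation)
sims = -np.sum((logreturns[:, None, :] - logreturns[None, :, :]) ** 2, axis=2)

print(sims.shape)  # (5, 5)
```

The resulting matrix is symmetric with zeros on the diagonal, and a matrix like this can be handed to scikit-learn's AffinityPropagation via its precomputed-affinity mode.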

Installing scikits-statsmodels


The scikits-statsmodels package focuses on statistical modeling. It can be integrated with NumPy and Pandas (more about Pandas later in this chapter).

How to do it...

Source and binaries can be downloaded from http://statsmodels.sourceforge.net/install.html. If you are installing from source, you need to run the following command:

python setup.py install

If you are using setuptools, the command is:

easy_install statsmodels

Performing a normality test with scikits-statsmodels


The scikits-statsmodels package has lots of statistical tests. We will see an example of such a test—the Anderson-Darling test for normality (http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test).

How to do it...

We will download price data as in the previous recipe, but this time for a single stock. Again, we will calculate the log returns of the close price of this stock, and use them as input for the normality test function.

This function returns a tuple whose second element is a p-value between zero and one. The complete code for this tutorial is as follows:

import datetime
import numpy
from matplotlib import finance
from statsmodels.stats.adnorm import normal_ad
import sys

#1. Download price data

# 2011 to 2012
start = datetime.datetime(2011, 01, 01)
end = datetime.datetime(2012, 01, 01)

print "Retrieving data for", sys.argv[1]
quotes = finance.quotes_historical_yahoo(sys.argv[1], start, end, asobject=True)

close = numpy...
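The listing is cut off; as a self-contained sketch of the same idea, here is an Anderson-Darling normality check on synthetic log returns, using SciPy's scipy.stats.anderson. Note that, unlike statsmodels' normal_ad, it reports a statistic and critical values rather than a p-value:

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)

# synthetic daily log returns drawn from a normal distribution
logreturns = rng.normal(0.0, 0.01, size=252)

result = anderson(logreturns, dist='norm')
print("A-D statistic", result.statistic)
# critical values at the 15%, 10%, 5%, 2.5%, and 1% significance levels;
# for normal data the statistic typically stays below the 5% value
print("5% critical value", result.critical_values[2])
```

If the statistic exceeds a critical value, normality can be rejected at the corresponding significance level.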

Installing scikits-image


scikits-image is a toolkit for image processing, which requires PIL, SciPy, Cython, and NumPy. Windows installers are available for it. It is part of the Enthought Python Distribution, as well as the Python(x, y) distribution.

How to do it...

As usual, we can install using either of the following two commands:

pip install -U scikits-image
easy_install -U scikits-image

Again, you might need to run these commands as root.

Another option is to obtain the latest development version by cloning the Git repository, or downloading the repository as a zip file from Github. Then, you will need to run the following command:

python setup.py install

Detecting corners


Corner detection (http://en.wikipedia.org/wiki/Corner_detection) is a standard technique in computer vision. scikits-image offers a Harris corner detector, which is great, because corner detection is pretty complicated. Obviously, we could do it ourselves from scratch, but that would violate the cardinal rule of not reinventing the wheel.

Getting ready

You might need to install jpeglib on your system to be able to load the scikits-learn image, which is a JPEG file. If you are on Windows, use the installer; otherwise, download the distribution, unpack it, and build from the top folder with the following command:

./configure
make
sudo make install

How to do it...

We will load a sample image from scikits-learn. This is not absolutely necessary for this example; you can use any other image instead.

  1. Load the sample image.

    scikits-learn currently has two sample JPEG images in a dataset structure. We will look at the first image only:

    dataset = load_sample_images()
    img = dataset...
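The Harris detector scores locations where the local structure tensor has two large eigenvalues, which happens precisely at corners. As a toy illustration of that idea (not the scikits-image implementation), here is a minimal NumPy/SciPy sketch:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def harris_response(img, k=0.05, window=5):
    """Toy Harris corner response: det(M) - k * trace(M)**2."""
    img = img.astype(float)
    iy, ix = np.gradient(img)
    # structure tensor entries, averaged over a local window
    ixx = uniform_filter(ix * ix, window)
    iyy = uniform_filter(iy * iy, window)
    ixy = uniform_filter(ix * iy, window)
    det = ixx * iyy - ixy ** 2
    trace = ixx + iyy
    return det - k * trace ** 2

# a white square on a black background has four obvious corners
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0
response = harris_response(img)
print(response.shape)  # (40, 40)
```

Pixels with a large positive response are corner candidates; edges score negative and flat regions score near zero.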

Detecting edges


Edge detection is another popular image processing technique (http://en.wikipedia.org/wiki/Edge_detection). scikits-image has a Canny filter implementation that can perform edge detection out of the box. In addition to the image data as a 2D array, this filter accepts the following parameters:

  • Standard deviation of the Gaussian distribution

  • Lower bound threshold

  • Upper bound threshold

How to do it...

We will use the same image as in the previous recipe. The code is almost the same. You should pay extra attention to the one line where we call the Canny filter function:

from sklearn.datasets import load_sample_images 
from matplotlib.pyplot import imshow, show, axis
import numpy
import skimage.filter

dataset = load_sample_images()
img = dataset.images[0] 
edges = skimage.filter.canny(img[..., 0], 2, 0.3, 0.2)
axis('off')
imshow(edges)
show()

The code produces an image of the edges within the original picture, as shown...
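To get a feel for what the lower and upper thresholds do, here is a toy, NumPy-only version of the thresholding stage; the real Canny filter also performs Gaussian smoothing and non-maximum suppression, which this sketch skips:

```python
import numpy as np

def threshold_edges(img, low, high):
    """Classify pixels by gradient magnitude: strong edge, weak edge, or neither."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    strong = mag >= high
    weak = (mag >= low) & ~strong
    return strong, weak

# vertical step edge: left half dark, right half bright
img = np.zeros((10, 10))
img[:, 5:] = 1.0
strong, weak = threshold_edges(img, low=0.1, high=0.4)
```

In Canny proper, strong pixels are kept outright, while weak pixels survive only when connected to strong ones; that is how the two thresholds cooperate.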

Installing Pandas


Pandas is a Python library for data analysis. It has some similarities with R, a specialized programming language popular with data scientists, and these similarities are not coincidental; for instance, the core DataFrame object is inspired by R's data frame.

How to do it...

On PyPI, the project is called pandas. So, for instance, run either of the following two commands:

sudo easy_install -U pandas
pip install pandas

If you are using a Linux package manager, you will need to install the python-pandas project. On Ubuntu, you would do the following:

sudo apt-get install python-pandas

You can also install from source (requires Git):

git clone git://github.com/pydata/pandas.git 
cd pandas 
python setup.py install

Estimating stock returns correlation with Pandas


A Pandas DataFrame is a matrix- and dictionary-like data structure, similar to R's data frame. In fact, it is the central data structure in Pandas, and you can apply all kinds of operations to it. It is quite common to have a look, for instance, at the correlation matrix of a portfolio, so let's do that.

How to do it...

First, we will create a DataFrame with Pandas for each symbol's daily log returns. Then, we will join these on the date. Finally, the correlation matrix will be printed and a plot will be shown.

  1. Creating the data frame.

    To create the data frame, we will create a dictionary containing stock symbols as keys, and the corresponding log returns as values. The data frame itself has the date as index and the stock symbols as column labels:

    data = {}
    
    for i in xrange(len(symbols)):
      data[symbols[i]] = numpy.diff(numpy.log(close[i]))
    
    df = pandas.DataFrame(data, index=dates[0][:-1], columns=symbols)
  2. Operating on the data frame...
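With synthetic returns standing in for the downloaded prices, the data frame construction and the correlation step can be sketched as follows; the symbols and numbers are made up purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
symbols = ["AA", "AXP", "BA"]
dates = pd.date_range("2011-01-03", periods=250, freq="B")

# synthetic close prices, then daily log returns per symbol
data = {}
for sym in symbols:
    close = np.cumprod(1 + 0.01 * rng.standard_normal(251)) * 50
    data[sym] = np.diff(np.log(close))

# dates as the index, stock symbols as the column labels
df = pd.DataFrame(data, index=dates, columns=symbols)
print(df.corr())  # 3x3 correlation matrix, ones on the diagonal
```

The corr method is one of many DataFrame operations; once the returns are in a frame, plotting and aggregation come almost for free.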

Loading data as pandas objects from statsmodels


Statsmodels has quite a lot of sample datasets in its distribution. The complete list can be found at https://github.com/statsmodels/statsmodels/tree/master/statsmodels/datasets.

In this tutorial, we will concentrate on the copper dataset, which contains information about copper prices, world consumption, and other parameters.

Getting ready

Before we start, we might need to install patsy. It is easy enough to see whether this is necessary: just run the code. If you get errors related to patsy, you will need to execute either of the following two commands:

sudo easy_install patsy
pip install --upgrade patsy

How to do it...

In this section, we will see how we can load a dataset from statsmodels as a Pandas DataFrame or Series object.

  1. Loading the data.

    The function we need to call is load_pandas. Load the data as follows:

    data = statsmodels.api.datasets.copper.load_pandas()

    This loads the data in a DataSet object, which contains pandas objects.

  2. Fitting...

Resampling time series data


In this tutorial, we will learn how to resample time series with Pandas.

How to do it...

We will download the daily price time series data for AAPL, and resample it to monthly data by computing the mean. We will accomplish this by creating a Pandas DataFrame, and calling its resample method.

  1. Creating a date-time index.

Before we can create a Pandas DataFrame, we need to create a DatetimeIndex object to pass to the DataFrame constructor. Create the index from the downloaded quotes data as follows:

    dt_idx = pandas.DatetimeIndex(quotes.date)
  2. Creating the data frame.

    Once we have the date-time index, we can use it together with the close prices to create a data frame:

    df = pandas.DataFrame(quotes.close, index=dt_idx, columns=[symbol])
  3. Resample.

    Resample the time series to monthly frequency, by computing the mean:

    resampled = df.resample('M', how=numpy.mean)
    print resampled 

    The resampled time series, as shown in the following, has one value for each month:

                    ...
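In recent versions of Pandas, the how argument has been removed; resampling is now a two-step call such as .resample(...).mean(). A self-contained sketch with synthetic daily closes follows; the month-start frequency 'MS' and the numbers are chosen purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# synthetic daily close prices on a business-day index for 2011
dt_idx = pd.date_range("2011-01-03", periods=250, freq="B")
close = np.cumprod(1 + 0.01 * rng.standard_normal(250)) * 350

df = pd.DataFrame({"AAPL": close}, index=dt_idx)

# modern API: resample(...) returns a Resampler; call .mean() on it
resampled = df.resample("MS").mean()
print(len(resampled))  # 12 monthly values for one year of trading days
```

Each row of the resampled frame holds the mean close price of one month, exactly as in the book's output above.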