Python Data Analysis Cookbook

Chapter 1. Laying the Foundation for Reproducible Data Analysis

In this chapter, we will cover the following recipes:

Setting up Anaconda
Installing the Data Science Toolbox
Creating a virtual environment with virtualenv and virtualenvwrapper
Sandboxing Python applications with Docker images
Keeping track of package versions and history in IPython Notebooks
Configuring IPython
Learning to log for robust error checking
Unit testing your code
Configuring pandas
Configuring matplotlib
Seeding random number generators and NumPy print options
Standardizing reports, code style, and data access

Setting up Anaconda

Anaconda is a free Python distribution for data analysis and scientific computing. It has its own package manager, conda. The distribution includes more than 200 Python packages, which makes it very convenient. For casual users, the Miniconda distribution may be the better choice. Miniconda contains only the conda package manager and Python. The technical editors use Anaconda, and so do I. But don't worry, I will also describe alternative installation instructions in this book for readers who are not using Anaconda. In this recipe, we will install Anaconda and Miniconda and create a virtual environment.

Getting ready

The procedures to install Anaconda and Miniconda are similar. Obviously, Anaconda requires more disk space. Follow the instructions on the Anaconda website at http://conda.pydata.org/docs/install/quick.html (retrieved Mar 2016). First, you have to download the appropriate installer for your operating system and Python version. Sometimes, you can choose between a GUI and a command-line installer. I used the Python 3.4 installer, although my system Python version is v2.7. This is possible because Anaconda comes with its own Python. On my machine, the Anaconda installer created an anaconda directory in my home directory and required about 900 MB. The Miniconda installer installs a miniconda directory in your home directory.

How to do it...

Now that Anaconda or Miniconda is installed, list the packages with the following command:
```
$ conda list
```
For reproducibility, it is good to know that we can export packages:
```
$ conda list --export
```
The preceding command prints packages and versions on the screen, which you can save in a file. You can install these packages with the following command:
```
$ conda create -n ch1env --file <export file>
```
This command also creates an environment named ch1env.
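To make the export format more concrete, here is a minimal Python sketch that parses text in the name=version=build layout printed by conda list --export. The parse_conda_export() function and the sample text are made up for this illustration and are not part of conda:

```python
def parse_conda_export(text):
    """Map package names to versions from `conda list --export` style text."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the comment header that conda writes.
        if not line or line.startswith('#'):
            continue
        # Each entry looks like name=version=build; the build part may be absent.
        parts = line.split('=')
        if len(parts) >= 2:
            packages[parts[0]] = parts[1]
    return packages


# Hypothetical sample, shaped like real export output.
sample = """# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
numpy=1.9.2=py34_0
pandas=0.16.2=np19py34_0
"""

print(parse_conda_export(sample))
```

A dictionary like this is handy when you want to compare the package versions of two environments.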
The following command creates a simple testenv environment:
```
$ conda create --name testenv python=3
```
On Linux and Mac OS X, switch to this environment with the following command:
```
$ source activate testenv
```
On Windows, we don't need source. The syntax to switch back is similar:
```
$ [source] deactivate
```
The following command prints export information for the environment in YAML format (YAML is explained in the There's more... section):
```
$ conda env export -n testenv
```
To remove the environment, type the following (note that even after removing, the name of the environment still exists in ~/.conda/environments.txt):
```
$ conda remove -n testenv --all
```
Search for a package as follows:
```
$ conda search numpy
```
In this example, we searched for the NumPy package. If NumPy is already present, Anaconda shows an asterisk in the output at the corresponding entry.
Update the distribution as follows:
```
$ conda update conda
```

There's more...

The .condarc configuration file follows the YAML syntax.

Note

YAML is a human-readable configuration file format with the extension .yaml or .yml. YAML was initially released in 2001, with the latest release in 2009. The YAML homepage is at http://yaml.org/ (retrieved July 2015).

You can find a sample configuration file at http://conda.pydata.org/docs/install/sample-condarc.html (retrieved July 2015). The related documentation is at http://conda.pydata.org/docs/install/config.html (retrieved July 2015).

Installing the Data Science Toolbox

The Data Science Toolbox (DST) is a virtual environment based on Ubuntu for data analysis using Python and R. Since DST is a virtual environment, we can install it on various operating systems. We will install DST locally, which requires VirtualBox and Vagrant. VirtualBox is a virtual machine application originally created by Innotek GmbH in 2007. Vagrant is a wrapper around virtual machine applications such as VirtualBox created by Mitchell Hashimoto.

Getting ready

You need on the order of 2 to 3 GB of free disk space for VirtualBox, Vagrant, and DST itself. This may vary by operating system.

How to do it...

Installing DST requires the following steps:

Install VirtualBox by downloading an installer for your operating system and architecture from https://www.virtualbox.org/wiki/Downloads (retrieved July 2015) and running it. I installed VirtualBox 4.3.28-100309 myself, but you can just install whatever the most recent VirtualBox version at the time is.
Install Vagrant by downloading an installer for your operating system and architecture from https://www.vagrantup.com/downloads.html (retrieved July 2015). I installed Vagrant 1.7.2 and again you can install a more recent version if available.
Create a directory to hold the DST and navigate to it with a terminal. Run the following commands:
```
$ vagrant init data-science-toolbox/dst
$ vagrant up
```
The first command creates a Vagrantfile configuration file. Most of the content is commented out, but the file does contain links to documentation that might be useful. The second command creates the DST and initiates a download that could take a couple of minutes.
Connect to the virtual environment as follows (on Windows, use PuTTY):
```
$ vagrant ssh
```
View the preinstalled Python packages with the following command:
```
vagrant@data-science-toolbox:~$ pip freeze
```
The list is quite long; in my case it contained 32 packages. The DST Python version as of July 2015 was 2.7.6.

When you are done with the DST, log out and suspend (you can also halt it completely) the VM:

vagrant@data-science-toolbox:~$ logout
Connection to 127.0.0.1 closed.
$ vagrant suspend
==> default: Saving VM state and suspending execution...

How it works...

Virtual machines (VMs) emulate computers in software. VirtualBox is an application that creates and manages VMs. VirtualBox stores its VMs in your home folder, and this particular VM takes about 2.2 GB of storage.

Ubuntu is an open source Linux operating system, and we are allowed by its license to create virtual machines. Ubuntu has several versions; we can get more info with the lsb_release command:

vagrant@data-science-toolbox:~$ lsb_release -a
No LSB modules are available.
Distributor ID:    Ubuntu
Description:    Ubuntu 14.04 LTS
Release:    14.04
Codename:    trusty

Vagrant used to only work with VirtualBox, but currently it also supports VMware, KVM, Docker, and Amazon EC2. Vagrant calls virtual machines boxes. Some of these boxes are available for everyone at http://www.vagrantbox.es/ (retrieved July 2015).

Creating a virtual environment with virtualenv and virtualenvwrapper

Virtual environments provide dependency isolation for small projects. They also keep your site-packages directory small. Since Python 3.3, a subset of virtualenv has been part of the standard Python distribution in the form of the venv module and the pyvenv script. The virtualenvwrapper Python project has some extra convenient features for virtual environment management. I will demonstrate virtualenv and virtualenvwrapper functionality in this recipe.

Getting ready

You need Python 3.3 or later. You can install virtualenvwrapper with the pip command as follows:

$ [sudo] pip install virtualenvwrapper

On Linux and Mac, it's necessary to do some extra work—specifying a directory for the virtual environments and sourcing a script:

$ export WORKON_HOME=/tmp/envs
$ source /usr/local/bin/virtualenvwrapper.sh

Windows has a separate version, which you can install with the following command:

$ pip install virtualenvwrapper-win

How to do it...

Create a virtual environment for a given directory with the pyvenv script, which is part of your Python distribution:
```
$ pyvenv /tmp/testenv
$ ls /tmp/testenv
bin        include        lib        pyvenv.cfg
```
In this example, we created a testenv directory in the /tmp directory with several directories and a configuration file. The configuration file pyvenv.cfg contains the Python version and the home directory of the Python distribution.
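The same kind of environment can also be created from Python with the standard library venv module. The following is a minimal sketch, assuming Python 3.4 or later; the read_pyvenv_cfg() helper is a hypothetical name introduced for this example. It creates a throwaway environment and reads the configuration file back:

```python
import os
import tempfile
import venv


def read_pyvenv_cfg(env_dir):
    """Return the key/value pairs stored in an environment's pyvenv.cfg."""
    settings = {}
    with open(os.path.join(env_dir, 'pyvenv.cfg')) as cfg:
        for line in cfg:
            key, sep, value = line.partition('=')
            if sep:
                settings[key.strip()] = value.strip()
    return settings


# Create a throwaway environment; skipping pip keeps this fast.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = os.path.join(tmp, 'testenv')
    venv.create(env_dir, with_pip=False)
    cfg = read_pyvenv_cfg(env_dir)

# home is the directory of the Python used to create the environment.
print(cfg['home'], cfg['version'])
```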
Activate the environment on Linux or Mac by sourcing the activate script, for example, with the following command:
```
$ source bin/activate
```
On Windows, use the activate.bat file.
You can now install packages in this environment in isolation. When you are done with the environment, switch back on Linux or Mac with the following command:
```
$ deactivate
```
On Windows, use the deactivate.bat file.
Alternatively, you could use virtualenvwrapper. Create and switch to a virtual environment with the following command:
```
vagrant@data-science-toolbox:~$ mkvirtualenv env2
```
Deactivate the environment with the deactivate command:
```
(env2)vagrant@data-science-toolbox:~$ deactivate
```

Delete the environment with the rmvirtualenv command:

vagrant@data-science-toolbox:~$ rmvirtualenv env2

Sandboxing Python applications with Docker images

Docker uses Linux kernel features to provide an extra virtualization layer. Docker was created in 2013 by Solomon Hykes. Boot2Docker allows us to install Docker on Windows and Mac OS X too. Boot2Docker uses a VirtualBox VM that contains a Linux environment with Docker. In this recipe, we will set up Docker and download the continuumio/miniconda3 Docker image.

Getting ready

The Docker installation docs are available at https://docs.docker.com/index.html (retrieved July 2015). I installed Docker 1.7.0 with Boot2Docker. The installer requires about 133 MB. However, if you want to follow the whole recipe, you will need several gigabytes.

How to do it...

Once Boot2Docker is installed, you need to initialize the environment. This is only necessary once, and Linux users don't need this step:

$ boot2docker init
Latest release for github.com/boot2docker/boot2docker is v1.7.0
Downloading boot2docker ISO image...
Success: downloaded https://github.com/boot2docker/boot2docker/releases/download/v1.7.0/boot2docker.iso

In the preceding step, you downloaded a VirtualBox VM to a directory such as /VirtualBox\ VMs/boot2docker-vm/.
The next step for Mac OS X and Windows users is to start the VM:
```
$ boot2docker start
```
Check the Docker environment by starting a sample container:
```
$ docker run hello-world
```
Note
Some people reported a hopefully temporary issue of not being able to connect. The issue can be resolved by issuing commands with an extra argument, for instance:
```
$ docker [--tlsverify=false] run hello-world
```
Docker images can be made public. We can search for such images and download them. In Setting up Anaconda, we installed Anaconda; however, Anaconda and Miniconda Docker images also exist. Use the following command:
```
$ docker search continuumio
```
The preceding command shows a list of Docker images from Continuum Analytics – the company that developed Anaconda and Miniconda. Download the Miniconda 3 Docker image as follows (if you prefer using my container, skip this):
```
$ docker pull continuumio/miniconda3
```
Start the image with the following command:
```
$ docker run -t -i continuumio/miniconda3 /bin/bash
```
We start out as root in the image.
The command $ docker images should list the continuumio/miniconda3 image as well. If you prefer not to install too much software (possibly only Docker and Boot2Docker) for this book, you should use the image I created. It uses the continuumio/miniconda3 image as a template. This image allows you to execute Python scripts in the current working directory on your computer, while using installed software from the Docker image:
```
$ docker run -it -p 8888:8888 -v $(pwd):/usr/data -w /usr/data "ivanidris/pydacbk:latest" python <somefile>.py 
```

You can also run an IPython notebook in your current working directory with the following command:

$ docker run -it -p 8888:8888 -v $(pwd):/usr/data -w /usr/data "ivanidris/pydacbk:latest" sh -c "ipython notebook --ip=0.0.0.0 --no-browser"

Then, go to either http://192.168.59.103:8888 or http://localhost:8888 to view the IPython home screen. You might have noticed that the command lines are quite long, so I will post additional tips and tricks to make life easier on https://pythonhosted.org/dautil (work in progress).
The Boot2Docker VM shares the /Users directory on Mac OS X and the C:\Users directory on Windows. In general and on other operating systems, we can mount directories and copy files from the container as described in https://docs.docker.com/userguide/dockervolumes/ (retrieved July 2015).
Shut down the VM (unless you are on Linux, where you use the docker command instead) with the following command:
```
$ boot2docker down
```

How it works...

Docker Hub acts as a central registry for public and private Docker images. In this recipe, we downloaded images via this registry. To push an image to Docker Hub, we need to create a local registry first. The way Docker Hub works is in many ways comparable to the way source code repositories such as GitHub work. You can commit changes as well as push, pull, and tag images. The continuumio/miniconda3 image is configured with a special file, which you can find at https://github.com/ContinuumIO/docker-images/blob/master/miniconda3/Dockerfile (retrieved July 2015). In this file, you can read which image was used as base, the name of the maintainer, and the commands used to build the image.

Keeping track of package versions and history in IPython Notebooks

The IPython Notebook was added to IPython 0.12 in December 2011. Many Pythonistas feel that the IPython Notebook is essential for reproducible data analysis. The IPython Notebook is comparable to commercial products such as Mathematica, MATLAB, and Maple. It is an interactive web browser-based environment. In this recipe, we will see how to keep track of package versions and store IPython sessions in the context of reproducible data analysis. By the way, the IPython Notebook has been renamed Jupyter Notebook.

Getting ready

For this recipe, you will need a recent IPython installation. The instructions to install IPython are at http://ipython.org/install.html (retrieved July 2015). Install it using the pip command:

$ [sudo] pip install ipython/jupyter

If you have installed IPython via Anaconda already, check for updates with the following commands:

$ conda update conda
$ conda update ipython ipython-notebook ipython-qtconsole

I have IPython 3.2.0 as part of the Anaconda distribution.

How to do it...

We will log a Python session and use the watermark extension to track package versions and other information. Start an IPython shell or notebook. When we start a session, we can use the command line switch --logfile=<file name>.py. In this recipe, we use the %logstart magic (IPython terminology) function:

In [1]: %logstart cookbook_log.py rotate
Activating auto-logging. Current session state plus future input saved.
Filename       : cookbook_log.py
Mode           : rotate
Output logging : False
Raw input log  : False
Timestamping   : False
State          : active

This example invocation started logging to a file in rotate mode. Both the filename and mode are optional. Turn logging off and back on again as follows:

In [2]: %logoff
Switching logging OFF

In [3]: %logon
Switching logging ON

Install the watermark magic from GitHub with the following command:

In [4]: %install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py

The preceding line downloads a Python file, in my case, to ~/.ipython/extensions/watermark.py. Load the extension by typing the following line:

%load_ext watermark

The extension can place timestamps as well as software and hardware information. Get additional usage documentation and version (I installed watermark 1.2.2) with the following command:

%watermark?

For example, call watermark without any arguments:

In [7]: %watermark
… Omitting time stamp …

CPython 3.4.3
IPython 3.2.0

compiler   : Omitting
system     : Omitting
release    : 14.3.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit

I omitted the timestamp and other information for personal reasons. A more complete example follows with author name (-a), versions of packages specified as a comma-separated string (-p), and custom time (-c) in a strftime() based format:

In [8]: %watermark -a "Ivan Idris" -v -p numpy,scipy,matplotlib -c '%b %Y' -w
Ivan Idris 'Jul 2015'

CPython 3.4.3
IPython 3.2.0

numpy 1.9.2
scipy 0.15.1
matplotlib 1.4.3
watermark v. 1.2.2

How it works...

The IPython logger writes commands you type to a Python file. Most of the lines are in the following format:

get_ipython().magic('STRING_YOU_TYPED')

You can replay the session with %load <log file>. The logging modes are described in the following table:

Mode	Description
over	This mode overwrites existing log files.
backup	If a log file exists with the same name, the old file is renamed.
append	This mode appends lines to already existing files.
rotate	This mode rotates log files by incrementing numbers, so that log files don't get too big.

We used a custom magic function available on the Internet. The code for the function is in a single Python file and it should be easy for you to follow. If you want different behavior, you just need to modify the file.
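If you only need the version information and not the magic itself, a similar report can be put together with the standard library. The following is a minimal sketch, assuming Python 3.8 or later for importlib.metadata; version_report() is a hypothetical helper and is not part of watermark:

```python
import platform
from importlib import metadata


def version_report(packages):
    """Collect interpreter and package versions in a dictionary."""
    report = {
        platform.python_implementation(): platform.python_version(),
        'machine': platform.machine(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Best effort: record the gap instead of failing.
            report[name] = 'not installed'
    return report


print(version_report(['numpy', 'scipy', 'matplotlib']))
```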

Configuring IPython

IPython has an elaborate configuration and customization system. The components of the system are as follows:

IPython provides default profiles, but we can create our own profiles
Various settable options for the shell, kernel, Qt console, and notebook
Customization of prompts and colors
Extensions we saw in Keeping track of package versions and history in IPython notebooks
Startup files per profile

I will demonstrate some of these components in this recipe.

Getting ready

You need IPython for this recipe, so (if necessary) have a look at the Getting ready section of Keeping track of package versions and history in IPython notebooks.

How to do it...

Let's start with a startup file. I have a directory in my home directory at .ipython/profile_default/startup, which belongs to the default profile. This directory is meant for startup files. IPython detects Python files in this directory and executes them in lexical order of filenames. Because of the lexical order, it is convenient to name the startup files with a combination of digits and strings, for example, 0000-watermark.py. Put the following code in the startup file:

get_ipython().magic('%load_ext watermark')
get_ipython().magic('watermark -a "Ivan Idris" -v -p numpy,scipy,matplotlib -c \'%b %Y\' -w')

This startup file loads the extension we used in Keeping track of package versions and history in IPython notebooks and shows information about package versions. Other use cases include importing modules and defining functions. IPython stores commands in a SQLite database, so you could gather statistics to find common usage patterns. The following script prints source lines and associated counts from the database for the default profile sorted by counts (the code is in the ipython_history.py file in this book's code bundle):

import sqlite3
from IPython.utils.path import get_ipython_dir
import pprint
import os

def print_history(file):
    with sqlite3.connect(file) as con:
        c = con.cursor()
        c.execute("SELECT count(source_raw) as csr,\
                  source_raw FROM history\
                  GROUP BY source_raw\
                  ORDER BY csr")
        result = c.fetchall()
        pprint.pprint(result)
        c.close()

hist_file = '%s/profile_default/history.sqlite' % get_ipython_dir()

if os.path.exists(hist_file):
    print_history(hist_file)
else:
    print("%s doesn't exist" % hist_file)

The SQL query does the bulk of the work. The code is self-explanatory. If it is not clear, I recommend reading Chapter 8, Text Mining and Social Network, of my book Python Data Analysis, Packt Publishing.
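To see what the query returns without touching your real history file, you can run it against an in-memory SQLite database. The following sketch assumes a history table with the column names used here (of which only source_raw matters for the query); the rows are invented for this example:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE history "
            "(session INTEGER, line INTEGER, source TEXT, source_raw TEXT)")

# Made-up session content: three distinct commands typed 3, 2, and 1 times.
typed = ['df.head()', 'import numpy as np', 'df.head()',
         'np.pi', 'import numpy as np', 'df.head()']
rows = [(1, i, cmd, cmd) for i, cmd in enumerate(typed, start=1)]
con.executemany("INSERT INTO history VALUES (?, ?, ?, ?)", rows)

# Same grouping and ordering as in print_history().
counts = con.execute("SELECT count(source_raw) AS csr, source_raw "
                     "FROM history GROUP BY source_raw "
                     "ORDER BY csr").fetchall()
con.close()

print(counts)
```

The least used commands come first because the query sorts by count in ascending order.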

The other configuration option I mentioned is profiles. We can use the default profiles or create our own profiles on a per project or functionality basis. Profiles act as sandboxes and you can configure them separately. Here's the command to create a profile:

$ ipython profile create [newprofile]

The configuration files are Python files and their names end with _config.py. In these files, you can set various IPython options. Set the option to automatically log the IPython session as follows:

c = get_config()

c.TerminalInteractiveShell.logstart=True

The first line is usually included in configuration files and gets the root IPython configuration object. The last line tells IPython that we want to start logging immediately on startup so you don't have to type %logstart.

Alternatively, you can set the log file name with the following configuration line:

c.TerminalInteractiveShell.logfile='mylog_file.py'

You can also use the following configuration line that ensures logging in append mode:

c.TerminalInteractiveShell.logappend='mylog_file.py'

Learning to log for robust error checking

Notebooks are useful to keep track of what you did and what went wrong. Logging works in a similar fashion, and we can log errors and other useful information with the standard Python logging library.

For reproducible data analysis, it is good to know the modules our Python scripts import. In this recipe, I will introduce a minimal API from dautil that logs package versions of imported modules in a best effort manner.

Getting ready

In this recipe, we import NumPy and pandas, so you may need to install them. See the Configuring pandas recipe for pandas installation instructions. Installation instructions for NumPy can be found at http://docs.scipy.org/doc/numpy/user/install.html (retrieved July 2015). Alternatively, install NumPy with pip using the following command:

$ [sudo] pip install numpy

The command for Anaconda users is as follows:

$ conda install numpy

I have installed NumPy 1.9.2 via Anaconda. We also require AppDirs to find the appropriate directory to store logs. Install it with the following command:

$ [sudo] pip install appdirs

I have AppDirs 1.4.0 on my system.

How to do it...

To log, we need to create and set up loggers. We can either set up the loggers with code or use a configuration file. Configuring loggers with code is the more flexible option, but configuration files tend to be more readable. I use the log.conf configuration file from dautil:

[loggers]
keys=root

[handlers]
keys=consoleHandler,fileHandler

[formatters]
keys=simpleFormatter

[logger_root]
level=DEBUG
handlers=consoleHandler,fileHandler

[handler_consoleHandler]
class=StreamHandler
level=INFO
formatter=simpleFormatter
args=(sys.stdout,)

[handler_fileHandler]
class=dautil.log_api.VersionsLogFileHandler
formatter=simpleFormatter
args=('versions.log',)

[formatter_simpleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
datefmt=%d-%b-%Y

The file configures a logger to log to a file with the DEBUG level and to the screen with the INFO level. So, the logger logs more to the file than to the screen. The file also specifies the format of the log messages. I created a tiny API in dautil, which creates a logger with its get_logger() function and uses it to log the package versions of a client program with its log() function. The code is in the log_api.py file of dautil:

from pkg_resources import get_distribution
from pkg_resources import resource_filename
import logging
import logging.config
import pprint
from appdirs import AppDirs
import os


def get_logger(name):
    log_config = resource_filename(__name__, 'log.conf')
    logging.config.fileConfig(log_config)
    logger = logging.getLogger(name)

    return logger


def shorten(module_name):
    dot_i = module_name.find('.')

    return module_name[:dot_i]


def log(modules, name):
    skiplist = ['pkg_resources', 'distutils']

    logger = get_logger(name)
    logger.debug('Inside the log function')

    for k in modules.keys():
        str_k = str(k)

        if '.version' in str_k:
            short = shorten(str_k)

            if short in skiplist:
                continue

            try:
                logger.info('%s=%s' % (short,
                            get_distribution(short).version))
            except ImportError:
                logger.warning('Could not import %s', short)


class VersionsLogFileHandler(logging.FileHandler):
    def __init__(self, fName):
        dirs = AppDirs("PythonDataAnalysisCookbook", 
                       "Ivan Idris")
        path = dirs.user_log_dir
        print(path)

        if not os.path.exists(path):
            os.makedirs(path)

        super(VersionsLogFileHandler, self).__init__(
              os.path.join(path, fName))

The program that uses the API is in the log_demo.py file in this book's code bundle:

import sys
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from dautil import log_api

log_api.log(sys.modules, sys.argv[0])

How it works...

We configured a handler (VersionsLogFileHandler) that writes to a file and a handler (StreamHandler) that displays messages on the screen. StreamHandler is a class in the Python standard library. To configure the format of the log messages, we defined a formatter named simpleFormatter, which by default is an instance of the Formatter class from the Python standard library.
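The recipe configures the loggers with a file, but the same idea can be expressed in code. The following is a minimal sketch using only the standard library; build_logger() is a hypothetical helper, and two in-memory streams stand in for the screen and the versions.log file so that the effect of the handler levels is easy to inspect:

```python
import io
import logging


def build_logger(name, screen, logfile):
    """Attach an INFO-level and a catch-all handler to one logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the demo output out of the root logger

    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%d-%b-%Y')

    console = logging.StreamHandler(screen)
    console.setLevel(logging.INFO)  # the screen only shows INFO and above
    file_like = logging.StreamHandler(logfile)  # no level set: gets everything

    for handler in (console, file_like):
        handler.setFormatter(formatter)
        logger.addHandler(handler)

    return logger


# In-memory streams stand in for the screen and versions.log.
screen, logfile = io.StringIO(), io.StringIO()
logger = build_logger('demo', screen, logfile)
logger.debug('Inside the log function')
logger.info('numpy=1.9.2')

print(len(screen.getvalue().splitlines()),
      len(logfile.getvalue().splitlines()))
```

The DEBUG message only reaches the stream that plays the role of the file, which matches the behavior described for log.conf.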

The API I made goes through modules listed in the sys.modules variable and tries to get the versions of the modules. Some of the modules are not relevant for data analysis, so we skip them. The log() function of the API logs a DEBUG level message with the debug() method. The info() method logs the package version at INFO level.
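The selection step can be tried in isolation with a plain dictionary in place of sys.modules. This is a simplified sketch of the idea rather than the dautil code itself; versioned_packages() is a hypothetical name:

```python
def versioned_packages(modules, skiplist=('pkg_resources', 'distutils')):
    """Return top-level package names that expose a .version submodule."""
    found = set()
    for key in modules:
        name = str(key)
        if '.version' in name:
            # Keep only the part before the first dot, e.g. numpy.
            short = name.split('.')[0]
            if short not in skiplist:
                found.add(short)
    return sorted(found)


# Stand-in for sys.modules; the values are irrelevant for the selection.
fake_modules = {'numpy.version': None, 'distutils.version': None,
                'os': None, 'pandas.version': None}

print(versioned_packages(fake_modules))
```

Passing sys.modules instead of fake_modules lists the candidate packages of a running program.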

Unit testing your code

If code doesn't do what you want, it's hard to do reproducible data analysis. One way to gain control of your code is to test it. If you have tested code manually, you know it is repetitive and boring. When a task is boring and repetitive, you should automate it.

Unit testing automates testing and I hope you are familiar with it. When you learn unit testing for the first time, you start with simple tests such as comparing strings or numbers. However, you hit a wall when file I/O or other resources come into the picture. It turns out that in Python we can mock resources or external APIs easily. The packages needed are even part of the standard Python library. In the Learning to log for robust error checking recipe, we logged messages to a file. If we unit test this code, we don't want to trigger logging from the test code. In this recipe, I will show you how to mock the logger and other software components we need.

Getting ready

Familiarize yourself with the code under test in log_api.py.

How to do it...

The code for this recipe is in the test_log_api.py file of dautil. We start by importing the module under test and the Python functionality we need for unit testing:

from dautil import log_api
import unittest
from unittest.mock import create_autospec
from unittest.mock import patch

Define a class that contains the test code:

class TestLogApi(unittest.TestCase):

Make the unit tests executable with the following lines:

if __name__ == '__main__':
    unittest.main()

If we call Python functions with the wrong number of arguments, we expect to get a TypeError. The following tests check for that:

    def test_get_logger_args(self):
        mock_get_logger = create_autospec(log_api.get_logger, return_value=None)
        mock_get_logger('test')
        mock_get_logger.assert_called_once_with('test')

    def test_log_args(self):
        mock_log = create_autospec(log_api.log, return_value=None)
        mock_log([], 'test')
        mock_log.assert_called_once_with([], 'test')

        with self.assertRaises(TypeError):
            mock_log()

        with self.assertRaises(TypeError):
            mock_log('test')

We used the unittest.mock.create_autospec() function to mock the functions under test. Mock the Python logging package as follows:

    @patch('dautil.log_api.logging')
    def test_get_logger_fileConfig(self, mock_logging):
        log_api.get_logger('test')
        self.assertTrue(mock_logging.config.fileConfig.called)

The @patch decorator replaces logging with a mock for the duration of the test. We can patch our own functions in the same way, and this patching trick is quite useful. Test our log() function, with get_logger() patched, using the following method:

    @patch('dautil.log_api.get_logger')
    def test_log_debug(self, amock):
        log_api.log({}, 'test')
        self.assertTrue(amock.return_value.debug.called)
        amock.return_value.debug.assert_called_once_with(
                'Inside the log function')

The previous lines check whether debug() was called and with which arguments. The following two test methods demonstrate how to use multiple @patch decorators:

    @patch('dautil.log_api.get_distribution')
    @patch('dautil.log_api.get_logger')
    def test_numpy(self, m_get_logger, m_get_distribution):
        log_api.log({'numpy.version': ''}, 'test')
        m_get_distribution.assert_called_once_with('numpy')
        self.assertTrue(m_get_logger.return_value.info.called)

    @patch('dautil.log_api.get_distribution')
    @patch('dautil.log_api.get_logger')
    def test_distutils(self, amock, m_get_distribution):
        log_api.log({'distutils.version': ''}, 'test')
        self.assertFalse(m_get_distribution.called)

How it works...

Mocking is a technique to spy on objects and functions. We substitute them with our own spies, which we give just enough information to avoid detection. The spies report to us who contacted them and any useful information they received.
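The following self-contained sketch shows both kinds of spies outside of dautil. The scale() function is a toy stand-in for code under test, and os.getcwd() is patched only as a convenient standard library target:

```python
import os
from unittest.mock import create_autospec, patch


def scale(value, factor):
    """Toy function standing in for the code under test."""
    return value * factor


# An autospecced mock keeps the signature of the original function.
spy = create_autospec(scale, return_value=None)
spy(2, 3)
spy.assert_called_once_with(2, 3)

try:
    spy(2)  # one argument too few
    wrong_call_rejected = False
except TypeError:
    wrong_call_rejected = True

# patch() swaps a name for a mock only inside the with block.
with patch('os.getcwd', return_value='/fake/dir') as fake_getcwd:
    reported = os.getcwd()

fake_getcwd.assert_called_once_with()
print(wrong_call_rejected, reported)
```

Once the with block ends, os.getcwd() is the real function again, so the spy does not leak into other tests.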

Configuring pandas

The pandas library has more than a dozen configuration options, as described in http://pandas.pydata.org/pandas-docs/dev/options.html (retrieved July 2015).

Note

The pandas library is Python open source software originally created for econometric data analysis. It uses data structures inspired by the R programming language.

You can set and get properties using dotted notation or via functions. It is also possible to reset options to defaults and get information about them. The option_context() function allows you to limit the scope of the option to a context using the Python with statement. In this recipe, I will demonstrate pandas configuration and a simple API to set and reset options I find useful. The two options are precision and max_rows. The first option specifies floating point precision of output. The second option specifies the maximum rows of a pandas DataFrame to print on the screen.
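As a quick illustration of limiting the scope of options, the following sketch uses option_context() with the fully qualified option names display.precision and display.max_rows. It assumes that pandas is installed; the variable names are only for this example:

```python
import pandas as pd

default_rows = pd.get_option('display.max_rows')

# The options only apply inside the with block.
with pd.option_context('display.precision', 4, 'display.max_rows', 5):
    rows_inside = pd.get_option('display.max_rows')

# Outside the block, the previous value is back.
rows_after = pd.get_option('display.max_rows')

print(default_rows, rows_inside, rows_after)
```

This is convenient when you want a compact printout in one place without changing the settings for the rest of the session.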

Getting ready

You need pandas and NumPy for this recipe. Instructions to install NumPy are given in Learning to log for robust error checking. The pandas installation documentation can be found at http://pandas.pydata.org/pandas-docs/dev/install.html (retrieved July 2015). The recommended way to install pandas is via Anaconda. I have installed pandas 0.16.2 via Anaconda. You can update your Anaconda pandas with the following command:

$ conda update pandas

How to do it...

The following code from the options.py file in dautil defines a simple API to set and reset options:

import pandas as pd

def set_pd_options():
    # Full option names are unambiguous across pandas versions.
    pd.set_option('display.precision', 4)
    pd.set_option('display.max_rows', 5)

def reset_pd_options():
    pd.reset_option('display.precision')
    pd.reset_option('display.max_rows')

The script in configure_pd.py in this book's code bundle uses this API as follows:

from dautil import options
import pandas as pd
import numpy as np
from dautil import log_api

printer = log_api.Printer()
print(pd.describe_option('display.precision'))
print(pd.describe_option('display.max_rows'))

printer.print('Initial precision', pd.get_option('display.precision'))
printer.print('Initial max_rows', pd.get_option('display.max_rows'))

# Random pi's, should use random state if possible
np.random.seed(42)
df = pd.DataFrame(np.pi * np.random.rand(6, 2))
printer.print('Initial df', df)

options.set_pd_options()
printer.print('df with different options', df)

options.reset_pd_options()
printer.print('df after reset', df)

If you run the script, you get descriptions for the options that are a bit too long to display here. The getter gives the following output:

'Initial precision'
7

'Initial max_rows'
60

Then, we create a pandas DataFrame table with random data. The initial printout looks like this:

'Initial df'
          0         1
0  1.176652  2.986757
1  2.299627  1.880741
2  0.490147  0.490071
3  0.182475  2.721173
4  1.888459  2.224476
5  0.064668  3.047062

The printout comes from the following class in log_api.py:

import pprint


class Printer():
    def __init__(self, modules=None, name=None):
        if modules and name:
            log(modules, name)

    def print(self, *args):
        for arg in args:
            pprint.pprint(arg)

After setting the options with the dautil API, pandas hides some of the rows and the floating point numbers look different too:

'df with different options'
        0      1
0   1.177  2.987
1   2.300  1.881
..    ...    ...
4   1.888  2.224
5   0.065  3.047

[6 rows x 2 columns]

Because of the truncated rows, pandas tells us how many rows and columns the DataFrame table has. After we reset the options, we get the original printout back.

Configuring matplotlib

The matplotlib library allows configuration via the matplotlibrc files and Python code. The last option is what we are going to do in this recipe. Small configuration tweaks should not matter if your data analysis is strong. However, it doesn't hurt to have consistent and attractive plots. Another option is to apply stylesheets, which are files comparable to the matplotlibrc files. However, in my opinion, the best option is to use Seaborn on top of matplotlib. I will discuss Seaborn and matplotlib in more detail in Chapter 2, Creating Attractive Data Visualizations.

Getting ready

You need to install matplotlib for this recipe. Visit http://matplotlib.org/users/installing.html (retrieved July 2015) for more information. I have matplotlib 1.4.3 via Anaconda. Install Seaborn using Anaconda:

$ conda install seaborn

I have installed Seaborn 0.6.0 via Anaconda.

How to do it...

We can set options via a dictionary-like variable. The following function from the options.py file in dautil sets three options:

import matplotlib as mpl


def set_mpl_options():
    mpl.rcParams['legend.fancybox'] = True
    mpl.rcParams['legend.shadow'] = True
    mpl.rcParams['legend.framealpha'] = 0.7

All three options affect legends. The first option gives the legend box rounded corners, the second option enables a shadow, and the third option makes the legend slightly transparent. The matplotlib rcdefaults() function resets the configuration.
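If you only need the settings for part of a script, matplotlib also provides the rc_context() context manager, which restores the previous values when the block ends. The following is a minimal sketch of that alternative and is not part of dautil. The variable names are only illustrative.

```python
import matplotlib as mpl

# Remember the current value so we can compare after the block.
original_alpha = mpl.rcParams['legend.framealpha']

# rc_context() applies the given settings only inside the with block.
with mpl.rc_context({'legend.shadow': True, 'legend.framealpha': 0.7}):
    shadow_inside = mpl.rcParams['legend.shadow']
    alpha_inside = mpl.rcParams['legend.framealpha']

# The previous value is restored automatically on exit.
alpha_after = mpl.rcParams['legend.framealpha']
```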

To demonstrate these options, let's use sample data made available by matplotlib. The imports are as follows:

import matplotlib.cbook as cbook
import pandas as pd
import matplotlib.pyplot as plt
from dautil import options
import matplotlib as mpl
from dautil import plotting
import seaborn as sns

The data is in a CSV file and contains stock price data for AAPL. Use the following commands to read the data and store it in a pandas DataFrame:

data = cbook.get_sample_data('aapl.csv', asfileobj=True)
df = pd.read_csv(data, parse_dates=True, index_col=0)

Resample the data to average monthly values as follows:

df = df.resample('M')
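The preceding call relies on the behavior of the pandas version used in this chapter, where resample() averages by default. In pandas 0.18 and later, resample() returns a deferred Resampler object, so the aggregation has to be requested explicitly. The following sketch shows this with hypothetical daily prices instead of the AAPL file. It uses the 'MS' (month start) rule as an assumption to keep the example independent of how a given pandas version spells the month-end alias, so the resulting labels are the first day of each month.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices standing in for the AAPL sample data.
index = pd.date_range('2015-01-01', '2015-02-28', freq='D')
prices = pd.DataFrame({'Close': np.arange(len(index), dtype=float)},
                      index=index)

# Request the monthly mean explicitly; 'MS' labels each bin by month start.
monthly = prices.resample('MS').mean()
print(monthly)
```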

The full code is in the configure_matplotlib.ipynb file in this book's code bundle:

import matplotlib.cbook as cbook
import pandas as pd
import matplotlib.pyplot as plt
from dautil import options
import matplotlib as mpl
from dautil import plotting
import seaborn as sns


data = cbook.get_sample_data('aapl.csv', asfileobj=True)
df = pd.read_csv(data, parse_dates=True, index_col=0)
df = df.resample('M')
close = df['Close'].values
dates = df.index.values
fig, axes = plt.subplots(4)


def plot(title, ax):
    ax.set_title(title)
    ax.set_xlabel('Date')

    plotter = plotting.CyclePlotter(ax)
    plotter.plot(dates, close, label='Close')
    plotter.plot(dates, 0.75 * close, label='0.75 * Close')
    plotter.plot(dates, 1.25 * close, label='1.25 * Close')

    ax.set_ylabel('Price ($)')
    ax.legend(loc='best')

plot('Initial', axes[0])
sns.reset_orig()
options.set_mpl_options()

plot('After setting options', axes[1])

sns.reset_defaults()
plot('After resetting options', axes[2])

with plt.style.context(('dark_background')):
    plot('With dark_background stylesheet', axes[3])
    fig.autofmt_xdate()
    plt.show()

The program plots the data together with arbitrary upper and lower bands four times: with the default options, with the custom options, after resetting the options, and with the dark_background stylesheet. I used the following helper class from the plotting.py file of dautil:

from itertools import cycle


class CyclePlotter():
    def __init__(self, ax):
        self.STYLES = cycle(["-", "--", "-.", ":"])
        self.LW = cycle([1, 2])
        self.ax = ax

    def plot(self, x, y, *args, **kwargs):
        self.ax.plot(x, y, next(self.STYLES),
                     lw=next(self.LW), *args, **kwargs)

The class cycles through different line styles and line widths. Refer to the following plot for the end result:

How it works...

Importing Seaborn dramatically changes the look and feel of matplotlib plots. Just temporarily comment the seaborn lines out to convince yourself. However, Seaborn doesn't seem to play nicely with the matplotlib options we set, unless we use the Seaborn functions reset_orig() and reset_defaults().

Seeding random number generators and NumPy print options

For reproducible data analysis, we should prefer deterministic algorithms. Some algorithms use random numbers, but in practice these are almost always pseudorandom numbers produced by a deterministic generator. The algorithms provided in numpy.random allow us to specify a seed value. For reproducibility, it is important to always provide a seed value, but this is easy to forget. The check_random_state() utility function in sklearn.utils provides a solution for this issue.
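To illustrate why a fixed seed matters, here is a minimal NumPy-only sketch. The draw_sample() function is hypothetical and not part of dautil or scikit-learn. It creates a generator that is local to each call, so the result depends only on the seed and not on any global state.

```python
import numpy as np


def draw_sample(size, seed=42):
    """Draw standard normal values from a generator local to this call."""
    rng = np.random.RandomState(seed)
    return rng.randn(size)


first = draw_sample(5)
second = draw_sample(5)
other = draw_sample(5, seed=7)

# The same seed gives the same numbers; a different seed does not.
print(np.array_equal(first, second))
print(np.array_equal(first, other))
```

Giving the seed parameter a default value is one way to make sure a seed is never forgotten.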

NumPy has a set_printoptions() function, which controls how NumPy prints arrays. Obviously, printing should not influence the quality of your analysis too much. However, readability is important if you want people to understand and reproduce your results.

Getting ready

Install NumPy using the instructions in the Learning to log for robust error checking recipe. We will need scikit-learn, so have a look at http://scikit-learn.org/dev/install.html (retrieved July 2015). I have installed scikit-learn 0.16.1 via Anaconda.

How to do it...

The code for this example is in the configure_numpy.py file in this book's code bundle:

from sklearn.utils import check_random_state
import numpy as np
from dautil import options
from dautil import log_api

random_state = check_random_state(42)
a = random_state.randn(5)

random_state = check_random_state(42)
b = random_state.randn(5)

np.testing.assert_array_equal(a, b)

printer = log_api.Printer()
printer.print("Default options", np.get_printoptions())

pi_array = np.pi * np.ones(30)
options.set_np_options()
print(pi_array)

# Reset
options.reset_np_options()
print(pi_array)

The calls to check_random_state() show how to get a NumPy RandomState object with 42 as the seed. In this example, the arrays a and b are equal, because we used the same seed and the same procedure to draw the numbers. The second part of the preceding program uses the following functions I defined in options.py:

def set_np_options():
    np.set_printoptions(precision=4, threshold=5,
                        linewidth=65)


def reset_np_options():
    np.set_printoptions(precision=8, threshold=1000,
                        linewidth=75)

Here's the output after setting the options:

[ 3.1416  3.1416  3.1416 ...,  3.1416  3.1416  3.1416]

As you can see, NumPy replaces some of the values with an ellipsis and shows only four digits after the decimal point. The NumPy defaults are as follows:

'Default options'
{'edgeitems': 3,
 'formatter': None,
 'infstr': 'inf',
 'linewidth': 75,
 'nanstr': 'nan',
 'precision': 8,
 'suppress': False,
 'threshold': 1000}
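The reset function shown earlier hard-codes the default values. As an alternative sketch, assuming NumPy 1.15 or later, the numpy.printoptions() context manager applies the options temporarily and restores whatever was in effect before, so the defaults do not have to be repeated in your own code:

```python
import numpy as np

pi_array = np.pi * np.ones(30)
before = np.get_printoptions()

# The options apply only inside the with block.
with np.printoptions(precision=4, threshold=5, linewidth=65):
    short_text = str(pi_array)

# On exit, the previous print options are in effect again.
after = np.get_printoptions()
print(short_text)
```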

Standardizing reports, code style, and data access

Following a code style guide helps improve code quality. Having high-quality code is important if you want people to easily reproduce your analysis. One way to adhere to a coding standard is to scan your code with static code analyzers. You can use many code analyzers. In this recipe, we will use the pep8 analyzer. In general, code analyzers complement or maybe slightly overlap each other, so you are not limited to pep8.

Convenient data access is crucial for reproducible analysis. In my opinion, the best type of data access is with a specialized API and local data. I will introduce a dautil module I created to load weather data provided by the Dutch KNMI.

Reporting is often the last phase of a data analysis project. We can report our findings using various formats. In this recipe, we will focus on tabulating our report with the tabulate module. The landslide tool creates slide shows from various formats such as reStructuredText.

Getting ready

You will need pep8 and tabulate. A quick guide to pep8 is available at https://pep8.readthedocs.org/en/latest/intro.html (retrieved July 2015). I have installed pep8 1.6.2 via Anaconda. You can install joblib, tabulate, and landslide with the pip command.

I have tabulate 0.7.5 and landslide 1.1.3.

How to do it...

Here's an example pep8 session:

$ pep8 --first log_api.py
log_api.py:21:1: E302 expected 2 blank lines, found 1
log_api.py:44:33: W291 trailing whitespace
log_api.py:50:60: E225 missing whitespace around operator

The --first switch shows only the first occurrence of each error. In the previous example, pep8 reports the file name, the line and column numbers where the error occurred, an error code, and a short description of the error.

I prepared a module dedicated to data access for the datasets we will use in several chapters. We start with access to a pandas DataFrame stored in a pickle, which contains selected weather data from the De Bilt weather station in the Netherlands. I created the pickle by downloading a zip file, extracting the data file, and loading the data in a pandas DataFrame table. I applied minimal data transformation, including multiplication of values and converting empty fields to NaNs. The code is in the data.py file in dautil. I will not discuss this code in detail, because we only need to load data from the pickle. However, if you want to download the data yourself, you can use the static method I defined in data.py. Downloading the data will of course give you more recent data, but you will get slightly different results if you substitute it for my pickle. The following code shows the basic descriptive statistics with the pandas.DataFrame.describe() method in the report_weather.py file in this book's code bundle:

from dautil import data
from dautil import report
import pandas as pd
import numpy as np
from tabulate import tabulate


df = data.Weather.load()
headers = [data.Weather.get_header(header) 
           for header in df.columns.values.tolist()]
df = df.describe()

Then, the code creates a slides.rst file in the reStructuredText format with dautil.report.RSTWriter. This is just a matter of simple string concatenation and writing to a file. The tabulate() calls in the following code create table grids from the pandas.DataFrame objects:

writer = report.RSTWriter()
writer.h1('Weather Statistics')
writer.add(tabulate(df, headers=headers, 
        tablefmt='grid', floatfmt='.2f'))
writer.divider()
headers = [data.Weather.get_header(header) 
           for header in df.columns.values.tolist()]
builder = report.DFBuilder(df.columns)
builder.row(df.iloc[7].values - df.iloc[3].values)
builder.row(df.iloc[6].values - df.iloc[4].values)
df = builder.build(['ptp', 'iqr'])

writer.h1('Peak-to-peak and Interquartile Range')

writer.add(tabulate(df, headers=headers, 
        tablefmt='grid', floatfmt='.2f'))
writer.write('slides.rst')
generator = report.Generator('slides.rst', 'weather_report.html')
generator.generate()
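The positional indices 7, 3, 6, and 4 in the preceding code refer to the max, min, 75%, and 25% rows of the describe() output. As a sketch of the same calculation with row labels instead of positions, the following snippet uses plain pandas and hypothetical measurements (the TEMP and RAIN columns are made up for illustration and are not the weather data):

```python
import pandas as pd

# Hypothetical measurements standing in for the weather data.
df = pd.DataFrame({'TEMP': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'RAIN': [0.0, 0.0, 2.0, 4.0, 10.0]})
stats = df.describe()

# Peak-to-peak is max minus min; the interquartile range is
# the third quartile minus the first quartile.
spread = pd.DataFrame([stats.loc['max'] - stats.loc['min'],
                       stats.loc['75%'] - stats.loc['25%']],
                      index=['ptp', 'iqr'])
print(spread)
```

Label-based access is a little more verbose, but it does not depend on the order of the rows that describe() returns.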

I use the dautil.report.DFBuilder class to create the pandas.DataFrame objects incrementally using a dictionary where the keys are the columns of the final DataFrame table and the values are lists that grow by one entry for each row added:

import pandas as pd

class DFBuilder():
    def __init__(self, cols, *args):
        self.columns = cols
        self.df = {}

        for col in self.columns:
            self.df.update({col: []})

        for arg in args:
            self.row(arg)

    def row(self, row):
        assert len(row) == len(self.columns)

        for col, val in zip(self.columns, row):
            self.df[col].append(val)

        return self.df

    def build(self, index=None):
        self.df = pd.DataFrame(self.df)

        if index:
            self.df.index = index

        return self.df

I eventually generate an HTML file using landslide and my own custom CSS. If you open weather_report.html, you will see the first slide with basic descriptive statistics:

The second slide looks like this and contains the peak-to-peak (difference between minimum and maximum values) and the interquartile range (difference between the third and first quartile):