Python Data Analysis Utilities

Ivan Idris

February 2016

After the success of the book Python Data Analysis, Packt's acquisition editor Prachi Bisht gauged the interest of the author, Ivan Idris, in publishing Python Data Analysis Cookbook. According to Ivan, Python Data Analysis is one of his best books. Python Data Analysis Cookbook is meant for a bit more experienced Pythonistas and is written in the cookbook format. In the year after the release of Python Data Analysis, Ivan has received a lot of feedback—mostly positive, as far as he is concerned.

Although Python Data Analysis covers a wide range of topics, Ivan still managed to leave out a lot of subjects. He realized that he needed a library as a toolbox. Named dautil for data analysis utilities, the API was distributed by him via PyPi so that it is installable via pip/easy_install. As you know, Python 2 will no longer be supported after 2020, so dautil is based on Python 3.

For the sake of reproducibility, Ivan also published a Docker repository named pydacbk (for Python Data Analysis Cookbook). The repository represents a virtual image with preinstalled software. For practical reasons, the image doesn't contain all the software, but it still contains a fair percentage. This article has the following sections:

  • Data analysis, data science, big data – what is the big deal?
  • A brief history of data analysis with Python
  • A high-level overview of dautil
  • IPython notebook utilities
  • Downloading data
  • Plotting utilities
  • Demystifying Docker
  • Future directions

(For more resources related to this topic, see here.)

Data analysis, data science, big data – what is the big deal?

You've probably seen Venn diagrams depicting data science as the intersection of mathematics/statistics, computer science, and domain expertise. Data analysis is timeless and was there before data science and computer science. You could perform data analysis with a pen and paper and, in more modern times, with a pocket calculator.

Data analysis has many aspects with goals such as making decisions or coming up with new hypotheses and questions. The hype, status, and financial rewards surrounding data science and big data remind me of the time when data warehousing and business intelligence were the buzzwords. The ultimate goal of business intelligence and data warehousing was to build dashboards for management. This involved a lot of politics and organizational aspects, but on the technical side, it was mostly about databases. Data science, on the other hand, is not database-centric, and leans heavily on machine learning. Machine learning techniques have become necessary because of the bigger volumes of data. Data growth is caused by the growth of the world's population and the rise of new technologies such as social media and mobile devices. Data growth is in fact probably the only trend that we can be sure will continue. The difference between constructing dashboards and applying machine learning is analogous to the way search engines evolved.

Search engines (if you can call them that) were initially nothing more than well-organized collections of links created manually. Eventually, the automated approach won. Since more data will be created in time (and not destroyed), we can expect an increase in automated data analysis.

A brief history of data analysis with Python

The history of the various Python software libraries is quite interesting. I am not a historian, so the following notes are written from my own perspective:

  • 1989: Guido van Rossum implements the very first version of Python at the CWI in the Netherlands as a Christmas hobby project.
  • 1995: Jim Hugunin creates Numeric, the predecessor to NumPy.
  • 1999: Pearu Peterson writes f2py as a bridge between Fortran and Python.
  • 2000: Python 2.0 is released.
  • 2001: The SciPy library is released. Also, Numarray, a competing library of Numeric, is created. Fernando Perez releases IPython, which starts out as an afternoon hack. NLTK is released as a research project.
  • 2002: John Hunter creates the matplotlib library.
  • 2005: NumPy is released by Travis Oliphant. Initially, NumPy is Numeric extended with features inspired by Numarray.
  • 2006: NumPy 1.0 is released. The first version of SQLAlchemy is released.
  • 2007: The scikit-learn project is initiated as a Google Summer of Code project by David Cournapeau. Cython is forked from Pyrex. Cython is later intensively used in pandas and scikit-learn to improve performance.
  • 2008: Wes McKinney starts working on pandas. Python 3.0 is released.
  • 2011: The IPython 0.12 release introduces the IPython notebook. Packt releases NumPy 1.5 Beginner's Guide.
  • 2012: Packt releases NumPy Cookbook.
  • 2013: Packt releases NumPy Beginner's Guide - Second Edition.
  • 2014: Fernando Perez announces Project Jupyter, which aims to make a language-agnostic notebook. Packt releases Learning NumPy Array and Python Data Analysis.
  • 2015: Packt releases NumPy Beginner's Guide - Third Edition and NumPy Cookbook - Second Edition.

A high-level overview of dautil

The dautil API that Ivan made for this book is a humble toolbox, which he found useful. It is released under the MIT license. This license is very permissive, so you could in theory use the library in a production system. He doesn't recommend doing this currently (as of January, 2016), but he believes that the unit tests and documentation are of acceptable quality. The library has 3000+ lines of code and 180+ unit tests with a reasonable coverage. He has fixed as many issues reported by pep8 and flake8 as possible.

Some of the functions in dautil are on the short side and are of very low complexity. This is on purpose. If there is a second edition (knock on wood), dautil will probably be completely transformed. The API evolved as Ivan wrote the book under high time pressure, so some of the decisions he made may not be optimal in retrospect. However, he hopes that people find dautil useful and, ideally, contribute to it.

The dautil modules are summarized in the following table:

Module

Description

LOC

dautil.collect

Contains utilities related to collections

331

dautil.conf

Contains configuration utilities

48

dautil.data

Contains utilities to download and load data

468

dautil.db

Contains database-related utilities

98

dautil.log_api

Contains logging utilities

204

dautil.nb

Contains IPython/Jupyter notebook widgets and utilities

609

dautil.options

Configures dynamic options of several libraries related to data analysis

71

dautil.perf

Contains performance-related utilities

162

dautil.plotting

Contains plotting utilities

382

dautil.report

Contains reporting utilities

232

dautil.stats

Contains statistical functions and utilities

366

dautil.ts

Contains Utilities for time series and dates

217

dautil.web

Contains utilities for web mining and HTML processing

47

IPython notebook utilities

The IPython notebook has become a standard tool for data analysis. The dautil.nb has several interactive IPython widgets to help with Latex rendering, the setting of matplotlib properties, and plotting. Ivan has defined a Context class, which represents the configuration settings of the widgets. The settings are stored in a pretty-printed JSON file in the current working directory, which is named dautil.json. This could be extended, maybe even with a database backend. The following is an edited excerpt (so that it doesn't take up a lot of space) of an example dautil.json:

{

   ...

   "calculating_moments": {

       "figure.figsize": [ 10.4, 7.7 ],

       "font.size": 11.2

   },

   "calculating_moments.latex": [ 1, 2, 3, 4, 5, 6, 7 ],

   "launching_futures": {

       "figure.figsize": [ 11.5, 8.5 ]

   },

   "launching_futures.labels": [ [ {}, {

               "legend": "loc=best",

               "title": "Distribution of Means"

           }

       ],

       [

           {

               "legend": "loc=best",

               "title": "Distribution of Standard Deviation"

           },

           {

                "legend": "loc=best",

               "title": "Distribution of Skewness"

           }

       ]

   ],

...

}

 The Context object can be constructed with a string—Ivan recommends using the name of the notebook, but any unique identifier will do. The dautil.nb.LatexRenderer also uses the Context class. It is a utility class, which helps you number and render Latex equations in an IPython/Jupyter notebook, for instance, as follows:

import dautil as dl

 

lr = dl.nb.LatexRenderer(chapter=12, context=context)

lr.render(r'\delta\! = x - m')

lr.render(r'm\' = m + \frac{\delta}{n}')

lr.render(r'M_2\' = M_2 + \delta^2 \frac{ n-1}{n}')

lr.render(r'M_3\' = M_3 + \delta^3 \frac{ (n - 1) (n - 2)}{n^2}/

- \frac{3\delta M_2}{n}')

lr.render(r'M_4\' = M_4 + \frac{\delta^4 (n - 1) /

(n^2 - 3n + 3)}{n^3} + \frac{6\delta^2 M_2}/

{n^2} - \frac{4\delta M_3}{n}')

lr.render(r'g_1 = \frac{\sqrt{n} M_3}{M_2^{3/2}}')

lr.render(r'g_2 = \frac{n M_4}{M_2^2}-3.')

The following is the result:

  Python Data Analysis Cookbook

Another widget you may find useful is RcWidget, which sets matplotlib settings, as shown in the following screenshot:

 Python Data Analysis Cookbook

Downloading data

Sometimes, we require sample data to test an algorithm or prototype a visualization. In the dautil.data module, you will find many utilities for data retrieval. Throughout this book, Ivan has used weather data from the KNMI for the weather station in De Bilt. A couple of the utilities in the module add a caching layer on top of existing pandas functions, such as the ones that download data from the World Bank and Yahoo! Finance (the caching depends on the joblib library and is currently not very configurable). You can also get audio, demographics, Facebook, and marketing data.

The data is stored under a special data directory, which depends on the operating system. On the machine used in the book, it is stored under ~/Library/Application Support/dautil. The following example code loads data from the SPAN Facebook dataset and computes the clique number:

import networkx as nx

import dautil as dl

 

 

fb_file = dl.data.SPANFB().load()

G = nx.read_edgelist(fb_file,

                     create_using=nx.Graph(),

                     nodetype=int)

 

print('Graph Clique Number',

     nx.graph_clique_number(G.subgraph(list(range(2048)))))

 To understand what is going on in detail, you will need to read the book. In a nutshell, we load the data and use the NetworkX API to calculate a network metric.

Plotting utilities

Ivan visualizes data very often in the book. Plotting helps us get an idea about how the data is structured and helps you form hypotheses or research questions. Often, we want to chart multiple variables, but we want to easily see what is what. The standard solution in matplotlib is to cycle colors. However, Ivan prefers to cycle line widths and line styles as well. The following unit test demonstrates his solution to this issue:

 def test_cycle_plotter_plot(self):

       m_ax = Mock()

       cp = plotting.CyclePlotter(m_ax)

       cp.plot([0], [0])

       m_ax.plot.assert_called_with([0], [0], '-', lw=1)

       cp.plot([0], [1])

       m_ax.plot.assert_called_with([0], [1], '--', lw=2)

       cp.plot([1], [0])

       m_ax.plot.assert_called_with([1], [0], '-.', lw=1)

The dautil.plotting module currently also has a helper tool for subplots, histograms, regression plots, and dealing with color maps. The following example code (the code for the labels has been omitted) demonstrates a bar chart utility function and a utility function from dautil.data, which downloads stock price data:

import dautil as dl

import numpy as np

import matplotlib.pyplot as plt

 

 

ratios = []

STOCKS = ['AAPL', 'INTC', 'MSFT', 'KO', 'DIS', 'MCD', 'NKE', 'IBM']

 

for symbol in STOCKS:

   ohlc = dl.data.OHLC()

   P = ohlc.get(symbol)['Adj Close'].values

   N = len(P)

   mu = (np.log(P[-1]) - np.log(P[0]))/N

   var_a = 0

   var_b = 0

 

   for k in range(1, N):

       var_a = (np.log(P[k]) - np.log(P[k - 1]) - mu) ** 2

       var_a = var_a / N

 

   for k in range(1, N//2):

       var_b = (np.log(P[2 * k]) - np.log(P[2 * k - 2]) - 2 * mu) ** 2

       var_b = var_b / N

 

   ratios.append(var_b/var_a - 1)

 

_, ax = plt.subplots()

dl.plotting.bar(ax, STOCKS, ratios)

plt.show()

Refer to the following screenshot for the end result:

 Python Data Analysis Cookbook

The code performs a random walk test and calculates the corresponding ratio for a list of stock prices. The data is retrieved whenever you run the code, so you may get different results. Some of you have a finance aversion, but rest assured that this book has very little finance-related content.

The following script demonstrates a linear regression utility and caching downloader for World Bank data (the code for the watermark and plot labels has been omitted):

import dautil as dl

import matplotlib.pyplot as plt

import numpy as np

 

 

wb = dl.data.Worldbank()

countries = wb.get_countries()[['name', 'iso2c']]

inf_mort = wb.get_name('inf_mort')

gdp_pcap = wb.get_name('gdp_pcap')

df = wb.download(country=countries['iso2c'],

                 indicator=[inf_mort, gdp_pcap],

                start=2010, end=2010).dropna()

loglog = df.applymap(np.log10)

x = loglog[gdp_pcap]

y = loglog[inf_mort]

 

dl.options.mimic_seaborn()

fig, [ax, ax2] = plt.subplots(2, 1)

ax.set_ylim([0, 200])

ax.scatter(df[gdp_pcap], df[inf_mort])

ax2.scatter(x, y)

dl.plotting.plot_polyfit(ax2, x, y)

plt.show()

 The following image should be displayed by the code:

 Python Data Analysis Cookbook

The program downloads World Bank data for 2010 and plots the infant mortality rate against the GDP per capita. Also shown is a linear fit of the log-transformed data.

Demystifying Docker

Docker uses Linux kernel features to provide an extra virtualization layer. It was created in 2013 by Solomon Hykes. Boot2Docker allows us to install Docker on Windows and Mac OS X as well. Boot2Docker uses a VirtualBox VM that contains a Linux environment with Docker. Ivan's Docker image, which is mentioned in the introduction, is based on the continuumio/miniconda3 Docker image. The Docker installation docs are at https://docs.docker.com/index.html.

Once you install Boot2Docker, you need to initialize it. This is only necessary once, and Linux users don't need this step:

$ boot2docker init

The next step for Mac OS X and Windows users is to start the VM:

$ boot2docker start

Check the Docker environment by starting a sample container:

$ docker run hello-world

Docker images are organized in a repository, which resembles GitHub. A producer pushes images and a consumer pulls images. You can pull Ivan's repository with the following command. The size is currently 387 MB.

$ docker pull ivanidris/pydacbk

Future directions

The dautil API consists of items Ivan thinks will be useful outside of the context of this book. Certain functions and classes that he felt were only suitable for a particular chapter are placed in separate per-chapter modules, such as ch12util.py. In retrospect, parts of those modules may need to be included in dautil as well.

In no particular order, Ivan has the following ideas for future dautil development:

  • He is playing with the idea of creating a parallel library with "Cythonized" code, but this depends on how dautil is received
  • Adding more data loaders as required
  • There is a whole range of streaming (or online) algorithms that he thinks should be included in dautil as well
  • The GUI of the notebook widgets should be improved and extended
  • The API should have more configuration options and be easier to configure

Summary

In this article, Ivan roughly sketched what data analysis, data science, and big data are about. This was followed by a brief of history of data analysis with Python. Then, he started explaining dautil—the API he made to help him with this book. He gave a high-level overview and some examples of the IPython notebook utilities, features to download data, and plotting utilities.

He used Docker for testing and giving readers a reproducible data analysis environment, so he spent some time on that topic too. Finally, he mentioned the possible future directions that could be taken for the library in order to guide anyone who wants to contribute.

Resources for Article:

 


Further resources on this subject:


You've been reading an excerpt of:

Python Data Analysis Cookbook

Explore Title