Home

Data

Python Data Analysis - Second Edition

By Ivan Idris

Book

eBook $43.99 $29.99

Print $54.99

Subscription $15.99 $10 p/m for three months

BUY NOW

$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

eBook $43.99 $29.99

Print $54.99

Subscription $15.99 $10 p/m for three months

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

About this book

Data analysis techniques generate useful insights from small and large volumes of data. Python, with its strong set of libraries, has become a popular platform to conduct various data analysis and predictive modeling tasks. With this book, you will learn how to process and manipulate data with Python for complex analysis and modeling. We learn data manipulations such as aggregating, concatenating, appending, cleaning, and handling missing values, with NumPy and Pandas. The book covers how to store and retrieve data from various data sources such as SQL and NoSQL, CSV fies, and HDF5. We learn how to visualize data using visualization libraries, along with advanced topics such as signal processing, time series, textual data analysis, machine learning, and social media analysis. The book covers a plethora of Python modules, such as matplotlib, statsmodels, scikit-learn, and NLTK. It also covers using Python with external environments such as R, Fortran, C/C++, and Boost libraries.

Publication date:: March 2017
Publisher: Packt
Pages: 330
ISBN: 9781787127487
Download code from GitHub

Chapter 1. Getting Started with Python Libraries

Welcome! Let's get started. Python has become one of the de facto standard language and platform for data analysis and data science. The mind map that you will see shortly depicts some of the numerous libraries available in the Python ecosystem that are used by data analysts and data scientists. NumPy, SciPy, Pandas, and Matplotlib libraries lay the foundation of Python data analysis and are now part of SciPy Stack 1.0 (http://www.scipy.org/stackspec.html). We will learn how to install SciPy Stack 1.0 and Jupyter Notebook, and write some simple data analysis code as a warm-up exercise.

The following are the libraries available in the Python ecosystem that are used by data analysts and data scientists:

NumPy: This is a general-purpose library that provides numerical arrays, and functions to manipulate the arrays efficiently.
SciPy: This is a scientific computing library that provides science and engineering related functions. SciPy supplements and slightly overlaps NumPy. NumPy and SciPy historically shared their code base but were later separated.
Pandas: This is a data-manipulation library that provides data structures and operations for manipulating tables and time series data.
Matplotlib: This is a 2D plotting library that provides support for producing plots, graphs, and figures. Matplotlib is used by SciPy and supports NumPy.
IPython: This provides a powerful interactive shell for Python, kernel for Jupyter, and support for interactive data visualization. We will cover the IPython shell later in this chapter.
Jupyter Notebook: This provides a web-based interactive shell for creating and sharing documents with live code and visualizations. Jupyter Notebook supports multiple versions of Python through the kernel provided by IPython. We will cover the Jupyter Notebook later in this chapter.

Installation instructions for the other required software will be given throughout the book at the appropriate time. At the end of this chapter, you will find pointers on how to find additional information online if you get stuck or are uncertain about the best way of solving problems:

In this chapter, we will cover the following topics:

Installing Python 3
Using IPython as a shell
Reading manual pages
Jupyter Notebook
NumPy arrays
A simple application
Where to find help and references
Listing modules inside the Python libraries
Visualizing data using matplotlib

Installing Python 3

The software used in this book is based on Python 3, so you need to have Python 3 installed. On some operating systems, Python 3 is already installed. There are many implementations of Python, including commercial implementations and distributions. In this book, we will focus on the standard Python implementation, which is guaranteed to be compatible with NumPy.

Note

You can download Python 3.5.x from https://www.python.org/downloads/. On this web page, you can find installers for Windows and Mac OS X, as well as source archives for Linux, Unix, and Mac OS X. You can find instructions for installing and using Python for various operating systems at https://docs.python.org/3/using/index.html.

The software we will install in this chapter has binary installers for Windows, various Linux distributions, and Mac OS X. There are also source distributions, if you prefer. You need to have Python 3.5.x or above installed on your system. The sunset date for Python 2.7 was moved from 2015 to 2020, thus Python 2.7 will be supported and maintained until 2020. For these reasons, we have updated this book for Python 3.

Installing data analysis libraries

We will learn how to install and set up NumPy, SciPy, Pandas, Matplotlib, IPython, and Jupyter Notebook on Windows, Linux, and Mac OS X. Let's look at the process in detail. We shall use pip3 to install the libraries. From version 3.4 onwards, pip3 has been included by default with the Python installation.

On Linux or Mac OS X

To install the foundational libraries, run the following command line instruction:

$ pip3 install numpy scipy pandas matplotlib jupyter notebook

It may be necessary to prepend sudo to this command if your current user doesn't have sufficient rights on your system.

On Windows

At the time of writing this book, we had the following software installed as a prerequisite on our Windows 10 virtual machine:

Python 3.6 from https://www.python.org/ftp/python/3.6.0/python-3.6.0-amd64.exe
Microsoft Visual C++ Build Tools 2015 from http://landinghub.visualstudio.com/visual-cpp-build-tools

Download and install the appropriate prebuilt NumPy and Scipy binaries for your Windows platform from http://www.lfd.uci.edu/~gohlke/pythonlibs/:

We downloaded numpy-1.12.0+mkl-cp36-cp36m-win_amd64.whl and scipy-0.18.1-cp36-cp36m-win_amd64.whl
After downloading, we executed the pip3 install Downloads\numpy-1.12.0+mkl-cp36-cp36m-win_amd64.whl and pip3 install Downloads\scipy-0.18.1-cp36-cp36m-win_amd64.whl commands

After these prerequisites are installed, to install the rest of the foundational libraries, run the following command line instruction:

$ pip3 install pandas matplotlib jupyter

Tip

Installing Jupyter using these commands, installs all the required packages, such as Notebook and IPython.

Using IPython as a shell

Data analysts, data scientists, and engineers are used to experimenting. IPython was created by scientists with experimentation in mind. The interactive environment that IPython provides is comparable to an interactive computing environment provided by Matlab, Mathematica, and Maple.

The following is a list of features of the IPython shell:

Tab completion, which helps you find a command
History mechanism
Inline editing
Ability to call external Python scripts with %run
Access to system commands
Access to the Python debugger and profiler

The following list describes how to use the IPython shell:

Starting a session: To start a session with IPython,enter the following instruction on the command line:

$ ipython3
Python 3.5.2 (default, Sep 28 2016, 18:08:09) 
Type "copyright", "credits" or "license" for more information.
        IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra 
                     details.
In [1]: quit()

Tip

The quit() function or Ctrl + D quits the IPython shell.

Saving a session: We might want to be able to go back to our experiments. In IPython, it is easy to save a session for later use with the following command:

In [1]: %logstart
Activating auto-logging. Current session state plus future 
         input saved:
         Filename : ipython_log.py
         Mode : rotate
         Output logging : False
         Raw input log : False
         Timestamping : False
State : active

Logging can be switched off as follows:

In [9]: %logoff
Switching logging OFF

Executing a system shell command: Execute a system shell command in the default IPython profile by prefixing the command with the ! symbol. For instance, the following input will get the current date:
```
In [1]: !date
```
In fact, any line prefixed with ! is sent to the system shell. We can also store the command output, as shown here:
```
In [2]: thedate = !date
In [3]: thedate
```
Displaying history: We can show the history of our commands with the %hist command. For example:
```
In [1]: a = 2 + 2
In [2]: a
Out[2]: 4
In [3]: %hist
a = 2 + 2
a
%hist
```
This is a common feature in command line interface (CLI) environments. We can also search through the history with the -g switch as follows:
```
In [5]: %hist -g a = 2
      1: a = 2 + 2
```

We saw a number of so-called magic functions in action. These functions start with the % character. If the magic function is used on a line by itself, the % prefix is optional.

Reading manual pages

When the libraries are imported in IPython, we can open manual pages for library functions with the help command. It is not necessary to know the name of a function. We can type a few characters and then let the tab completion do its work. Let's, for instance, browse the available information for the arange() function.

We can browse the available information in either of the following two ways:

Calling the help function: Type in help( followed by a few characters of the function and press the Tab key. A list of functions will appear. Select the function from the list using the arrow keys and press the Enter key. Close the help function call with ) and press the Enter key.
Querying with a question mark: Another option is to append a question mark to the function name. You will then, of course, need to know the function name, but you don't have to type help, for example:
```
In [3]: numpy.arange?
```
Tab completion is dependent on readline, so you need to make sure that it is installed. It can be installed with pip by typing the following command:
```
$ pip3 install readline
```
The question mark gives you information from docstrings.

Jupyter Notebook

Jupyter Notebook, previously known as IPython Notebooks, provides a tool to create and share web pages with text, charts, and Python code in a special format. Have a look at these notebook collections at the following links:

Often, the notebooks are used as an educational tool, or to demonstrate Python software. We can import or export notebooks either from plain Python code or from the special notebook format. The notebooks can be run locally, or we can make them available online by running a dedicated notebook server. Certain cloud computing solutions, such as Wakari and PiCloud, allow you to run notebooks in the cloud. Cloud computing is one of the topics of Chapter 11, Environments Outside the Python Ecosystem and Cloud Computing.

To start a session with Jupyter Notebook,enter the following instruction on the command line:

$ jupyter-notebook

This will start the notebook server and open a web page showing the contents of the folder from which the command will execute. You can then select New | Python 3 to start a new notebook in Python 3.

You can also open ch-01.ipynb, provided in the code package for this book. The ch-01 notebook file has the code for the simple applications that we will describe shortly.

NumPy arrays

After going through the installation of NumPy, it's time to have a look at NumPy arrays. NumPy arrays are more efficient than Python lists when it comes to numerical operations. NumPy arrays are, in fact, specialized objects with extensive optimizations. NumPy code requires less explicit loops than equivalent Python code. This is based on vectorization.

If we go back to high school mathematics, then we should remember the concepts of scalars and vectors. The number 2, for instance, is a scalar. When we add 2 to 2, we are performing scalar addition. We can form a vector out of a group of scalars. In Python programming terms, we will then have a one-dimensional array. This concept can, of course, be extended to higher dimensions. Performing an operation on two arrays, such as addition, can be reduced to a group of scalar operations. In straight Python, we will do that with loops going through each element in the first array and adding it to the corresponding element in the second array. However, this is more verbose than the way it is done in mathematics. In mathematics, we treat the addition of two vectors as a single operation. That's the way NumPy arrays do it too, and there are certain optimizations using low-level C routines that make these basic operations more efficient. We will cover NumPy arrays in more detail in the Chapter 2, NumPy Arrays.

A simple application

Imagine that we want to add two vectors called a and b. The word vector is used here in the mathematical sense, which means a one-dimensional array. We will learn about specialized NumPy arrays that represent matrices in Chapter 4, Statistics and Linear Algebra. The vector a holds the squares of integers 0 to n; for instance, if n is equal to 3, a contains 0, 1, or 4. The vector b holds the cubes of integers 0 to n, so if n is equal to 3, then the vector b is equal to 0, 1, or 8. How would you do that using plain Python? After we come up with a solution, we will compare it to the NumPy equivalent.

The following function solves the vector addition problem using pure Python without NumPy:

def pythonsum(n): 
   a = list(range(n)) 
   b = list(range(n)) 
   c = [] 
 
   for i in range(len(a)): 
       a[i] = i ** 2 
       b[i] = i ** 3 
       c.append(a[i] + b[i]) 
 
   return c

The following is a function that solves the vector addition problem with NumPy:

def numpysum(n): 
  a = numpy.arange(n) ** 2 
  b = numpy.arange(n) ** 3 
  c = a + b 
  return c

Note that numpysum() does not need a for loop. We also used the arange() function from NumPy, which creates a NumPy array for us with integers from 0 to n. The arange() function was imported; that is why it is prefixed with numpy.

Now comes the fun part. We mentioned earlier that NumPy is faster when it comes to array operations. How much faster is Numpy, though? The following program will show us by measuring the elapsed time in microseconds for the numpysum() and pythonsum() functions. It also prints the last two elements of the vector sum. Let's check that we get the same answers using Python and NumPy:

#!/usr/bin/env/python 
 
import sys 
from datetime import datetime 
import numpy as np 
 
""" 
This program demonstrates vector addition the Python way. 
Run the following from the command line: 
 
  python vectorsum.py n 
 
Here, n is an integer that specifies the size of the vectors. 
 
The first vector to be added contains the squares of 0 up to n. 
The second vector contains the cubes of 0 up to n. 
The program prints the last 2 elements of the sum and the elapsed  time: 
""" 
 
def numpysum(n): 
   a = np.arange(n) ** 2 
   b = np.arange(n) ** 3 
   c = a + b 
 
   return c 
 
def pythonsum(n): 
   a = list(range(n)) 
   b = list(range(n)) 
   c = [] 
 
   for i in range(len(a)): 
       a[i] = i ** 2 
       b[i] = i ** 3 
       c.append(a[i] + b[i]) 
 
   return c 
 
size = int(sys.argv[1]) 
 
start = datetime.now() 
c = pythonsum(size) 
delta = datetime.now() - start 
print("The last 2 elements of the sum", c[-2:]) 
print("PythonSum elapsed time in microseconds", delta.microseconds) 
 
start = datetime.now() 
c = numpysum(size) 
delta = datetime.now() - start 
print("The last 2 elements of the sum", c[-2:]) 
print("NumPySum elapsed time in microseconds", delta.microseconds)

The output of the program for 1000, 2000, and 3000 vector elements is as follows:

$ python3 vectorsum.py 1000
The last 2 elements of the sum [995007996, 998001000]
PythonSum elapsed time in microseconds 976
The last 2 elements of the sum [995007996 998001000]
NumPySum elapsed time in microseconds 87
$ python3 vectorsum.py 2000
The last 2 elements of the sum [7980015996, 7992002000]
PythonSum elapsed time in microseconds 1623
The last 2 elements of the sum [7980015996 7992002000]
NumPySum elapsed time in microseconds 143
$ python3 vectorsum.py 4000
The last 2 elements of the sum [63920031996, 63968004000]
PythonSum elapsed time in microseconds 3417
The last 2 elements of the sum [63920031996 63968004000]
NumPySum elapsed time in microseconds 237

Clearly, NumPy is much faster than the equivalent normal Python code. One thing is certain; we get the same results whether we are using NumPy or not. However, the result that is printed differs in representation. Note that the result from the numpysum() function does not have any commas. How come? Obviously, we are not dealing with a Python list, but with a NumPy array. We will learn more about NumPy arrays in the Chapter 2, NumPy Arrays.

Where to find help and references

The following table lists documentation websites for the Python data analysis libraries we discussed in this chapter.

Packages	Description
NumPy and SciPy	The main documentation website for NumPy and SciPy is at http://docs.scipy.org/doc/. Through this web page, you can browse NumPy and SciPy user guides and reference guides, as well as several tutorials.
Pandas	http://pandas.pydata.org/pandas-docs/stable/.
Matplotlib	http://matplotlib.org/contents.html.
IPython	http://ipython.readthedocs.io/en/stable/.
Jupyter Notebook	http://jupyter-notebook.readthedocs.io/en/latest/.

The popular Stack Overflow software development forum has hundreds of questions tagged NumPy, SciPy, Pandas, Matplotlib, IPython, and Jupyter Notebook. To view them, go to http://stackoverflow.com/questions/tagged/<your-tag-word-here>.

If you are really stuck with a problem, or you want to be kept informed of the development of these libraries, you can subscribe to their respective discussion mailing list(s). The number of e-mails per day varies from list to list. Developers actively involved with the development of these libraries answer some of the questions asked on the mailing lists.

For IRC users, there is an IRC channel on irc://irc.freenode.net. The channel is called #scipy, but you can also ask NumPy questions since SciPy users also have knowledge of NumPy, as SciPy is based on NumPy. There are at least 50 members on the SciPy channel at all times.

Listing modules inside the Python libraries

The ch-01.ipynb file contains the code for looking at the modules inside the NumPy, SciPy, Pandas, and Matplotlib libraries. Don't worry about understanding the code just trying to run it for now. You can modify this code to look at the modules inside other libraries as well.

Visualizing data using Matplotlib

We shall learn about visualizing the data in a later chapter. For now, let's try loading two sample datasets and building a basic plot. First, install the sklearn library from which we shall load the data using the following command:

$ pip3 install scikit-learn

Import the datasets using the following command:

from sklearn.datasets import load_iris 
from sklearn.datasets import load_boston

Import the Matplotlib plotting module:

from matplotlib import pyplot as plt 
%matplotlib inline

Load the iris dataset, print the description of the dataset, and plot column 1 (sepal length) as x and column 2 (sepal width) as y:

iris = load_iris() 
print(iris.DESCR) 
data=iris.data 
plt.plot(data[:,0],data[:,1],".")

The resulting plot will look like the following image:

Load the boston dataset, print the description of the dataset and plot column 3 (proportion of non-retail business) as x and column 5 (nitric oxide concentration) as y, each point on the plot marked with a + sign:

boston = load_boston()
print(boston.DESCR)
data=boston.data
plt.plot(data[:,2],data[:,4],"+")

The resulting plot will look like the following image:

Summary

In this chapter, we installed NumPy, SciPy, Pandas, Matplotlib, IPython, and Jupyter Notebook, all of which we will be using in this book. We got a vector addition program working, and learned how NumPy offers superior performance. In addition, we explored the available documentation and online resources. We executed code to find the modules inside the libraries and loaded some sample datasets to draw some basic plots using Matplotlib.

In the next chapter, Chapter 2, NumPy Arrays, we will take a look under the hood of NumPy and explore some fundamental concepts, including arrays and data types.

About the Author

Ivan Idris

Ivan Idris has an MSc in experimental physics. His graduation thesis had a strong emphasis on applied computer science. After graduating, he worked for several companies as a Java developer, data warehouse developer, and QA analyst. His main professional interests are business intelligence, big data, and cloud computing. Ivan Idris enjoys writing clean, testable code and interesting technical articles. Ivan Idris is the author of NumPy 1.5. Beginner's Guide and NumPy Cookbook by Packt Publishing.
Browse publications by this author

Nice. I’m reading this book for my job. It helps me a lot. Thanks.

Clear and concise - only just started it though.

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa