Packt+ | Advance your knowledge in tech

You're reading from Getting Started with Python Data Analysis

Product type Book

Published in Nov 2015

Publisher

ISBN-13 9781785285110

Pages 188 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Table of Contents (15) Chapters

Getting Started with Python Data Analysis

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

1. Introducing Data Analysis and Libraries

2. NumPy Arrays and Vectorized Computation

3. Data Analysis with Pandas

4. Data Visualization

5. Time Series

6. Interacting with Databases

7. Data Analysis Application Examples

8. Machine Learning Models with scikit-learn

Index

Chapter 3. Data Analysis with Pandas

In this chapter, we will explore another data analysis library called Pandas. The goal of this chapter is to give you some basic knowledge and concrete examples for getting started with Pandas.

An overview of the Pandas package

Pandas is a Python package that supports fast, flexible, and expressive data structures, as well as computing functions for data analysis. The following are some prominent features that Pandas supports:

Data structure with labeled axes. This makes the program clean and clear and avoids common errors from misaligned data.
Flexible handling of missing data.
Intelligent label-based slicing, fancy indexing, and subset creation of large datasets.
Powerful arithmetic operations and statistical computations on a custom axis via axis label.
Robust input and output support for loading or saving data from and to files, databases, or HDF5 format.

Related to Pandas installation, we recommend an easy way, that is to install it as a part of Anaconda, a cross-platform distribution for data analysis and scientific computing. You can refer to the reference at http://docs.continuum.io/anaconda/ to download and install the library.

After installation, we can use it like other Python...

The Pandas data structure

Let's first get acquainted with two of Pandas' primary data structures: the Series and the DataFrame. They can handle the majority of use cases in finance, statistic, social science, and many areas of engineering.

Series

A Series is a one-dimensional object similar to an array, list, or column in table. Each item in a Series is assigned to an entry in an index:

>>> s1 = pd.Series(np.random.rand(4),
                   index=['a', 'b', 'c', 'd'])
>>> s1
a    0.6122
b    0.98096
c    0.3350
d    0.7221
dtype: float64

By default, if no index is passed, it will be created to have values ranging from 0 to N-1, where N is the length of the Series:

>>> s2 = pd.Series(np.random.rand(4))
>>> s2
0    0.6913
1    0.8487
2    0.8627
3    0.7286
dtype: float64

We can access the value of a Series by using the index:

>>> s1['c']
0.3350
>>>s1['c'] = 3.14
>>> s1['c', 'a', 'b']
c    3.14
a    0.6122
b    0.98096

This accessing...

The essential basic functionality

Pandas supports many essential functionalities that are useful to manipulate Pandas data structures. In this book, we will focus on the most important features regarding exploration and analysis.

Reindexing and altering labels

Reindex is a critical method in the Pandas data structures. It confirms whether the new or modified data satisfies a given set of labels along a particular axis of Pandas object.

First, let's view a reindex example on a Series object:

>>> s2.reindex([0, 2, 'b', 3])
0    0.6913
2    0.8627
b    NaN
3    0.7286
dtype: float64

When reindexed labels do not exist in the data object, a default value of NaN will be automatically assigned to the position; this holds true for the DataFrame case as well:

>>> df1.reindex(index=[0, 2, 'b', 3],
        columns=['Density', 'Year', 'Median_Age','C'])
   Density  Year  Median_Age        C
0      244  2000        24.2      NaN
2      268  2010        28.5      NaN
b      NaN   NaN  ...

Indexing and selecting data

In this section, we will focus on how to get, set, or slice subsets of Pandas data structure objects. As we learned in previous sections, Series or DataFrame objects have axis labeling information. This information can be used to identify items that we want to select or assign a new value to in the object:

>>> s4[['024', '002']]    # selecting data of Series object
024     NaN
002    Mary
dtype: object
>>> s4[['024', '002']] = 'unknown' # assigning data
>>> s4
024    unknown
065        NaN
002    unknown
001        Nam
dtype: object

If the data object is a DataFrame structure, we can also proceed in a similar way:

>>> df5[['b', 'c']]
   b  c
0  1  2
1  4  5
2  7  8

For label indexing on the rows of DataFrame, we use the ix function that enables us to select a set of rows and columns in the object. There are two parameters that we need to specify: the row and column labels that we want to get. By default, if we do not specify...

Computational tools

Let's start with correlation and covariance computation between two data objects. Both the Series and DataFrame have a cov method. On a DataFrame object, this method will compute the covariance between the Series inside the object:

>>> s1 = pd.Series(np.random.rand(3))
>>> s1
0    0.460324
1    0.993279
2    0.032957
dtype: float64
>>> s2 = pd.Series(np.random.rand(3))
>>> s2
0    0.777509
1    0.573716
2    0.664212
dtype: float64
>>> s1.cov(s2)
-0.024516360159045424

>>> df8 = pd.DataFrame(np.random.rand(12).reshape(4,3),  
                       columns=['a','b','c'])
>>> df8
          a         b         c
0  0.200049  0.070034  0.978615
1  0.293063  0.609812  0.788773
2  0.853431  0.243656  0.978057
0.985584  0.500765  0.481180
>>> df8.cov()
          a         b         c
a  0.155307  0.021273 -0.048449
b  0.021273  0.059925 -0.040029
c -0.048449 -0.040029  0.055067

Usage of the correlation...

Working with missing data

In this section, we will discuss missing, NaN, or null values, in Pandas data structures. It is a very common situation to arrive with missing data in an object. One such case that creates missing data is reindexing:

>>> df8 = pd.DataFrame(np.arange(12).reshape(4,3),  
                       columns=['a', 'b', 'c'])
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
>>> df9 = df8.reindex(columns = ['a', 'b', 'c', 'd'])
   a   b   c   d
0  0   1   2 NaN
1  3   4   5 NaN
2  6   7   8 NaN
4  9  10  11 NaN
>>> df10 = df8.reindex([3, 2, 'a', 0])
    a   b   c
3   9  10  11
2   6   7   8
a NaN NaN NaN
0   0   1   2

To manipulate missing values, we can use the isnull() or notnull() functions to detect the missing values in a Series object, as well as in a DataFrame object:

>>> df10.isnull()
       a      b      c
3  False  False  False
2  False  False  False
a   True   True   True
0  False  False  False

On a Series, we...

Advanced uses of Pandas for data analysis

In this section we will consider some advanced Pandas use cases.

Hierarchical indexing

Hierarchical indexing provides us with a way to work with higher dimensional data in a lower dimension by structuring the data object into multiple index levels on an axis:

>>> s8 = pd.Series(np.random.rand(8), index=[['a','a','b','b','c','c', 'd','d'], [0, 1, 0, 1, 0,1, 0, 1, ]])
>>> s8
a  0    0.721652
   1    0.297784
b  0    0.271995
   1    0.125342
c  0    0.444074
   1    0.948363
d  0    0.197565
   1    0.883776
dtype: float64

In the preceding example, we have a Series object that has two index levels. The object can be rearranged into a DataFrame using the unstack function. In an inverse situation, the stack function can be used:

>>> s8.unstack()
          0         1
a  0.549211  0.420874
b  0.051516  0.715021
c  0.503072  0.720772
d  0.373037  0.207026

We can also create a DataFrame to have a hierarchical index in both axes...

Summary

We have finished covering the basics of the Pandas data analysis library. Whenever you learn about a library for data analysis, you need to consider the three parts that we explained in this chapter. Data structures: we have two common data object types in the Pandas library; Series and DataFrames. Method to access and manipulate data objects: Pandas supports many way to select, set or slice subsets of data object. However, the general mechanism is using index labels or the positions of items to identify values. Functions and utilities: They are the most important part of a powerful library. In this chapter, we covered all common supported functions of Pandas which allow us compute statistics on data easily. The library also has a lot of other useful functions and utilities that we could not explain in this chapter. We encourage you to start your own research, if you want to expand your experience with Pandas. It helps us to process large data in an optimized way. You will see more...