Pandas is a Python package that supports fast, flexible, and expressive data structures, as well as computing functions for data analysis. The following are some prominent features that Pandas supports:
Data structure with labeled axes. This makes the program clean and clear and avoids common errors from misaligned data.
Flexible handling of missing data.
Intelligent label-based slicing, fancy indexing, and subset creation of large datasets.
Powerful arithmetic operations and statistical computations on a custom axis via axis label.
Robust input and output support for loading or saving data from and to files, databases, or HDF5 format.
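As a rough illustration of the first two features, the sketch below adds two Series whose labels only partly overlap. The fruit labels and values are made up for this example; only the alignment behavior is the point.

```python
import numpy as np
import pandas as pd

# Two Series whose labels only partly overlap (illustrative data).
q1 = pd.Series([10.0, 20.0, 30.0], index=["apples", "pears", "plums"])
q2 = pd.Series([1.0, 2.0, 3.0], index=["pears", "plums", "figs"])

# Arithmetic aligns on labels, not positions. Labels present on only
# one side yield NaN instead of silently pairing the wrong items.
total = q1 + q2

# Missing data can then be handled explicitly, for example by treating
# an absent label as zero during the addition.
total_filled = q1.add(q2, fill_value=0)

print(total)
print(total_filled)
```

Here 'apples' and 'figs' come out as NaN in total because each exists in only one operand, while total_filled keeps their single known value.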
For installation, we recommend the easy route of installing Pandas as part of Anaconda, a cross-platform distribution for data analysis and scientific computing. You can refer to http://docs.continuum.io/anaconda/ to download and install the distribution.
After installation, we can use it like other Python...
Let's first get acquainted with two of Pandas' primary data structures: the Series and the DataFrame. They can handle the majority of use cases in finance, statistics, social science, and many areas of engineering.
A Series is a one-dimensional object similar to an array, a list, or a column in a table. Each item in a Series is assigned to an entry in an index:
>>> import numpy as np
>>> import pandas as pd
>>> s1 = pd.Series(np.random.rand(4), index=['a', 'b', 'c', 'd'])
>>> s1
a    0.6122
b    0.98096
c    0.3350
d    0.7221
dtype: float64
By default, if no index is passed, it will be created to have values ranging from 0 to N-1, where N is the length of the Series:
>>> s2 = pd.Series(np.random.rand(4))
>>> s2
0    0.6913
1    0.8487
2    0.8627
3    0.7286
dtype: float64
We can access the value of a Series by using the index:
>>> s1['c']
0.3350
>>> s1['c'] = 3.14
>>> s1[['c', 'a', 'b']]
c    3.14
a    0.6122
b    0.98096
dtype: float64
This accessing...
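The following is a minimal sketch of the same access patterns written with the explicit loc (label-based) and iloc (position-based) indexers. The values are fixed, made-up numbers, so the results are reproducible.

```python
import pandas as pd

# Illustrative Series with string labels.
s = pd.Series([0.61, 0.98, 0.33, 0.72], index=["a", "b", "c", "d"])

by_label = s.loc["c"]             # label-based lookup
by_position = s.iloc[2]           # position-based lookup of the same item
subset = s.loc[["c", "a", "b"]]   # a list of labels returns a Series in that order

s.loc["c"] = 3.14                 # assignment through a label

print(by_label, by_position)
print(subset)
print(s)
```

Note that selecting several labels requires a list inside the brackets; a bare comma-separated sequence is interpreted as a single tuple key.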
Pandas supports many essential functions for manipulating its data structures. In this book, we will focus on the most important features for data exploration and analysis.
Reindex is a critical method in the Pandas data structures. It conforms the data to a given set of labels along a particular axis of a Pandas object, keeping the values whose labels already exist and marking the rest as missing.
First, let's view a reindex example on a Series object:
>>> s2.reindex([0, 2, 'b', 3])
0    0.6913
2    0.8627
b       NaN
3    0.7286
dtype: float64
When reindexed labels do not exist in the data object, a default value of NaN will be automatically assigned to the position; this holds true for the DataFrame case as well:
>>> df1.reindex(index=[0, 2, 'b', 3], columns=['Density', 'Year', 'Median_Age','C'])
   Density  Year  Median_Age   C
0      244  2000        24.2 NaN
2      268  2010        28.5 NaN
b      NaN   NaN ...
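When NaN is not the desired placeholder, reindex also accepts the fill_value and method parameters. The sketch below uses a small made-up Series, not the s2 or df1 objects from the text.

```python
import pandas as pd

# Illustrative Series with gaps in its integer labels.
s = pd.Series([1.0, 2.0, 3.0], index=[0, 2, 4])
new_labels = [0, 1, 2, 3, 4]

# Default behavior: labels that did not exist get NaN.
default = s.reindex(new_labels)

# fill_value substitutes a constant for the new labels.
zero_filled = s.reindex(new_labels, fill_value=0.0)

# method="ffill" carries the last valid value forward; it requires
# the original index to be monotonically ordered.
forward = s.reindex(new_labels, method="ffill")

print(default)
print(zero_filled)
print(forward)
```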
In this section, we will focus on how to get, set, or slice subsets of Pandas data structure objects. As we learned in previous sections, Series or DataFrame objects have axis labeling information. This information can be used to identify items that we want to select or assign a new value to in the object:
>>> s4[['024', '002']]    # selecting data of Series object
024     NaN
002    Mary
dtype: object
>>> s4[['024', '002']] = 'unknown'    # assigning data
>>> s4
024    unknown
065        NaN
002    unknown
001        Nam
dtype: object
If the data object is a DataFrame structure, we can also proceed in a similar way:
>>> df5[['b', 'c']]
   b  c
0  1  2
1  4  5
2  7  8
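For label indexing on the rows of a DataFrame, we use the ix indexer, which enables us to select a set of rows and columns in the object. (Note that ix was deprecated in Pandas 0.20 and removed in 1.0; in current versions, use loc for label-based selection and iloc for position-based selection.) There are two parameters that we need to specify: the row and column labels that we want to get. By default, if we do not specify...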
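On recent Pandas versions, the same kind of row and column selection can be sketched with loc and iloc. The frame below is a hypothetical stand-in built to resemble the df5 output shown earlier; it is not the original object.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the df5 used above.
df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "c"])

# loc is label-based: rows 0 through 1 (end label included), columns 'b' and 'c'.
by_label = df.loc[0:1, ["b", "c"]]

# iloc is position-based: first two rows, columns at positions 1 and 2
# (end position excluded).
by_position = df.iloc[0:2, 1:3]

# Omitting the column selector returns every column for that row.
row_one = df.loc[1]

print(by_label)
print(by_position)
print(row_one)
```

The two slices return the same block of values, which highlights the main difference to remember: label slices include their end point, positional slices do not.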
Let's start with correlation and covariance computation between two data objects. Both the Series and DataFrame have a cov method. On a DataFrame object, this method will compute the pairwise covariance between the columns (Series) inside the object:
>>> s1 = pd.Series(np.random.rand(3))
>>> s1
0    0.460324
1    0.993279
2    0.032957
dtype: float64
>>> s2 = pd.Series(np.random.rand(3))
>>> s2
0    0.777509
1    0.573716
2    0.664212
dtype: float64
>>> s1.cov(s2)
-0.024516360159045424
>>> df8 = pd.DataFrame(np.random.rand(12).reshape(4,3), columns=['a','b','c'])
>>> df8
          a         b         c
0  0.200049  0.070034  0.978615
1  0.293063  0.609812  0.788773
2  0.853431  0.243656  0.978057
3  0.985584  0.500765  0.481180
>>> df8.cov()
          a         b         c
a  0.155307  0.021273 -0.048449
b  0.021273  0.059925 -0.040029
c -0.048449 -0.040029  0.055067
Usage of the correlation...
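To make the relationship between the two statistics concrete, the sketch below recomputes the sample covariance and the Pearson correlation by hand and compares them with the cov and corr methods. The input values are the rounded s1 and s2 values printed above, so results agree with the text only up to rounding.

```python
import numpy as np
import pandas as pd

# Rounded values copied from the s1 and s2 output above.
x = pd.Series([0.460324, 0.993279, 0.032957])
y = pd.Series([0.777509, 0.573716, 0.664212])

# Sample covariance: sum of products of deviations, divided by N - 1.
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Pearson correlation: covariance scaled by both sample standard deviations.
manual_corr = manual_cov / (x.std() * y.std())

print(manual_cov, x.cov(y))
print(manual_corr, x.corr(y))
```

Because correlation is a rescaled covariance, it always falls between -1 and 1, which makes it easier to compare across pairs of variables with different units.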
In this section, we will discuss missing, NaN, or null values in Pandas data structures. It is very common to end up with missing data in an object. One operation that creates missing data is reindexing:
>>> df8 = pd.DataFrame(np.arange(12).reshape(4,3), columns=['a', 'b', 'c'])
>>> df8
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
>>> df9 = df8.reindex(columns = ['a', 'b', 'c', 'd'])
>>> df9
   a   b   c   d
0  0   1   2 NaN
1  3   4   5 NaN
2  6   7   8 NaN
3  9  10  11 NaN
>>> df10 = df8.reindex([3, 2, 'a', 0])
>>> df10
     a    b    c
3    9   10   11
2    6    7    8
a  NaN  NaN  NaN
0    0    1    2
To manipulate missing values, we can use the isnull() or notnull() functions to detect the missing values in a Series object, as well as in a DataFrame object:
>>> df10.isnull()
       a      b      c
3  False  False  False
2  False  False  False
a   True   True   True
0  False  False  False
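Once missing values are detected, two common follow-ups are dropping them or replacing them. The sketch below rebuilds a frame equivalent to df10 and applies dropna and fillna; choosing zero as the replacement value is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild a frame equivalent to df10: row 'a' has no source data.
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])
df = df.reindex([3, 2, "a", 0])

# Count missing entries per column from the boolean mask.
missing_per_column = df.isnull().sum()

# Drop every row that contains at least one missing value.
dropped = df.dropna()

# Replace missing values with a constant (0 is an arbitrary choice here).
filled = df.fillna(0)

print(missing_per_column)
print(dropped)
print(filled)
```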
In this section we will consider some advanced Pandas use cases.
Hierarchical indexing provides us with a way to work with higher dimensional data in a lower dimension by structuring the data object into multiple index levels on an axis:
>>> s8 = pd.Series(np.random.rand(8),
...                index=[['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
...                       [0, 1, 0, 1, 0, 1, 0, 1]])
>>> s8
a  0    0.721652
   1    0.297784
b  0    0.271995
   1    0.125342
c  0    0.444074
   1    0.948363
d  0    0.197565
   1    0.883776
dtype: float64
In the preceding example, we have a Series object that has two index levels. The object can be rearranged into a DataFrame using the unstack function. In the inverse situation, the stack function can be used:
>>> s8.unstack()
          0         1
a  0.721652  0.297784
b  0.271995  0.125342
c  0.444074  0.948363
d  0.197565  0.883776
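A quick way to see that unstack and stack are inverses is a round trip. The sketch below uses deterministic values instead of random ones so that the result can be checked; the index is built with MultiIndex.from_product, which is just one convenient way to get the same two-level layout.

```python
import numpy as np
import pandas as pd

# Two-level index: outer labels a..d, inner labels 0 and 1.
index = pd.MultiIndex.from_product([["a", "b", "c", "d"], [0, 1]])
s = pd.Series(np.arange(8, dtype=float), index=index)

# unstack moves the inner index level into the columns (4 rows x 2 columns).
wide = s.unstack()

# stack moves the columns back into the inner index level.
tall = wide.stack()

print(wide)
print(tall)
```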
We can also create a DataFrame to have a hierarchical index in both axes...
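As one possible sketch of such an object, the frame below has two index levels on the rows and two on the columns. All level names and labels ('group', 'item', 'var', 'stat', and so on) are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level labels for rows and columns.
rows = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["group", "item"])
cols = pd.MultiIndex.from_tuples(
    [("x", "mean"), ("x", "std"), ("y", "mean")], names=["var", "stat"]
)
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=rows, columns=cols)

# Selecting an outer row label returns the sub-frame for that group.
group_a = df.loc["a"]

# Selecting an outer column label returns the columns nested under it.
x_cols = df["x"]

# Full tuples on both axes identify a single value.
single = df.loc[("b", 1), ("y", "mean")]

print(df)
print(group_a)
print(x_cols)
print(single)
```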
We have finished covering the basics of the Pandas data analysis library. Whenever you learn about a library for data analysis, you need to consider the three parts that we explained in this chapter:
Data structures: We have two common data object types in the Pandas library: Series and DataFrame.
Methods to access and manipulate data objects: Pandas supports many ways to select, set, or slice subsets of a data object. However, the general mechanism is using index labels or the positions of items to identify values.
Functions and utilities: They are the most important part of a powerful library. In this chapter, we covered the most common functions of Pandas, which allow us to compute statistics on data easily.
The library also has a lot of other useful functions and utilities that we could not explain in this chapter. We encourage you to do your own research if you want to expand your experience with Pandas. It helps us to process large data in an optimized way. You will see more...