Learning Pandas: Get to grips with pandas - a versatile and high-performance Python library for data manipulation, analysis, and discovery

Chapter 1. A Tour of pandas

In this chapter, we will take a look at pandas, which is an open source Python-based data analysis library. It provides high-performance and easy-to-use data structures and data analysis tools built with the Python programming language. The pandas library brings many of the good things from R, specifically the DataFrame objects and R packages such as plyr and reshape2, and places them in a single library that you can use in your Python applications.

The development of pandas was begun in 2008 by Wes McKinney when he worked at AQR Capital Management. It was open sourced in 2009 and is currently supported and actively developed by various organizations and contributors. It was initially designed with finance in mind, particularly for its strength in time series data manipulation, but it emphasizes the data manipulation part of the equation, leaving statistical, financial, and other types of analyses to other Python libraries.

In this chapter, we will take a brief tour of pandas and some of the associated tools such as IPython notebooks. You will be introduced to a variety of concepts in pandas for data organization and manipulation in an effort to form both a base understanding and a frame of reference for deeper coverage in later sections of this book. By the end of this chapter, you will have a good understanding of the fundamentals of pandas and even be able to perform basic data manipulations. Also, you will be ready to continue with later portions of this book for more detailed understanding.

This chapter will introduce you to:

  • pandas and why it is important
  • IPython and IPython Notebooks
  • Referencing pandas in your application
  • The Series and DataFrame objects of pandas
  • How to load data from files and the Web
  • The simplicity of visualizing pandas data

Note

pandas is always lowercase by convention in pandas documentation, and this will be a convention followed by this book.

pandas and why it is important

pandas is a library containing high-level data structures and tools that have been created to assist a Python programmer to perform powerful data manipulations, and discover information in that data in a simple and fast way.

Simple and effective data analysis requires the ability to index, retrieve, tidy, reshape, combine, slice, and perform various analyses on both single and multidimensional data, including heterogeneous typed data that is automatically aligned along index labels. To enable these capabilities, pandas provides the following features (and many more not explicitly mentioned here):

  • High-performance array and table structures for representation of homogeneous and heterogeneous data sets: the Series and DataFrame objects
  • Flexible reshaping of data structure, allowing the ability to insert and delete both rows and columns of tabular data
  • Hierarchical indexing of data along multiple axes (both rows and columns), allowing multiple labels per data item
  • Labeling of series and tabular data to facilitate indexing and automatic alignment of data
  • Ability to easily identify and fix missing data, in both floating-point and non-floating-point formats
  • Powerful grouping capabilities and a functionality to perform split-apply-combine operations on series and tabular data
  • Simple conversion from ragged and differently indexed data of both NumPy and Python data structures to pandas objects
  • Smart label-based slicing and subsetting of data sets, including intuitive and flexible merging, and joining of data with SQL-like constructs
  • Extensive I/O facilities to load and save data from multiple formats including CSV, Excel, relational and non-relational databases, HDF5 format, and JSON
  • Explicit support for time series-specific functionality, providing functionality for date range generation, moving window statistics, time shifting, lagging, and so on
  • Built-in support to retrieve and automatically parse data from various web-based data sources such as Yahoo!, Google Finance, the World Bank, and several others
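
As a rough illustration of the split-apply-combine feature listed above, the following sketch groups a small table by city and averages each group. The readings data and its column names are made up for this example and are not part of the book's code bundle.

```python
import pandas as pd

# hypothetical sample data: two temperature readings per city
readings = pd.DataFrame({
    'city': ['Missoula', 'Missoula', 'Philadelphia', 'Philadelphia'],
    'temp': [80, 90, 70, 76],
})

# split the rows by city, apply mean() to each group,
# and combine the results into a Series indexed by city
mean_by_city = readings.groupby('city')['temp'].mean()
print(mean_by_city)
```

The result is a Series whose index labels are the city names, so it can be used with the same label-based lookups as any other Series.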

For those desiring to get into data analysis and the emerging field of data science, pandas offers an excellent means for a Python programmer (or just an enthusiast) to learn data manipulation. For those just learning or coming from a statistical language like R, pandas can offer an excellent introduction to Python as a programming language.

pandas itself is not a data science toolkit. It does provide some statistical methods as a matter of convenience, but to draw conclusions from data, it leans upon other packages in the Python ecosystem, such as SciPy, NumPy, and scikit-learn, and upon graphics libraries such as matplotlib and ggvis for data visualization. This is actually a strength of pandas relative to languages such as R, as pandas applications are able to leverage an extensive network of robust Python frameworks already built and tested elsewhere.

In this book, we will look at how to use pandas for data manipulation, with a specific focus on gathering, cleaning, and manipulation of various forms of data using pandas. Detailed specifics of data science, finance, econometrics, social network analysis, Python, and IPython are left as reference. You can refer to some other excellent books on these topics already available at https://www.packtpub.com/.

pandas and IPython Notebooks

A popular way to use pandas is through IPython Notebooks. IPython Notebooks provide a web-based interactive computational environment, allowing the combination of code, text, mathematics, plots, and rich media into a web-based document. IPython Notebooks run in a browser and contain Python code that is run in a local or server-side Python session that the notebooks communicate with using WebSockets. Notebooks can also contain markup code and rich media content, and can be converted to other formats such as PDF, HTML, and slide shows.

The following is an example of an IPython Notebook from the IPython website (http://ipython.org/notebook.html) that demonstrates the rich capabilities of notebooks:

[Figure: example IPython Notebook from the IPython website]

IPython Notebooks are not strictly required for using pandas and can be installed into your development environment independently or alongside pandas. During the course of this book, we will install pandas and an IPython Notebook server. You will be able to perform code examples in the text directly in an IPython console interpreter, and the examples will be packaged as notebooks that can be run with a local notebook server. Additionally, the workbooks will be available online for easy and immediate access at https://wakari.io/sharing/bundle/LearningPandas/LearningPandas_Index.

Note

To learn more about IPython Notebooks, visit the notebooks site at http://ipython.org/ipython-doc/dev/notebook/, and for more in-depth coverage, refer to another book, Learning IPython for Interactive Computing and Data Visualization, Cyrille Rossant, Packt Publishing.

Referencing pandas in the application

All pandas programs and examples in this book will always start by importing pandas (and NumPy) into the Python environment. There is a common convention used in many publications (web and print) of importing pandas and NumPy, which will also be used throughout this book. All workbooks and examples for chapters will start with code similar to the following to initialize the pandas library within Python.

In [1]:
   # import numpy and pandas, and DataFrame / Series
   import numpy as np
   import pandas as pd
   from pandas import DataFrame, Series

   # Set some pandas options
   pd.set_option('display.notebook_repr_html', False)
   pd.set_option('display.max_columns', 10)
   pd.set_option('display.max_rows', 10)

   # And some items for matplotlib
   %matplotlib inline 
   import matplotlib.pyplot as plt
   pd.options.display.mpl_style = 'default'

NumPy and pandas go hand-in-hand, as much of pandas is built on NumPy. It is, therefore, very convenient to import NumPy and put it in a np. namespace. Likewise, pandas is imported and referenced with a pd. prefix. Since DataFrame and Series objects of pandas are used so frequently, the third line then imports the Series and DataFrame objects into the global namespace so that we can use them without a pd. prefix.

The three pd.set_option() function calls set up some defaults for IPython Notebooks and console output from pandas. The first turns off HTML rendering of pandas objects in notebooks so that they are shown as plain text, and the other two limit the output to at most 10 columns and 10 rows. They can be used to modify the output of IPython and pandas to fit your personal needs to display results. The options set here are convenient for formatting the output of the examples to the constraints of the text.
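
To check which display settings are in effect, the options can be read back and later restored. The following is a minimal sketch using pd.get_option() and pd.reset_option(); the variable names are chosen only for this illustration.

```python
import pandas as pd

# limit console output to at most 10 rows and 10 columns
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)

# read the current values back to confirm the settings
max_rows = pd.get_option('display.max_rows')
max_cols = pd.get_option('display.max_columns')
print(max_rows, max_cols)

# restore the library defaults when finished
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
```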

Primary pandas objects

A programmer of pandas will spend most of their time using two primary objects provided by the pandas framework: Series and DataFrame. The DataFrame objects will be the overall workhorse of pandas and the most frequently used as they provide the means to manipulate tabular and heterogeneous data.

The pandas Series object

The base data structure of pandas is the Series object, which is designed to operate similarly to a NumPy array but also adds index capabilities. A simple way to create a Series object is by initializing it with a Python array or Python list.

In [2]:
   # create a four item Series
   s = Series([1, 2, 3, 4])
   s

Out [2]:
   0    1
   1    2
   2    3
   3    4
   dtype: int64

This has created a pandas Series from the list. Notice that printing the series resulted in what appears to be two columns of data. The first column in the output is not a column of the Series object, but the index labels. The second column is the values of the Series object. Each row represents the index label and the value for that label. This Series was created without specifying an index, so pandas automatically creates indexes starting at zero and increasing by one.

Elements of a Series object can be accessed through the index using []. This informs the Series which value to return given one or more index values (referred to in pandas as labels). The following code retrieves the items in the series with labels 1 and 3.

In [3]:
   # return a Series with the rows with labels 1 and 3
   s[[1, 3]]

Out [3]:
   1    2
   3    4
   dtype: int64

Note

It is important to note that the lookup here is not by zero-based positions 1 and 3 like an array, but by the values in the index.
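
The difference between labels and positions is easier to see when the integer labels are not in zero-based order. The sketch below uses a Series with made-up labels (the name s2 and its labels are assumptions for illustration) and compares a label lookup with an explicit positional lookup through .iloc.

```python
import pandas as pd

# a Series whose integer labels run in reverse order
s2 = pd.Series([1, 2, 3, 4], index=[3, 2, 1, 0])

# label-based lookup: the values stored under labels 1 and 3
by_label = s2[[1, 3]]

# position-based lookup: the second and fourth values
by_position = s2.iloc[[1, 3]]

print(by_label.tolist())     # values found by label
print(by_position.tolist())  # values found by position
```

Here the two lookups return different values, because label 1 refers to the third item while position 1 refers to the second.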

A Series object can be created with a user-defined index by specifying the labels for the index using the index parameter.

In [4]:
   # create a series using an explicit index
   s = Series([1, 2, 3, 4], 
              index = ['a', 'b', 'c', 'd'])
   s

Out [4]:
   a    1
   b    2
   c    3
   d    4
   dtype: int64

Note

Notice that the index labels in the output now have the index values that were specified in the Series constructor.

Data in the Series object can now be accessed by alphanumeric index labels by passing a list of the desired labels, as the following demonstrates:

In [5]:
   # look up the items in the series having index labels 'a' and 'd'
   s[['a', 'd']]

Out [5]:
   a    1
   d    4
   dtype: int64

Note

This demonstrates the previous point that the lookup is by label value and not by zero-based position.

It is still possible to refer to the elements of the Series object by their numerical position.

In [6]:
   # passing a list of integers to a Series that has
   # non-integer index labels will look up based upon
   # 0-based index like an array
   s[[1, 2]]

Out [6]:
   b    2
   c    3
   dtype: int64

Note

A Series is still smart enough to determine that you passed a list of integers and, therefore, that you want to do value lookup by zero-based position.
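
This implicit fallback from labels to positions is easy to misread, and newer pandas releases steer users toward explicit indexers instead. A minimal sketch of the same two lookups written with .loc (labels) and .iloc (zero-based positions) follows; it assumes only the four-item Series created earlier.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# explicit label-based lookup
by_label = s.loc[['a', 'd']]

# explicit position-based lookup (zero-based),
# selecting the same rows as s[[1, 2]] in the text
by_position = s.iloc[[1, 2]]

print(by_label)
print(by_position)
```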

The s.index property allows direct access to the index of the Series object.

In [7]:
   # get only the index of the Series
   s.index

Out [7]:
   Index([u'a', u'b', u'c', u'd'], dtype='object')

The index is itself actually a pandas object. This shows us the values of the index and that the data type of each label in the index is object.

A common usage of a Series in pandas is to represent a time series that associates date/time index labels with a value. A date range can be created using the pandas method pd.date_range().

In [8]:
   # create an index consisting of the dates
   # between the two specified dates (inclusive)
   dates = pd.date_range('2014-07-01', '2014-07-06')
   dates

Out [8]:
   <class 'pandas.tseries.index.DatetimeIndex'>
   [2014-07-01, ..., 2014-07-06]
   Length: 6, Freq: D, Timezone: None

Note

This has created a special index in pandas referred to as a DatetimeIndex, which is a pandas index that is optimized to index data with dates and times.

At this point, the index is not particularly useful without having values for each index. We can use this index to create a new Series object with values for each of the dates.

In [9]:
   # create a Series with values (representing temperatures)
   # for each date in the index
   temps1 = Series([80, 82, 85, 90, 83, 87], 
                   index = dates)
   temps1

Out [9]:
   2014-07-01    80
   2014-07-02    82
   2014-07-03    85
   2014-07-04    90
   2014-07-05    83
   2014-07-06    87
   Freq: D, dtype: int64

Statistical methods provided by NumPy can be applied to a pandas Series. The following returns the mean of the values in the Series.

In [10]:
   # calculate the mean of the values in the Series
   temps1.mean()

Out [10]:
   84.5
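
As a small cross-check, the same statistic can be computed with NumPy directly on the underlying values. This sketch rebuilds the temperature Series so that it runs on its own; series_mean and numpy_mean are names chosen for this illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
temps1 = pd.Series([80, 82, 85, 90, 83, 87], index=dates)

# the Series method and the NumPy function agree
series_mean = temps1.mean()
numpy_mean = np.mean(temps1.values)

print(series_mean, numpy_mean)
```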

An arithmetic operation can be applied to two Series objects. The following code calculates the difference in temperature between two Series.

In [11]:
   # create a second series of values using the same index
   temps2 = Series([70, 75, 69, 83, 79, 77], 
                   index = dates)
   # the following aligns the two by their index values
   # and calculates the difference at those matching labels
   temp_diffs = temps1 - temps2
   temp_diffs

Out [11]:
   2014-07-01    10
   2014-07-02     7
   2014-07-03    16
   2014-07-04     7
   2014-07-05     4
   2014-07-06    10
   Freq: D, dtype: int64

Note

The result of an arithmetic operation (+, -, /, *, …) on two Series objects is another Series object.
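
Alignment matters most when the two Series do not share exactly the same labels. The following sketch uses two small made-up Series (a and b are hypothetical names) to show that values are matched by label, and that labels present in only one operand produce NaN in the result.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# values are added where labels match ('y' and 'z');
# 'x' and 'w' exist in only one Series, so the result is NaN there
total = a + b
print(total)
```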

Time series data such as that shown here can also be accessed via the index or by an offset into the Series object.

In [12]:
   # lookup a value by date using the index
   temp_diffs['2014-07-03']

Out [12]:
   16

In [13]:
   # and also possible by integer position as if the 
   # series was an array
   temp_diffs[2]

Out [13]:
   16

The pandas DataFrame object

A pandas Series represents a single array of values, with an index label for each value. If you want to have more than one Series of data that is aligned by a common index, then a pandas DataFrame is used.

Note

In a way a DataFrame is analogous to a database table in that it contains one or more columns of data of heterogeneous type (but a single type for all items in each respective column).
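
To make the point about column types concrete, here is a short sketch with a made-up DataFrame (the column names are assumptions for illustration) whose columns hold text, floating-point, and Boolean values. The dtypes property reports one type per column.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Missoula', 'Philadelphia'],
    'temp': [80.5, 70.0],
    'is_hot': [True, False],
})

# each column has a single type, but the types differ across columns
print(df.dtypes)
```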

The following code creates a DataFrame object with two columns representing the temperatures from the Series objects used earlier.

In [14]:
   # create a DataFrame from the two series objects temps1 and temps2
   # and give them column names
   temps_df = DataFrame(
               {'Missoula': temps1, 
                'Philadelphia': temps2})
   temps_df

Out [14]:
               Missoula  Philadelphia
   2014-07-01        80            70
   2014-07-02        82            75
   2014-07-03        85            69
   2014-07-04        90            83
   2014-07-05        83            79
   2014-07-06        87            77

Note

This has created a DataFrame object with two columns, named Missoula and Philadelphia, and using the values from the respective Series objects for each. These are new Series objects contained within the DataFrame, with the values copied from the original Series objects.

Columns in a DataFrame object can be accessed using an array indexer [] with the name of the column or a list of column names. The following code retrieves the Missoula column of the DataFrame object:

In [15]:
   # get the column with the name Missoula
   temps_df['Missoula']

Out [15]:
   2014-07-01    80
   2014-07-02    82
   2014-07-03    85
   2014-07-04    90
   2014-07-05    83
   2014-07-06    87
   Freq: D, Name: Missoula, dtype: int64

The following code retrieves the Philadelphia column:

In [16]:
   # likewise we can get just the Philadelphia column
   temps_df['Philadelphia']

Out [16]:
   2014-07-01    70
   2014-07-02    75
   2014-07-03    69
   2014-07-04    83
   2014-07-05    79
   2014-07-06    77
   Freq: D, Name: Philadelphia, dtype: int64

The following code returns both the columns, but reversed.

In [17]:
   # return both columns in a different order
   temps_df[['Philadelphia', 'Missoula']]

Out [17]:
               Philadelphia  Missoula
   2014-07-01            70        80
   2014-07-02            75        82
   2014-07-03            69        85
   2014-07-04            83        90
   2014-07-05            79        83
   2014-07-06            77        87

Note

Notice that there is a subtle difference in a DataFrame object as compared to a Series object. Passing a list to the [] operator of DataFrame retrieves the specified columns, whereas Series uses it as index labels to retrieve rows.

Very conveniently, if the name of a column does not have spaces, you can use property-style names to access the columns in a DataFrame.

In [18]:
   # retrieve the Missoula column through property syntax
   temps_df.Missoula

Out [18]:
   2014-07-01    80
   2014-07-02    82
   2014-07-03    85
   2014-07-04    90
   2014-07-05    83
   2014-07-06    87
   Freq: D, Name: Missoula, dtype: int64
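
When a column name contains a space, it is not a valid Python identifier, so the array indexer [] has to be used instead. The sketch below illustrates both forms on a small made-up DataFrame; the prices name and its values are assumptions for this example.

```python
import pandas as pd

prices = pd.DataFrame({'Close': [528.48, 529.24],
                       'Adj Close': [528.48, 529.24]})

# property-style access works for a name without spaces
close = prices.Close

# a name containing a space requires the [] indexer
adj_close = prices['Adj Close']

print(close.name, adj_close.name)
```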

Arithmetic operations between columns within a DataFrame are identical in operation to those on multiple Series as each column in a DataFrame is a Series. To demonstrate, the following code calculates the difference between temperatures using property notation.

In [19]:
   # calculate the temperature difference between the two cities
   temps_df.Missoula - temps_df.Philadelphia

Out [19]:
   2014-07-01    10
   2014-07-02     7
   2014-07-03    16
   2014-07-04     7
   2014-07-05     4
   2014-07-06    10
   Freq: D, dtype: int64

A new column can be added to DataFrame simply by assigning another Series to a column using the array indexer [] notation. The following code adds a new column in the DataFrame, which contains the difference in temperature on the respective dates.

In [20]:
   # add a column to temps_df that contains the difference in temps
   temps_df['Difference'] = temp_diffs
   temps_df

Out [20]:
               Missoula  Philadelphia  Difference
   2014-07-01        80            70          10
   2014-07-02        82            75           7
   2014-07-03        85            69          16
   2014-07-04        90            83           7
   2014-07-05        83            79           4
   2014-07-06        87            77          10

The names of the columns in a DataFrame are accessible via the DataFrame object's .columns property, which itself is a pandas Index object.

In [21]:
   # get the columns, which is also an Index object
   temps_df.columns

Out [21]:
   Index([u'Missoula', u'Philadelphia', u'Difference'], dtype='object')

The DataFrame (and Series) objects can be sliced to retrieve specific rows. A simple example here shows how to select the second through fourth rows of temperature difference values.

In [22]:
   # slice the temp differences column for the rows at
   # locations 1 up to, but not including, 4 (as though it is an array)
   temps_df.Difference[1:4]

Out [22]:
   2014-07-02     7
   2014-07-03    16
   2014-07-04     7
   Freq: D, Name: Difference, dtype: int64
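
Slices can also be written with index labels. As a sketch of the difference, the positional slice below excludes its end position, whereas the label slice through .loc includes its end label. The diffs Series is rebuilt here so the example runs on its own.

```python
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
diffs = pd.Series([10, 7, 16, 7, 4, 10], index=dates)

# positions 1, 2, and 3 (position 4 is excluded)
by_position = diffs.iloc[1:4]

# labels from 2014-07-02 through 2014-07-04 (end label included)
by_label = diffs.loc['2014-07-02':'2014-07-04']

print(by_position)
print(by_label)
```

Both slices select the same three rows in this case.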

Entire rows from a DataFrame can be retrieved using its .loc and .iloc properties. The following code returns a Series object representing the second row of the temps_df DataFrame, selected by the zero-based position of the row using the .iloc property:

In [23]:
   # get the row at array position 1
   temps_df.iloc[1]

Out [23]:
   Missoula        82
   Philadelphia    75
   Difference       7
   Name: 2014-07-02 00:00:00, dtype: int64

This has converted the row into a Series, with the column names of the DataFrame pivoted into the index labels of the resulting Series.

In [24]:
   # the names of the columns have become the index
   # they have been 'pivoted'
   temps_df.iloc[1].index

Out [24]:
   Index([u'Missoula', u'Philadelphia', u'Difference'], dtype='object')

Rows can be explicitly accessed via index label using the .loc property. The following code retrieves a row by the index label:

In [25]:
   # retrieve row by index label using .loc
   temps_df.loc['2014-07-03']

Out [25]:
   Missoula        85
   Philadelphia    69
   Difference      16
   Name: 2014-07-03 00:00:00, dtype: int64
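
The .loc property also accepts a column label after the row label, which retrieves a single value. The following self-contained sketch rebuilds the two temperature columns and compares row access by position, row access by label, and single-value access; the variable names are chosen for this illustration.

```python
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]},
                        index=dates)

# the third row, selected by zero-based position
row_by_position = temps_df.iloc[2]

# the same row, selected by its index label
row_by_label = temps_df.loc['2014-07-03']

# a single value, selected by row label and column label
single_value = temps_df.loc['2014-07-03', 'Missoula']

print(row_by_label)
print(single_value)
```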

Specific rows in a DataFrame object can be selected using a list of integer positions. The following code selects the values from the Difference column in rows at locations 1, 3, and 5.

In [26]:
   # get the values in the Difference column in rows 1, 3, and 5
   # using 0-based location
   temps_df.iloc[[1, 3, 5]].Difference

Out [26]:
   2014-07-02     7
   2014-07-04     7
   2014-07-06    10
   Name: Difference, dtype: int64

Rows of a DataFrame can be selected based upon a logical expression applied to the data in each row. The following code returns the evaluation of the value in the Missoula temperature column being greater than 82 degrees:

In [27]:
   # which values in the Missoula column are > 82?
   temps_df.Missoula > 82

Out [27]:
   2014-07-01    False
   2014-07-02    False
   2014-07-03     True
   2014-07-04     True
   2014-07-05     True
   2014-07-06     True
   Freq: D, Name: Missoula, dtype: bool

When using the result of an expression as the parameter to the [] operator of a DataFrame, the rows where the expression evaluated to True will be returned.

In [28]:
   # return the rows where the temps for Missoula > 82
   temps_df[temps_df.Missoula > 82]

Out [28]:
               Missoula  Philadelphia  Difference
   2014-07-03        85            69          16
   2014-07-04        90            83           7
   2014-07-05        83            79           4
   2014-07-06        87            77          10

This technique of selection in pandas terminology is referred to as a Boolean selection, and will form the basis of selecting data based upon its values.
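
Boolean expressions can be combined with the & (and) and | (or) operators, with each condition wrapped in parentheses. The sketch below, which rebuilds the temperature data so that it runs on its own, selects the rows where Missoula is above 82 degrees and Philadelphia is above 75 degrees; hot_both is a name chosen for this example.

```python
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]},
                        index=dates)

# each condition yields a Boolean Series; & combines them row by row
hot_both = temps_df[(temps_df.Missoula > 82) &
                    (temps_df.Philadelphia > 75)]
print(hot_both)
```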

Loading data from files and the Web

The data used in analyses is typically provided from other systems via files that are created and updated at various intervals, dynamically via access over the Web, or from various types of databases. The pandas library provides powerful facilities for easy retrieval of data from a variety of data sources and converting it into pandas objects. Here, we will briefly demonstrate this ease of use by loading data from files and from financial web services.

Loading CSV data from files

The pandas library provides built-in support for loading data in .csv format, a common means of storing structured data in text files. Provided with the code from this book is a file data/test1.csv in the CSV format, which represents some time series information. The specific content isn't important right now, as we just want to demonstrate the ease of loading data into a DataFrame.

The following statement in IPython uses the operating system to display the content of this file (the command to use is different based upon your operating system).

In [29]:
   # display the contents of test1.csv
   # which command to use depends on your OS
   !cat data/test1.csv # on non-windows systems
   #!type data\test1.csv # on windows systems

   date,0,1,2
   2000-01-01 00:00:00,1.10376250134,-1.90997889703,-0.808955536115
   2000-01-02 00:00:00,1.18891664768,0.581119740849,0.86159734949
   2000-01-03 00:00:00,-0.964200042412,0.779764393246,1.82906224532
   2000-01-04 00:00:00,0.782130444001,-1.72066965573,-1.10824167327
   2000-01-05 00:00:00,-1.86701699823,-0.528368292754,-2.48830894087
   2000-01-06 00:00:00,2.56928022646,-0.471901478927,-0.835033249865
   2000-01-07 00:00:00,-0.39932258251,-0.676426550985,-0.0112559158931
   2000-01-08 00:00:00,1.64299299394,1.01341997845,1.43566709724
   2000-01-09 00:00:00,1.14730764657,2.13799951538,0.554171306191
   2000-01-10 00:00:00,0.933765825769,1.38715526486,-0.560142729978

This information can be easily imported into a DataFrame using the pd.read_csv() function.

In [30]:
   # read the contents of the file into a DataFrame
   df = pd.read_csv('data/test1.csv')
   df

Out [30]:
                     date         0         1         2
   0  2000-01-01 00:00:00  1.103763 -1.909979 -0.808956
   1  2000-01-02 00:00:00  1.188917  0.581120  0.861597
   2  2000-01-03 00:00:00 -0.964200  0.779764  1.829062
   3  2000-01-04 00:00:00  0.782130 -1.720670 -1.108242
   4  2000-01-05 00:00:00 -1.867017 -0.528368 -2.488309
   5  2000-01-06 00:00:00  2.569280 -0.471901 -0.835033
   6  2000-01-07 00:00:00 -0.399323 -0.676427 -0.011256
   7  2000-01-08 00:00:00  1.642993  1.013420  1.435667
   8  2000-01-09 00:00:00  1.147308  2.138000  0.554171
   9  2000-01-10 00:00:00  0.933766  1.387155 -0.560143

pandas has no idea that the first column is a date and has treated the contents of the date field as a string. This can be verified using the following Python statements:

In [31]:
   # the contents of the date column
   df.date

Out [31]:
   0    2000-01-01 00:00:00
   1    2000-01-02 00:00:00
   2    2000-01-03 00:00:00
   3    2000-01-04 00:00:00
   4    2000-01-05 00:00:00
   5    2000-01-06 00:00:00
   6    2000-01-07 00:00:00
   7    2000-01-08 00:00:00
   8    2000-01-09 00:00:00
   9    2000-01-10 00:00:00
   Name: date, dtype: object

In [32]:
   # we can get the first value in the date column
   df.date[0]

Out [32]:
   '2000-01-01 00:00:00'

In [33]:
   # it is a string
   type(df.date[0])

Out [33]:
   str

To guide pandas on how to convert data directly into a Python/pandas date object, we can use the parse_dates parameter of the pd.read_csv() function. The following code tells pandas to convert the content of the 'date' column into actual Timestamp objects.

In [34]:
   # read the data and tell pandas the date column should be 
   # a date in the resulting DataFrame
   df = pd.read_csv('data/test1.csv', parse_dates=['date'])
   df

Out [34]:
           date         0         1         2
   0 2000-01-01  1.103763 -1.909979 -0.808956
   1 2000-01-02  1.188917  0.581120  0.861597
   2 2000-01-03 -0.964200  0.779764  1.829062
   3 2000-01-04  0.782130 -1.720670 -1.108242
   4 2000-01-05 -1.867017 -0.528368 -2.488309
   5 2000-01-06  2.569280 -0.471901 -0.835033
   6 2000-01-07 -0.399323 -0.676427 -0.011256
   7 2000-01-08  1.642993  1.013420  1.435667
   8 2000-01-09  1.147308  2.138000  0.554171
   9 2000-01-10  0.933766  1.387155 -0.560143

On checking whether it worked, we see it is indeed a Timestamp object now.

In [35]:
   # verify the type now is date
   # in pandas, this is actually a Timestamp
   type(df.date[0])

Out [35]:
   pandas.tslib.Timestamp
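
If the data has already been loaded with the dates as strings, the column can also be converted afterwards. The following is a minimal sketch using pd.to_datetime() on a small made-up DataFrame (the value column and its contents are assumptions for illustration).

```python
import pandas as pd

df = pd.DataFrame({'date': ['2000-01-01 00:00:00',
                            '2000-01-02 00:00:00'],
                   'value': [1.1, 1.2]})

# convert the string column into Timestamp values in place
df['date'] = pd.to_datetime(df['date'])

first = df['date'][0]
print(type(first))
```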

Unfortunately, this has not used the date field as the index for the DataFrame. Instead, it uses the default zero-based integer index labels.

In [36]:
   # unfortunately the index is numeric, which makes
   # accessing data by date more complicated
   df.index

Out [36]:
   Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

This can be rectified using the index_col parameter of the pd.read_csv() function to specify which column in the file should be used as the index.

In [37]:
   # read in again, now specifying the date column as being the
   # index of the resulting DataFrame
   df = pd.read_csv('data/test1.csv', 
                    parse_dates=['date'], 
                    index_col='date')
   df

Out [37]:
                      0         1         2
   date                                    
   2000-01-01  1.103763 -1.909979 -0.808956
   2000-01-02  1.188917  0.581120  0.861597
   2000-01-03 -0.964200  0.779764  1.829062
   2000-01-04  0.782130 -1.720670 -1.108242
   2000-01-05 -1.867017 -0.528368 -2.488309
   2000-01-06  2.569280 -0.471901 -0.835033
   2000-01-07 -0.399323 -0.676427 -0.011256
   2000-01-08  1.642993  1.013420  1.435667
   2000-01-09  1.147308  2.138000  0.554171
   2000-01-10  0.933766  1.387155 -0.560143

In [38]:
   df.index

Out [38]:
   <class 'pandas.tseries.index.DatetimeIndex'>
   [2000-01-01, ..., 2000-01-10]
   Length: 10, Freq: None, Timezone: None
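
For readers who do not have data/test1.csv at hand, the same steps can be tried on in-memory text. This sketch feeds the first three rows of the file shown earlier to pd.read_csv() through io.StringIO, which stands in for the file; csv_text is a name chosen for this example.

```python
import io
import pandas as pd

# the header and first three rows of the file shown earlier
csv_text = """date,0,1,2
2000-01-01 00:00:00,1.10376250134,-1.90997889703,-0.808955536115
2000-01-02 00:00:00,1.18891664768,0.581119740849,0.86159734949
2000-01-03 00:00:00,-0.964200042412,0.779764393246,1.82906224532
"""

# parse the date column and use it as the index
df = pd.read_csv(io.StringIO(csv_text),
                 parse_dates=['date'],
                 index_col='date')

print(df)
print(type(df.index))
```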

Loading data from the Web

Data from the Web can also be easily read via pandas. To demonstrate this, we will perform a simple load of actual stock data. The example here uses the pandas.io.data.DataReader function, which is able to read data from various web sources, one of which is stock data from Yahoo! Finance.

The following reads the data of the previous three months for GOOG (based on the current date), and prints the five most recent days of stock data:

In [39]:
   # imports for reading data from Yahoo!
   from pandas.io.data import DataReader
   from datetime import date
   from dateutil.relativedelta import relativedelta

   # read the last three months of data for GOOG
   goog = DataReader("GOOG",  "yahoo", 
                     date.today() + 
                     relativedelta(months=-3))

   # the result is a DataFrame
   # and this gives us the 5 most recent prices
   goog.tail()

Out [39]:
                 Open    High     Low   Close   Volume  Adj Close
   Date                                                          
   2015-02-02  531.73  533.00  518.55  528.48  2826300     528.48
   2015-02-03  528.00  533.40  523.26  529.24  2029200     529.24
   2015-02-04  529.24  532.67  521.27  522.76  1656800     522.76
   2015-02-05  523.79  528.50  522.09  527.58  1840300     527.58
   2015-02-06  527.64  537.20  526.41  531.00  1744600     531.00

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. The code examples in the book are also publicly available on Wakari.io at https://wakari.io/sharing/bundle/LearningPandas/LearningPandas_Index.

The DataReader call actually performs quite a bit of work on your behalf. It makes the web requests to retrieve the CSV data and converts it into a DataFrame with the proper conversion of types for the various series of data.

Simplicity of visualization of pandas data

Visualizing pandas data is straightforward because pandas integrates tightly with the matplotlib framework. To demonstrate this, the following code plots the stock data we just retrieved from Yahoo! Finance:

In [40]:
   # plot the Adj Close values we just read in
   goog.plot(y='Adj Close');
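
The same plot can be reproduced without a network connection. The sketch below rebuilds a small DataFrame from the adjusted closing prices shown earlier and plots it; the name goog_sample, the title, the axis label, and the output file name are all illustrative choices, and the Agg backend is selected only so the code runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so no display is required
import pandas as pd

# rebuild a small frame from the prices shown earlier
goog_sample = pd.DataFrame(
    {"Adj Close": [528.48, 529.24, 522.76, 527.58, 531.00]},
    index=pd.to_datetime(["2015-02-02", "2015-02-03", "2015-02-04",
                          "2015-02-05", "2015-02-06"]))
goog_sample.index.name = "Date"

# DataFrame.plot returns a matplotlib Axes, which can be customized further
ax = goog_sample.plot(y="Adj Close", title="GOOG adjusted close")
ax.set_ylabel("Price")

# write the figure to disk (the file name is an arbitrary choice)
ax.get_figure().savefig("goog_adj_close.png")
```

Because plot hands back an ordinary matplotlib Axes object, anything matplotlib supports can be applied to the result afterwards.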

Note

We will cover pandas data visualization in greater depth and breadth in a chapter dedicated to it later in this book.

Summary

In this chapter, we have taken a quick tour of the capabilities of pandas and seen how easily you can use it to create, load, manipulate, and visualize data. Throughout the remainder of this book, we will dive into everything covered in this chapter in significant detail, fully demonstrating how to utilize the facilities of pandas for powerful data manipulation.

In the next chapter, we will look at how to get and install both Python and pandas. Following the installation, in Chapter 3, NumPy for pandas, we will dive into the NumPy framework as it applies to pandas, demonstrating how NumPy provides the core functionality to slice and dice array-based data in an array-like manner, as the pandas Series and DataFrame objects extensively leverage the capabilities of NumPy.


Description

If you are a Python programmer who wants to get started with performing data analysis using pandas and Python, this is the book for you. Some experience with statistical analysis would be helpful but is not mandatory.

Product Details

Publication date : Apr 16, 2015
Length: 504 pages
Edition : 1st
Language : English
ISBN-13 : 9781783985128

Table of Contents

13 Chapters
1. A Tour of pandas
2. Installing pandas
3. NumPy for pandas
4. The pandas Series Object
5. The pandas DataFrame Object
6. Accessing Data
7. Tidying Up Your Data
8. Combining and Reshaping Data
9. Grouping and Aggregating Data
10. Time-series Data
11. Visualization
12. Applications to Finance
Index

Customer reviews

Rating distribution
4.2 out of 5 (10 Ratings)
5 star 50%
4 star 40%
3 star 0%
2 star 0%
1 star 10%

Natester Jun 06, 2015
5 out of 5 stars
I've been working with the pandas library for a while but had been looking for a text to help navigate the rich feature set of the pandas library. I purchased this book as soon as it became available and I'm quite satisfied with the content. I skipped the first few chapters, but if you are new to Python and using Python packages, do be sure to go through the content. The next couple of chapters discuss the inner workings of pandas DataFrame and Series. Worth going through as it provides a foundation for the remainder of the book's examples. Around chapter 6 is where the application examples dig in and they are quite useful. I've referred to many of these examples. They include reading and writing data with different data sources, slicing and dicing data and running stats on your data. Examples towards the end of the book get progressively sophisticated with shaping data. I didn't read everything in those chapters, but towards the end of the book are some chapters on data visualization and working with time series data. Definitely a "must" if you are looking to make use of pandas in your data analysis work. I keep this ebook in my reference collection and refer to it when in need to figure out how to solve a data issue where pandas might be a good fit. A helpful book in the Python + data space.
Amazon Verified review
Loris Jun 20, 2015
5 out of 5 stars
In my job I make use of many scientific libraries and Pandas is one of those. I have been looking for a good Pandas reference book for a while and I got this book as soon as it was published. I am not exaggerating when I say that it is one of the best python-related book I ever read. It is not only well written but it is also very well organized and structured. The book begins by providing detailed instructions on how to install Pandas on Linux, MacOS X, and Windows. In the firsts chapters it introduces NumPy and both Pandas Series and DataFrames. These firsts chapters are really important, especially for beginners, as they explain basic concepts that will be used continuously through the book. The author also indicates in which situations Pandas behave differently from NumPy, something I ignored before reading the book. I found very useful the description of the different ways to access rows and columns in DataFrames (loc, iloc, ix, etc.). The author clearly explains which is the best method to use in different scenarios and gives important tips regarding the performances of the different methods. Personally the chapters I found more useful were those about “Tidying Up your Data”, “Combining and Reshaping Data”, and “Grouping and Aggregating Data”. These are not easy concepts and the author did a very good job explaining them and providing a lot of clear examples. I believe these chapters are where you realize how Pandas can greatly simplify data analysis. The chapter about visualization is particularly useful to those who do not have experience with matplotlib and want to learn how to do quick plots with pandas. To conclude, the book is an excellent guide to Pandas not only if you are a beginner but also if you already have some experience with the library. Beside being well written, it covers all the mayor features of Pandas and each topic is complemented with a lot of code.
Amazon Verified review
Trevor May 20, 2015
5 out of 5 stars
Warning! This is not a book for learning statistical methods of data analysis. Do not buy if that is what you are looking for. If you are interested in learning the tools for data analysis in python, then this book is for you. This book is great for anyone who wants to understand how to use the pandas library. The book is larger than most Packtpub books. The size is primarily due to the number of topics covered and the rich interactive set of examples to illustrate each topic. All the code in the book can be downloaded in the form of ipython notebooks. Which is by far the best learning median for python. This greatly enhances not only your ability to follow along with the examples, but to explore each topic yourself by altering the code to reinforce what you have learned. Also, the books really does the best job I've seen at building each piece of the puzzle one step at a time. The author assumes basic knowledge of python, and some familiarity with statistical definitions. Otherwise, nothing is referenced without first being explained, and everything is introduced in a logical way. If you want to understand how to use pandas from the ground up, then this book is for you.
Amazon Verified review
Harmon L. Jul 26, 2016
5 out of 5 stars
Excellent
Amazon Verified review
Lidija Novak Jan 05, 2019
5 out of 5 stars
Its a beginners book and a bit outdated, still what counts is its content. Straight forward, sharp and great set up of content and examples. What I hate is meaningless nonsense, this author is great of keeping text short but informative. Have not yet finished reading the book but the first chapters have answered all my previous questions I had on Pandas after reading another book. Great buy! After this book I will buy Pandas for Finance written by same author - Michael Heydt!
Amazon Verified review
