You're reading from  IPython Notebook Essentials

Product type: Book
Published in: Nov 2014
ISBN-13: 9781783988341
Edition: 1st Edition
Author (1)
Luiz Felipe Martins

Luiz Felipe Martins holds a PhD in applied mathematics from Brown University and has worked as a researcher and educator for more than 20 years. His research is mainly in the field of applied probability. He has been involved in developing code for the open source homework system, WeBWorK, where he wrote a library for the visualization of systems of differential equations. He was supported by an NSF grant for this project. Currently, he is an Associate Professor in the Department of Mathematics at Cleveland State University, Cleveland, Ohio, where he has developed several courses in applied mathematics and scientific computing. His current duties include coordinating all first-year calculus sessions.

Chapter 4. Handling Data with pandas

In this chapter, we will introduce pandas, a powerful and versatile Python library that provides tools for data handling and analysis. We will consider the two main pandas structures for storing data, the Series and DataFrame objects, in detail. You will learn how to create these structures and how to access and insert data into them. We also cover the important topic of slicing, that is, how to access portions of data using the different indexing methods provided by pandas. Next, we'll discuss the computational and graphics tools offered by pandas, and finish the chapter by demonstrating how to work with a realistic dataset.

pandas is an extensive package for data-oriented manipulation, and it is beyond the scope of this book to realistically cover all aspects of the package. We will cover only some of the most useful data structures and functionalities. In particular, we will not cover the Panel data structure and multi-indexes. However, we will provide...

The Series class


A Series object represents a one-dimensional, indexed series of data. It can be thought of as a dictionary, with one main difference: the index labels of a Series object are ordered. The following example constructs a Series object and displays it:

from numpy import float64
from pandas import Series

grades1 = Series([76, 82, 78, 100],
                 index=['Alex', 'Robert', 'Minnie', 'Alice'],
                 name='Assignment 1', dtype=float64)
grades1

This produces the following output:

Alex       76
Robert     82
Minnie     78
Alice     100
Name: Assignment 1, dtype: float64

Notice the format of the constructor call:

Series(<data>, index=<indexes>, name=<name>, dtype=<type>)

Both data and indexes are usually lists or NumPy arrays, but can be any Python iterable. The two must have the same length. The name argument is a string that describes the data in the series. The type argument is a NumPy data type. The indexes and name arguments are optional (if indexes are omitted, they are set to integers...
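As a minimal sketch of the constructor call described above, the following compares a Series built without labels to one built with them. The variable names plain and labeled are illustrative only and do not come from the book; the data reuses the grades from the earlier example.

```python
from pandas import Series

# Index and name omitted: pandas assigns integer labels starting at 0.
plain = Series([76, 82, 78, 100], dtype="float64")

# Explicit labels allow dictionary-style lookup by label.
labeled = Series([76, 82, 78, 100],
                 index=['Alex', 'Robert', 'Minnie', 'Alice'],
                 name='Assignment 1', dtype="float64")

print(list(plain.index))    # integer labels
print(labeled['Minnie'])    # lookup by label, as with a dictionary
print(list(labeled.index))  # labels keep the order in which they were given
```

Unlike a plain dictionary lookup, the labeled version also keeps a name and a single data type for all of its values.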

The DataFrame class


The DataFrame class is used to represent two-dimensional data. To illustrate its use, let's create a DataFrame object containing student data as follows:

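from numpy import nan
from pandas import DataFrame

# Each inner list is one row; nan marks a missing grade.
grades = DataFrame(
    [['Alice',  80., 92., 84.],
     ['Bob',    78., nan, 86.],
     ['Samaly', 75., 78., 88.]],
    index=[17005, 17035, 17028],
    columns=['Name', 'Test 1', 'Test 2', 'Final']
    )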

This code demonstrates one of the most straightforward ways to construct a DataFrame object. The data can be specified as any two-dimensional Python data structure, such as a list of lists (as shown in the example) or a NumPy array. The index option sets the row names, which are integers representing student IDs here. Likewise, the columns option sets the column names. Both the index and columns arguments can be given as any one-dimensional Python structure, such as a list, a NumPy array, or a Series object.
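To illustrate the other input types mentioned above, here is a small sketch that builds a DataFrame from a NumPy array, with the row labels supplied as a Series object. The names scores, student_ids, and test_grades are illustrative only; the numbers repeat the numeric grades from the example.

```python
import numpy as np
from pandas import DataFrame, Series

# Numeric grades only, stored as a two-dimensional NumPy array.
scores = np.array([[80., 92., 84.],
                   [78., np.nan, 86.],
                   [75., 78., 88.]])

# Row labels given as a Series, column labels given as a list.
student_ids = Series([17005, 17035, 17028])
test_grades = DataFrame(scores, index=student_ids,
                        columns=['Test 1', 'Test 2', 'Final'])

print(test_grades.shape)
print(list(test_grades.index))
print(list(test_grades.columns))
```

Leaving the Name column out keeps every column numeric, which is why a single floating-point array is enough here.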

To display the DataFrame object, run the following statement in a cell:

grades...

Computational and graphics tools


The objects of pandas have a rich set of built-in computational tools. To illustrate some of this functionality, we will use the random data stored in the dframe object defined in the previous section. If you discarded that object, here is how to construct it again:

from numpy.random import normal
from pandas import DataFrame

means = [0, 0, 1, 1, -1, -1, -2, -2]
sdevs = [1, 2, 1, 2,  1,  2,  1,  2]
random_data = {}
nrows = 30
# One column of normally distributed samples per (mean, sd) pair.
for mean, sdev in zip(means, sdevs):
    label = 'Mean={}, sd={}'.format(mean, sdev)
    random_data[label] = normal(mean, sdev, nrows)
row_labels = ['Row {}'.format(i) for i in range(nrows)]
dframe = DataFrame(random_data, index=row_labels)

Let's explore some of these built-in computational tools.

  • To get a list of the methods available for the object, start typing the following command in a cell:

    dframe.
    
  • Then, press the Tab key. The completion popup allows us to select a method by double-clicking on it. For example, double-click on mean. The cell text changes to the following...
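The same methods can also be typed and called directly. The following is a minimal sketch, not the notebook's completion output: it builds a smaller table than dframe (the name sample, the two columns, and the seed are assumptions made for illustration) and calls the mean and describe methods on it.

```python
import numpy as np
from pandas import DataFrame

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(0)
sample = DataFrame({
    'Mean=0, sd=1': rng.normal(0, 1, 30),
    'Mean=1, sd=2': rng.normal(1, 2, 30),
})

col_means = sample.mean()        # one value per column (the default axis)
row_means = sample.mean(axis=1)  # one value per row
summary = sample.describe()      # count, mean, std, min, quartiles, max

print(col_means)
print(summary.loc[['mean', 'std']])
```

With only 30 rows per column, the sample means and standard deviations will be close to, but not equal to, the values in the column labels.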

An example with a realistic dataset


In this section, we will work with the World Development Indicators dataset, which is provided free of charge by the World Bank. It is a realistic dataset of moderate size, neither too large nor too complex to experiment with.

In any real application, we will need to read data from some source, reformat it for our purposes, and save the reformatted data back to some storage system. pandas offers facilities for data retrieval and storage in multiple formats:

  • Comma-separated values (CSV) in text files

  • Excel

  • JSON

  • SQL

  • HTML

  • Stata

  • Clipboard data in text format

  • Python-pickled data

The list of formats supported by pandas keeps growing with each new update to the library. Please refer to http://pandas.pydata.org/pandas-docs/stable/io.html for a current list.
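As a rough illustration of the read and save cycle for the first format in the list, the sketch below writes a small table as CSV text and reads it back. The records and the names original, csv_text, and restored are hypothetical; an in-memory buffer stands in for a file on disk so that the example has no external dependencies.

```python
import io
from pandas import DataFrame, read_csv

# Hypothetical records, used only for illustration.
original = DataFrame({'country': ['A', 'B', 'C'],
                      'value': [1.5, 2.5, 3.5]})

# Save: write CSV text to an in-memory buffer, omitting the row labels.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Retrieve: parse the same text back into a new DataFrame.
restored = read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

Passing a file path instead of a buffer to to_csv and read_csv works the same way for data stored on disk.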

Covering all the formats supported by pandas is not possible within the scope of this book. We will restrict examples to CSV files, a simple text format that is widely...

Summary


In this chapter, we covered the objects of pandas, Series and DataFrame, which are specialized containers for data-oriented computations. We discussed how to create, access, and modify these objects, including advanced indexing and slicing operations. We also considered the computational and graphical capabilities offered by pandas. We then discussed how these capabilities can be leveraged to work with a realistic dataset.

In the next chapter, we will learn how to use SciPy to solve advanced mathematical problems of modeling, science, and engineering.
