Learning IPython for Interactive Computing and Data Visualization, Second Edition

Product type: Book
Published in: Oct 2015
ISBN-13: 9781783986989
Pages: 200
Edition: 1st Edition
Author: Cyrille Rossant

Chapter 2. Interactive Data Analysis with pandas

In this chapter, we will cover the following topics:

  • Exploring a dataset in the Notebook

  • Manipulating data

  • Complex operations

We'll see how to load, explore, and visualize a real-world dataset with pandas, matplotlib, and seaborn, all in the Notebook. We will also perform data manipulations efficiently.

Exploring a dataset in the Notebook


Here, we will explore a dataset containing the taxi trips made in New York City in 2013. Maintained by the New York City Taxi and Limousine Commission, this 50GB dataset contains the date, time, geographical coordinates of pickup and dropoff locations, fare, and other information for 170 million taxi trips.

To keep analysis times reasonable, we will analyze a subset of this dataset containing 0.5% of all trips (about 850,000 rides). Compressed, this subset takes up a little less than 100MB. You are free to download and analyze the full dataset (or a larger subset), as explained below.
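One way to draw such a subset from a file too large to load at once is to read it in chunks and keep a random fraction of each chunk. The sketch below is only an illustration and not necessarily how the subset used in this chapter was produced. It runs on a small synthetic CSV held in memory; the column names `trip_id` and `fare_amount`, the chunk size, and the random seeds are all assumptions.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for a large CSV file (hypothetical columns).
rng = np.random.default_rng(0)
full = pd.DataFrame({
    'trip_id': np.arange(10000),
    'fare_amount': rng.uniform(2.5, 60.0, size=10000).round(2),
})
csv_file = io.StringIO(full.to_csv(index=False))

# Read the file 2,000 rows at a time and keep 0.5% of each chunk,
# so the whole table never has to fit in memory at once.
pieces = []
for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=2000)):
    pieces.append(chunk.sample(frac=0.005, random_state=i))

subset = pd.concat(pieces, ignore_index=True)
print(len(subset), "rows kept out of", len(full))
```

Sampling per chunk keeps memory use bounded by the chunk size, at the cost of drawing the same fraction from every chunk rather than a single sample over the whole file.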

Provenance of the data

You will find the data subset we will be using in this chapter at https://github.com/ipython-books/minibook-2nd-data.

If you are interested in the original dataset containing all trips, you can refer to https://github.com/ipython-books/minibook-2nd-code/tree/master/chapter2/cleaning. This page contains the code to download the original dataset...

Manipulating data


Visualizing raw data and computing basic statistics are particularly easy with pandas. All we have to do is choose a couple of columns in a DataFrame and use built-in statistical or visualization functions.

However, more sophisticated data manipulation methods quickly become necessary as we explore a dataset. In this section, we will first see how to select parts of a DataFrame. Then, we will see how to efficiently perform transformations and computations on columns.
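As a rough preview of these two kinds of operations, here is a minimal sketch on a tiny made-up DataFrame. The column names `trip_distance` and `trip_time_in_secs`, their units, and all values are assumptions chosen for illustration; they are not taken from the dataset files.

```python
import pandas as pd

# Tiny hypothetical table standing in for the taxi data.
trips = pd.DataFrame({
    'trip_distance': [0.5, 2.1, 7.8, 12.4],        # miles (assumed unit)
    'trip_time_in_secs': [180, 600, 1500, 2400],   # seconds (assumed unit)
})

# Selections: columns by name, rows by position, rows by label.
distances = trips['trip_distance']
first_two = trips.iloc[:2]
last_row = trips.loc[3]

# Selection with a boolean mask: keep trips longer than 5 miles.
long_trips = trips.loc[trips['trip_distance'] > 5]

# Column computation: vectorized arithmetic on whole columns,
# with no explicit Python loop over the rows.
hours = trips['trip_time_in_secs'] / 3600
trips['speed_mph'] = trips['trip_distance'] / hours

print(long_trips)
print(trips)
```

Operating on whole columns at once is what makes such computations efficient, since the per-element work happens in compiled code rather than in Python.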

We first import the NYC taxi dataset, as in the previous section.

In [1]: import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        %matplotlib inline
        data = pd.read_csv('data/nyc_data.csv', 
                           parse_dates=['pickup_datetime',
                                        'dropoff_datetime'])
        fare = pd.read_csv('data/nyc_fare.csv',
                           parse_dates=['pickup_datetime'])
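The `parse_dates` argument used above asks `read_csv` to convert the listed columns to datetime values instead of leaving them as text. The sketch below shows the difference on two made-up rows read from an in-memory string; only the column names come from the call above, and the values are invented.

```python
import io

import pandas as pd

# Two made-up rows using the column names from the call above.
csv_text = (
    "pickup_datetime,dropoff_datetime\n"
    "2013-01-01 00:00:00,2013-01-01 00:04:00\n"
    "2013-01-01 00:05:00,2013-01-01 00:21:00\n"
)

# Without parse_dates, the columns are read as plain text.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the columns are converted to datetime values.
parsed = pd.read_csv(io.StringIO(csv_text),
                     parse_dates=['pickup_datetime', 'dropoff_datetime'])

print(raw.dtypes)
print(parsed.dtypes)

# Datetime columns support arithmetic and the .dt accessor.
duration = parsed['dropoff_datetime'] - parsed['pickup_datetime']
print(duration.dt.total_seconds())
```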

The data and fare DataFrames are now loaded in the...

Complex operations


We've seen how to load, select, filter, and operate on data with pandas. In this section, we will show more complex manipulations that are typically done on full-blown databases based on SQL.

Tip

SQL

Structured Query Language (SQL) is a domain-specific language widely used to manage data in relational database management systems (RDBMS). pandas is partly inspired by SQL, which is familiar to many data analysts, and it can also connect to SQL databases. You will find more information about the links between pandas and SQL at http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html.
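To make the parallel concrete, the following sketch computes the same grouped average twice: once with a pandas `groupby`, and once with a SQL `GROUP BY` query run against an in-memory SQLite database. The table name `trips`, the column names `medallion` and `fare_amount`, and the values are assumptions made for this example.

```python
import sqlite3

import pandas as pd

# Hypothetical mini table; names and values are illustrative only.
trips = pd.DataFrame({
    'medallion': ['A', 'A', 'B', 'B', 'B'],
    'fare_amount': [5.0, 7.5, 10.0, 12.0, 8.0],
})

# pandas: group the rows by key, then aggregate each group.
by_pandas = (trips.groupby('medallion')['fare_amount']
                  .mean()
                  .reset_index(name='mean_fare'))

# SQL: write the table to an in-memory SQLite database and
# run the equivalent GROUP BY query through pandas.
con = sqlite3.connect(':memory:')
try:
    trips.to_sql('trips', con, index=False)
    by_sql = pd.read_sql_query(
        'SELECT medallion, AVG(fare_amount) AS mean_fare '
        'FROM trips GROUP BY medallion ORDER BY medallion',
        con)
finally:
    con.close()

print(by_pandas)
print(by_sql)
```

Both results contain one row per `medallion` value with its average fare, which illustrates how a pandas grouping maps onto the corresponding SQL clause.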

Let's first import our NYC taxi dataset as in the previous sections.

In [1]: import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn
        %matplotlib inline
        data = pd.read_csv('data/nyc_data.csv',
                           parse_dates=['pickup_datetime',
                                        'dropoff_datetime...

Summary


In this chapter, we covered the basics of data analysis with pandas: loading a dataset, selecting rows and columns, grouping and aggregating quantities, and performing complex operations efficiently.

The next natural step is to conduct statistical analyses: hypothesis testing, modeling, predictions, and so on. Several Python libraries provide such functionality beyond pandas: SciPy, statsmodels, PyMC, and more. The IPython Cookbook contains many advanced examples of such analyses.

In the next chapter, we will introduce NumPy, the library underlying the entire SciPy ecosystem.
