
You're reading from Learning IPython for Interactive Computing and Data Visualization, Second Edition

Product type Book

Published in Oct 2015

Publisher Packt

ISBN-13 9781783986989

Pages 200 pages

Edition 2nd Edition

Languages

Python

Concepts

Scientific Computing

Author (1):

Cyrille Rossant

Table of Contents (13) Chapters

Learning IPython for Interactive Computing and Data Visualization Second Edition

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Getting Started with IPython

Interactive Data Analysis with pandas

Numerical Computing with NumPy

Interactive Plotting and Graphical Interfaces

High-Performance and Parallel Computing

Customizing IPython

Index

Chapter 2. Interactive Data Analysis with pandas

In this chapter, we will cover the following topics:

Exploring a dataset in the Notebook
Manipulating data
Complex operations

We'll see how to load, explore, and visualize a real-world dataset with pandas, matplotlib, and seaborn, all in the Notebook. We will also perform data manipulations efficiently.

Exploring a dataset in the Notebook

Here, we will explore a dataset containing the taxi trips made in New York City in 2013. Maintained by the New York City Taxi and Limousine Commission, this 50GB dataset contains the date, time, geographical coordinates of pickup and dropoff locations, fare, and other information for 170 million taxi trips.

To keep analysis times reasonable, we will work with a subset of this dataset containing 0.5% of all trips (about 850,000 rides). Compressed, this subset is a little less than 100MB. You are free to download and analyze the full dataset (or a larger subset), as explained below.
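As an illustration of what a 0.5% subset means in practice, here is a minimal sketch that draws a random 0.5% sample from a DataFrame with pandas. The table and its trip_id column are made up for the example, and this is not necessarily how the subset used in this chapter was produced.

```python
import pandas as pd

# Hypothetical stand-in for the full table: 20,000 fake trips.
full = pd.DataFrame({"trip_id": range(20_000)})

# Draw a random 0.5% sample; random_state makes the draw reproducible.
subset = full.sample(frac=0.005, random_state=0)

print(len(full), "rows in the full table")
print(len(subset), "rows in the subset")
```

The same proportion applied to roughly 170 million trips gives the 850,000 rides mentioned above.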

Provenance of the data

You will find the data subset we will be using in this chapter at https://github.com/ipython-books/minibook-2nd-data.

If you are interested in the original dataset containing all trips, you can refer to https://github.com/ipython-books/minibook-2nd-code/tree/master/chapter2/cleaning. This page contains the code to download the original dataset...

Manipulating data

Visualizing raw data and computing basic statistics is particularly easy with pandas. All we have to do is choose a couple of columns in a DataFrame and use built-in statistical or visualization functions.
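The following sketch shows the pattern described here on a tiny, made-up table. The column names and values are assumptions chosen for illustration, not the actual NYC taxi data.

```python
import pandas as pd

# Hypothetical miniature trips table (values are invented).
trips = pd.DataFrame({
    "trip_distance": [0.8, 2.5, 1.2, 9.4, 3.1],
    "trip_time_in_secs": [300, 720, 410, 1800, 900],
    "passenger_count": [1, 2, 1, 3, 1],
})

# Choose a couple of columns, then call a built-in statistical method.
summary = trips[["trip_distance", "trip_time_in_secs"]].describe()
print(summary)

# Individual statistics are also available as methods on a column.
mean_distance = trips["trip_distance"].mean()
print(mean_distance)
```

With matplotlib installed, a call such as trips["trip_distance"].hist() would draw a histogram of the same column.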

However, more sophisticated data manipulation methods quickly become necessary as we explore a dataset. In this section, we will first see how to select parts of a DataFrame. Then, we will see how to efficiently apply transformations and computations to columns.
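Before turning to the real dataset, here is a minimal sketch of both ideas on invented data: a row selection with a boolean mask, and a column computation expressed as vectorized arithmetic. The column names, the units, and the speed_mph column are assumptions made for this example.

```python
import pandas as pd

# Hypothetical miniature trips table (values are invented).
trips = pd.DataFrame({
    "trip_distance": [0.8, 2.5, 1.2, 9.4, 3.1],        # assumed to be miles
    "trip_time_in_secs": [300, 720, 410, 1800, 900],
})

# Selection: a boolean mask keeps only the rows where the condition holds.
long_trips = trips.loc[trips["trip_distance"] > 2.0]

# Transformation: arithmetic applies to whole columns at once,
# so no explicit Python loop over the rows is needed.
trips["speed_mph"] = trips["trip_distance"] / (trips["trip_time_in_secs"] / 3600)

print(long_trips)
print(trips)
```

Operating on whole columns is what makes such computations efficient: the loop runs in compiled code rather than in Python.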

We first import the NYC taxi dataset, as in the previous section.

In [1]: import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        %matplotlib inline
        data = pd.read_csv('data/nyc_data.csv', 
                           parse_dates=['pickup_datetime',
                                        'dropoff_datetime'])
        fare = pd.read_csv('data/nyc_fare.csv',
                           parse_dates=['pickup_datetime'])

The data and fare DataFrames are now loaded in the...
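The parse_dates argument used in the loading code above asks pandas to convert the listed columns to datetime values while reading the file. A minimal sketch on an in-memory CSV shows the effect. The pickup_datetime and dropoff_datetime names come from the code above, while the fare_amount column and all values are invented for the example.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the real file (values are invented).
csv_text = """pickup_datetime,dropoff_datetime,fare_amount
2013-01-01 00:00:00,2013-01-01 00:04:00,5.0
2013-01-01 00:05:00,2013-01-01 00:21:00,12.5
"""

trips = pd.read_csv(io.StringIO(csv_text),
                    parse_dates=["pickup_datetime", "dropoff_datetime"])

# The two columns now hold datetime values rather than strings.
print(trips.dtypes)

# Datetime columns support arithmetic, for example a trip duration.
duration = trips["dropoff_datetime"] - trips["pickup_datetime"]
print(duration.dt.total_seconds())
```

Without parse_dates, both columns would be read as plain strings and the subtraction would fail.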

Complex operations

We've seen how to load, select, filter, and operate on data with pandas. In this section, we will show more complex manipulations that are typically done on full-blown databases based on SQL.
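To give a flavor of such database-style manipulations, here is a minimal sketch of two of them in pandas: a group-by aggregation and a join between two tables. The tables, the medallion key, and the other column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Two hypothetical tables sharing a "medallion" key (values are invented).
trips = pd.DataFrame({
    "medallion": ["A", "A", "B", "B", "C"],
    "trip_distance": [1.0, 3.0, 2.0, 4.0, 5.0],
})
fares = pd.DataFrame({
    "medallion": ["A", "B", "C"],
    "total_fare": [20.0, 30.0, 25.0],
})

# Equivalent of a SQL GROUP BY: number of trips and mean distance per key.
per_cab = (trips.groupby("medallion")["trip_distance"]
                .agg(["count", "mean"])
                .reset_index())

# Equivalent of a SQL INNER JOIN on the shared key.
joined = per_cab.merge(fares, on="medallion", how="inner")
print(joined)
```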

Tip

SQL

Structured Query Language is a domain-specific language widely used to manage data in relational database management systems (RDBMS). pandas is somewhat inspired by SQL, which is familiar to many data analysts. Additionally, pandas can connect to SQL databases. You will find more information about the links between pandas and SQL at http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html.
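As a small sketch of how pandas can talk to a SQL database, the example below writes a made-up table to an in-memory SQLite database using Python's standard sqlite3 module, runs a SQL query through pandas, and computes the same result with a pandas group-by. The table name, column names, and values are assumptions for the example.

```python
import sqlite3
import pandas as pd

# Hypothetical table (values are invented).
trips = pd.DataFrame({
    "medallion": ["A", "A", "B", "B", "C"],
    "trip_distance": [1.0, 3.0, 2.0, 4.0, 5.0],
})

# Store the table in a throwaway in-memory SQLite database.
con = sqlite3.connect(":memory:")
trips.to_sql("trips", con, index=False)

# Run a SQL query and get the result back as a DataFrame.
sql_result = pd.read_sql_query(
    "SELECT medallion, AVG(trip_distance) AS mean_distance "
    "FROM trips GROUP BY medallion ORDER BY medallion",
    con,
)
con.close()

# The same aggregation written directly in pandas.
pandas_result = (trips.groupby("medallion", as_index=False)["trip_distance"]
                      .mean()
                      .rename(columns={"trip_distance": "mean_distance"}))

print(sql_result)
print(pandas_result)
```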

Let's first import our NYC taxi dataset as in the previous sections.

In [1]: import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn
        %matplotlib inline
        data = pd.read_csv('data/nyc_data.csv',
                           parse_dates=['pickup_datetime',
                                        'dropoff_datetime...

Summary

In this chapter, we covered the basics of data analysis with pandas: loading a dataset, selecting rows and columns, grouping and aggregating quantities, and performing complex operations efficiently.

The next natural step is to conduct statistical analyses: hypothesis testing, modeling, predictions, and so on. Several Python libraries provide such functionality beyond pandas: SciPy, statsmodels, PyMC, and more. The IPython Cookbook contains many advanced examples of such analyses.

In the next chapter, we will introduce NumPy, the library underlying the entire SciPy ecosystem.
