Chapter 7. Data Analysis Application Examples
In this chapter, we want to get you acquainted with typical data preparation tasks and analysis techniques, because being fluent in preparing, grouping, and reshaping data is an important building block for successful data analysis.
While preparing data seems like a mundane task – and often it is – it is a step we cannot skip, although we can strive to simplify it by using tools such as Pandas.
Why is preparation necessary at all? Because most useful data comes from the real world and has deficiencies: it may contain errors or be fragmentary.
There are more reasons why data preparation is useful: it gets you in close contact with the raw material. Knowing your input helps you to spot potential errors early and build confidence in your results.
Here are a few data preparation scenarios:
The arsenal of tools for data munging is huge, and while we will focus on Python, we want to mention some other useful tools as well. If they are available on your system and you expect to work a lot with data, they are worth learning.
One group of tools belongs to the UNIX tradition, which emphasizes text processing and, as a consequence, has over the last four decades developed many high-performance and battle-tested tools for dealing with text. Some common tools are: sed, grep, awk, sort, uniq, tr, cut, tail, and head. They do very elementary things, such as filtering out lines (grep) or columns (cut) from files, replacing text (sed, tr), or displaying only parts of files (head, tail).
We want to demonstrate the power of these tools with a single example only.
Imagine you are handed the log files of a web server and you are interested in the distribution of the IP addresses.
Each line of the log file contains an entry in the Common Log Format (you can download this data set from...).
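As a sketch of how such a pipeline could look, assuming the client IP address is the first whitespace-separated field of each log line (as in the Common Log Format), the distribution of IP addresses can be computed by chaining a few of the tools mentioned above. The sample log entries below are invented for illustration:

```shell
# Create a tiny sample log in Common Log Format (hypothetical data).
cat > access.log <<'EOF'
10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326
10.0.0.2 - - [10/Oct/2000:13:55:37 -0700] "GET /a HTTP/1.0" 200 100
10.0.0.1 - - [10/Oct/2000:13:55:38 -0700] "GET /b HTTP/1.0" 404 0
EOF

# cut extracts the first field (the IP address), sort groups identical
# addresses together, uniq -c counts each run, and the final sort -rn
# ranks the addresses by frequency, most frequent first.
cut -d ' ' -f 1 access.log | sort | uniq -c | sort -rn
```

The same idea scales to log files of millions of lines, since every stage of the pipeline streams its input.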
As a final topic, we will look at ways to get a condensed view of data with aggregations. Pandas comes with a lot of aggregation functions built in. We already saw the describe function in Chapter 3, Data Analysis with Pandas. It works on parts of the data as well. We start with some artificial data again, containing measurements about the number of sunshine hours per city and date:
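A minimal sketch of such a data set follows; the column names and values here are made up for illustration and are not the book's original data:

```python
import pandas as pd

# Hypothetical measurements: sunshine hours per city and date.
df = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Paris", "Paris"],
    "date": pd.to_datetime(["2000-01-01", "2000-01-02",
                            "2000-01-01", "2000-01-02"]),
    "value": [6, 3, 8, 7],
})

# describe computes count, mean, std, min, the quartiles,
# and max of a numeric column in one call.
print(df["value"].describe())
```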
To view a summary per city, we use the describe function on the grouped data set:
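Continuing with hypothetical sunshine data (the column names are assumptions), a per-city summary might be obtained like this:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Paris", "Paris"],
    "value": [6, 3, 8, 7],
})

# Group the rows by city, then run describe on each group:
# one row of summary statistics per city.
summary = df.groupby("city")["value"].describe()
print(summary)
```

The result is a data frame indexed by city, with one column per summary statistic, so the statistics for a single city can be looked up by label.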
One typical workflow during data exploration looks as follows:
1. You find a criterion that you want to use to group your data. Maybe you have GDP data for every country along with the continent, and you would like to ask questions about the continents.
2. These questions usually lead to some function application; for example, you might want to compute the mean GDP per continent.
3. Finally, you want to store this data for further processing in a new data structure.
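The workflow above can be sketched as follows; the countries, continents, and GDP figures are invented placeholders, not real data:

```python
import pandas as pd

gdp = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "gdp": [100.0, 200.0, 300.0, 500.0],
})

# Step 1: group by the chosen criterion.
by_continent = gdp.groupby("continent")

# Step 2: apply a function to each group.
mean_gdp = by_continent["gdp"].mean()

# Step 3: the result is a new data structure (a Series indexed
# by continent) that can be stored or processed further.
print(mean_gdp)
```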
We use a simpler example here. Imagine some fictional weather data about the number of sunny hours per day and city:
The groups attribute of the grouped object is a dictionary-like mapping from each group key to the row labels belonging to that group.
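As a sketch with made-up weather data, the groups attribute can be inspected like this:

```python
import pandas as pd

# Fictional data: sunny hours per day and city.
df = pd.DataFrame({
    "city": ["Berlin", "Paris", "Berlin"],
    "hours": [5, 8, 3],
})

grouped = df.groupby("city")

# groups maps each group key (here, the city name) to the
# index labels of the rows that fall into that group.
print(grouped.groups)
```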
In this chapter we have looked at ways to manipulate data frames, from cleaning and filtering, to grouping, aggregation, and reshaping. Pandas makes a lot of the common operations very easy and more complex operations, such as pivoting or grouping by multiple attributes, can often be expressed as one-liners as well. Cleaning and preparing data is an essential part of data exploration and analysis.
The next chapter gives a brief overview of machine learning algorithms, which apply data analysis results to make decisions or build helpful products.
Practice exercises
Exercise 1: Cleaning: In the section about filtering, we used the Europe Brent Crude Oil Spot Price, which can be found as an Excel document on the internet. Take this Excel spreadsheet and try to convert it into a CSV document that is ready to be imported with Pandas.
Hint: There are many ways to do this. We used a small tool called xls2csv.py and were able to load the resulting CSV file with a helper method:
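A hypothetical helper along these lines might look as follows; the function name, the column names, and the sample values are assumptions for illustration, and the actual converted file may be laid out differently:

```python
import io
import pandas as pd

def load_prices(csv_source):
    """Read a price CSV with a Date and a Price column into a
    data frame indexed by date. Column names are assumptions."""
    return pd.read_csv(csv_source, parse_dates=["Date"],
                       index_col="Date")

# Usage with an in-memory sample standing in for the converted file;
# the two price rows below are invented placeholder values.
sample = io.StringIO("Date,Price\n2014-01-02,107.94\n2014-01-03,107.78\n")
df = load_prices(sample)
print(df.head())
```

Parsing the dates at load time and using them as the index makes time-based slicing and resampling straightforward later on.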