Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Getting Started with Python Data Analysis

You're reading from  Getting Started with Python Data Analysis

Product type Book
Published in Nov 2015
Publisher
ISBN-13 9781785285110
Pages 188 pages
Edition 1st Edition
Languages

Table of Contents (15) Chapters

Getting Started with Python Data Analysis
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Introducing Data Analysis and Libraries NumPy Arrays and Vectorized Computation Data Analysis with Pandas Data Visualization Time Series Interacting with Databases Data Analysis Application Examples Machine Learning Models with scikit-learn Index

Chapter 7. Data Analysis Application Examples

In this chapter, we want to get you acquainted with typical data preparation tasks and analysis techniques, because being fluent in preparing, grouping, and reshaping data is an important building block for successful data analysis.

While preparing data seems like a mundane task – and often it is – it is a step we cannot skip, although we can strive to simplify it by using tools such as Pandas.

Why is preparation necessary at all? Because most useful data will come from the real world and will have deficiencies, contain errors or will be fragmentary.

There are more reasons why data preparation is useful: it gets you in close contact with the raw material. Knowing your input helps you to spot potential errors early and build confidence in your results.

Here are a few data preparation scenarios:

  • A client hands you three files, each containing time series data about a single geological phenomenon, but the observed data is recorded on different intervals...

Data munging


The arsenal of tools for data munging is huge, and while we will focus on Python we want to mention some useful tools as well. If they are available on your system and you expect to work a lot with data, they are worth learning.

One group of tools belongs to the UNIX tradition, which emphasizes text processing and as a consequence has, over the last four decades, developed many high-performance and battle-tested tools for dealing with text. Some common tools are: sed, grep, awk, sort, uniq, tr, cut, tail, and head. They do very elementary things, such as filtering out lines (grep) or columns (cut) from files, replacing text (sed, tr) or displaying only parts of files (head, tail).

We want to demonstrate the power of these tools with a single example only.

Imagine you are handed the log files of a web server and you are interested in the distribution of the IP addresses.

Each line of the log file contains an entry in the common log server format (you can download this data set from...

Data aggregation


As a final topic, we will look at ways to get a condensed view of data with aggregations. Pandas comes with a lot of aggregation functions built-in. We already saw the describe function in Chapter 3, Data Analysis with Pandas. This works on parts of the data as well. We start with some artificial data again, containing measurements about the number of sunshine hours per city and date:

>>> df.head()
   country     city        date  hours
0  Germany  Hamburg  2015-06-01      8
1  Germany  Hamburg  2015-06-02     10
2  Germany  Hamburg  2015-06-03      9
3  Germany  Hamburg  2015-06-04      7
4  Germany  Hamburg  2015-06-05      3

To view a summary per city, we use the describe function on the grouped data set:

>>> df.groupby("city").describe()
                      hours
city
Berlin     count  10.000000
           mean    6.000000
           std     3.741657
           min     0.000000
           25%     4.000000
           50%     6.000000
           75...

Grouping data


One typical workflow during data exploration looks as follows:

  • You find a criterion that you want to use to group your data. Maybe you have GDP data for every country along with the continent and you would like to ask questions about the continents. These questions usually lead to some function applications- you might want to compute the mean GDP per continent. Finally, you want to store this data for further processing in a new data structure.

  • We use a simpler example here. Imagine some fictional weather data about the number of sunny hours per day and city:

    >>> df
              date    city  value
    0   2000-01-03  London      6
    1   2000-01-04  London      3
    2   2000-01-05  London      4
    3   2000-01-03  Mexico      3
    4   2000-01-04  Mexico      9
    5   2000-01-05  Mexico      8
    6   2000-01-03  Mumbai     12
    7   2000-01-04  Mumbai      9
    8   2000-01-05  Mumbai      8
    9   2000-01-03   Tokyo      5
    10  2000-01-04   Tokyo      5
    11  2000-01-05   Tokyo      6
    
  • The groups attributes...

Summary


In this chapter we have looked at ways to manipulate data frames, from cleaning and filtering, to grouping, aggregation, and reshaping. Pandas makes a lot of the common operations very easy and more complex operations, such as pivoting or grouping by multiple attributes, can often be expressed as one-liners as well. Cleaning and preparing data is an essential part of data exploration and analysis.

The next chapter explains a brief of machine learning algorithms that is applying data analysis result to make decisions or build helpful products.

Practice exercises

Exercise 1: Cleaning: In the section about filtering, we used the Europe Brent Crude Oil Spot Price, which can be found as an Excel document on the internet. Take this Excel spreadsheet and try to convert it into a CSV document that is ready to be imported with Pandas.

Hint: There are many ways to do this. We used a small tool called xls2csv.py and we were able to load the resulting CSV file with a helper method:

import datetime...
lock icon The rest of the chapter is locked
You have been reading a chapter from
Getting Started with Python Data Analysis
Published in: Nov 2015 Publisher: ISBN-13: 9781785285110
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}