Chapter 7. Data Analysis Application Examples
In this chapter, we want to get you acquainted with typical data preparation tasks and analysis techniques, because being fluent in preparing, grouping, and reshaping data is an important building block for successful data analysis.
While preparing data seems like a mundane task – and often it is – it is a step we cannot skip, although we can strive to simplify it by using tools such as Pandas.
Why is preparation necessary at all? Because most useful data comes from the real world and has deficiencies: it may contain errors or be fragmentary.
There are more reasons why data preparation is useful: it gets you in close contact with the raw material. Knowing your input helps you to spot potential errors early and build confidence in your results.
Here are a few data preparation scenarios:
The arsenal of tools for data munging is huge, and while we will focus on Python, we want to mention some other useful tools as well. If they are available on your system and you expect to work a lot with data, they are worth learning.
One group of tools belongs to the UNIX tradition, which emphasizes text processing and, as a consequence, has over the last four decades developed many high-performance and battle-tested tools for dealing with text. Some common tools are: sed, grep, awk, sort, uniq, tr, cut, tail, and head. They do very elementary things, such as filtering out lines (grep) or columns (cut) from files, replacing text (sed, tr), or displaying only parts of files (head, tail).
We want to demonstrate the power of these tools with a single example only.
Imagine you are handed the log files of a web server and you are interested in the distribution of the IP addresses.
Each line of the log file contains an entry in the Common Log Format (you can download this data set from...).
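As a sketch of how such a pipeline could look, assuming the client IP address is the first whitespace-separated field of each log line (as in the Common Log Format), the distribution of IP addresses can be computed by chaining a few of the tools mentioned above. The sample log entries below are invented for illustration:

```shell
# Create a tiny sample log in Common Log Format (hypothetical data).
cat > access.log <<'EOF'
10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326
10.0.0.2 - - [10/Oct/2000:13:55:37 -0700] "GET /a HTTP/1.0" 200 100
10.0.0.1 - - [10/Oct/2000:13:55:38 -0700] "GET /b HTTP/1.0" 404 0
EOF

# cut extracts the first field (the IP address), sort groups identical
# addresses together, uniq -c counts each run, and the final sort -rn
# ranks the addresses by frequency, most frequent first.
cut -d ' ' -f 1 access.log | sort | uniq -c | sort -rn
```

The same idea scales to log files of millions of lines, since every stage of the pipeline streams its input.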
As a final topic, we will look at ways to get a condensed view of data with aggregations. Pandas comes with a lot of aggregation functions built in. We already saw the describe function in Chapter 3, Data Analysis with Pandas. It works on parts of the data as well. We start with some artificial data again, containing measurements about the number of sunshine hours per city and date:
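A minimal sketch of such a data set follows; the column names and values here are made up for illustration and are not the book's original data:

```python
import pandas as pd

# Hypothetical measurements: sunshine hours per city and date.
df = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Paris", "Paris"],
    "date": pd.to_datetime(["2000-01-01", "2000-01-02",
                            "2000-01-01", "2000-01-02"]),
    "value": [6, 3, 8, 7],
})

# describe computes count, mean, std, min, the quartiles,
# and max of a numeric column in one call.
print(df["value"].describe())
```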
To view a summary per city, we use the describe function on the grouped data set:
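Continuing with hypothetical sunshine data (the column names are assumptions), a per-city summary might be obtained like this:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Paris", "Paris"],
    "value": [6, 3, 8, 7],
})

# Group the rows by city, then run describe on each group:
# one row of summary statistics per city.
summary = df.groupby("city")["value"].describe()
print(summary)
```

The result is a data frame indexed by city, with one column per summary statistic, so the statistics for a single city can be looked up by label.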
One typical workflow during data exploration looks as follows:
1. You find a criterion that you want to use to group your data. Maybe you have GDP data for every country along with the continent, and you would like to ask questions about the continents.
2. These questions usually lead to some function application; for example, you might want to compute the mean GDP per continent.
3. Finally, you want to store this data for further processing in a new data structure.
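The workflow above can be sketched as follows; the countries, continents, and GDP figures are invented placeholders, not real data:

```python
import pandas as pd

gdp = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "gdp": [100.0, 200.0, 300.0, 500.0],
})

# Step 1: group by the chosen criterion.
by_continent = gdp.groupby("continent")

# Step 2: apply a function to each group.
mean_gdp = by_continent["gdp"].mean()

# Step 3: the result is a new data structure (a Series indexed
# by continent) that can be stored or processed further.
print(mean_gdp)
```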
We use a simpler example here. Imagine some fictional weather data about the number of sunny hours per day and city:
The groups attribute of the grouped object is a dictionary-like mapping from each group key to the row labels belonging to that group.
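As a sketch with made-up weather data, the groups attribute can be inspected like this:

```python
import pandas as pd

# Fictional data: sunny hours per day and city.
df = pd.DataFrame({
    "city": ["Berlin", "Paris", "Berlin"],
    "hours": [5, 8, 3],
})

grouped = df.groupby("city")

# groups maps each group key (here, the city name) to the
# index labels of the rows that fall into that group.
print(grouped.groups)
```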
In this chapter we have looked at ways to manipulate data frames, from cleaning and filtering, to grouping, aggregation, and reshaping. Pandas makes a lot of the common operations very easy and more complex operations, such as pivoting or grouping by multiple attributes, can often be expressed as one-liners as well. Cleaning and preparing data is an essential part of data exploration and analysis.
The next chapter gives a brief overview of machine learning algorithms, which apply data analysis results to make decisions or build helpful products.
Practice exercises
Exercise 1: Cleaning: In the section about filtering, we used the Europe Brent Crude Oil Spot Price, which can be found as an Excel document on the internet. Take this Excel spreadsheet and try to convert it into a CSV document that is ready to be imported with Pandas.
Hint: There are many ways to do this. We used a small tool called xls2csv.py and were able to load the resulting CSV file with a helper method:
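A hypothetical helper along these lines might look as follows; the function name, the column names, and the sample values are assumptions for illustration, and the actual converted file may be laid out differently:

```python
import io
import pandas as pd

def load_prices(csv_source):
    """Read a price CSV with a Date and a Price column into a
    data frame indexed by date. Column names are assumptions."""
    return pd.read_csv(csv_source, parse_dates=["Date"],
                       index_col="Date")

# Usage with an in-memory sample standing in for the converted file;
# the two price rows below are invented placeholder values.
sample = io.StringIO("Date,Price\n2014-01-02,107.94\n2014-01-03,107.78\n")
df = load_prices(sample)
print(df.head())
```

Parsing the dates at load time and using them as the index makes time-based slicing and resampling straightforward later on.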