Reader small image

You're reading from  Scientific Computing with Python - Second Edition

Product typeBook
Published inJul 2021
Reading LevelIntermediate
PublisherPackt
ISBN-139781838822323
Edition2nd Edition
Languages
Right arrow
Authors (3):
Claus Führer
Claus Führer
author image
Claus Führer

Claus Führer is a professor of scientific computations at Lund University, Sweden. He has an extensive teaching record that includes intensive programming courses in numerical analysis and engineering mathematics across various levels in many different countries and teaching environments. Claus also develops numerical software in research collaboration with industry and received Lund University's Faculty of Engineering Best Teacher Award in 2016.
Read more about Claus Führer

View More author details
Right arrow
Series and Dataframes - Working with Pandas

In this chapter, we give a brief introduction to pandas—the central tool in Python for data analysis and data handling. You will learn how to work with various time series in Python, the concept of dataframes, and how to access and visualize data. You will also find examples that demonstrate how smoothly pandas interacts with the other central modules in this book, namely NumPy and Matplotlib.

But please note, this chapter can, within the scope of this book, only serve as an appetizer. Its purpose is to equip you with the basic concepts. The full range of visualization, data analysis, and data conversion tools in pandas is impressive.

pandas offers many ways of importing data. Some of them will be presented together with guiding examples throughout this chapter.

The following topics will be covered in this chapter:

  • A guiding...

10. 1 A guiding example: Solar cells

To describe pandas in the best way, we need data. Thus, in this chapter, we will use production data from solar cell panels on the roof of a private house in the south of Sweden.

In the file solarWatts.dat there is data about the electricity production in watts per minute. A semicolon is used as a data separator and the first line in the file is a header line, explaining the content of the data columns:

Date;Watt
:
2019-10-20 08:22:00 ; 44.0
2019-10-20 08:23:00 ; 61.0
2019-10-20 08:24:00 ; 42.0
:

In another file, price.dat, we find the hourly electricity production price in Swedish crowns. The file is otherwise organized as before:

Date;SEK
2019-10-20 01:00 ; 0.32
2019-10-20 02:00 ; 0.28
2019-10-20 03:00 ; 0.29
:

Finally, in a third file, rates.dat, we find the daily conversion rates from Swedish crowns to Euros (€):

Date;Euro_SEK
2019-10-21 ; 10.7311
2019-10-22 ; 10.7303
2019-10-23 ; 10.7385
:

We want to...

10.2 NumPy arrays and pandas dataframes

Let's start by just looking at an example of a  NumPy array:

A=array( [[ 1., 2., 3.],
[4., 5., 6.]])

It is displayed as:

[[1. 2. 3.]
[4. 5. 6.]]

And its elements are accessed by using indexes generated simply by counting rows and columns, for example, A[0,1].

This matrix can be converted to the pandas datatype DataFrame by keeping the same data and order but representing and accessing it in a different way:

import pandas as pd
A=array( [[ 1., 2., 3.],
[ 4., 5., 6.]] )
AF = pd.DataFrame(A)

This DataFrame object, which we will explain in more detail in this chapter, is displayed as


0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 6.0

We see that a pandas dataframe has extra labels for the rows and columns called index and columns. These are the metadata of a dataframe.

Here, they coincide with NumPy's indexing, but that is not always so. The index and columns metadata allows the pandas dataframe to...

10.2.1 Indexing rules

Similar to the way dictionaries use keys to address a value, pandas dataframes use row labels—the dataframe indexand column labels to access individual values:

AF.loc['R1', 'C2']      # this returns 2.0 

Or to generate a subframe:

AF.loc[['R1','R2'],['C1','C2']]

Resulting in:

   C1   C2
R1 1 2.0
R2 4 5.0

You can also address a complete row by using the index label:

AF.loc['R1']

This returns a pandas Series object:

C1    1.0
C2 2.0
C3 3.0
Name: R1, dtype: float64

If loc or iloc are called with list arguments or slices, the result is a dataframe.

In that way, an individual dataframe element can also be addressed as follows:

AF.loc['R1'].loc['C1']    # returns 1.0

An entire column is addressed directly as follows:

AF['C1']

This again returns a pandas Series object:

R1    1.0
R2 4.0
Name: C1, dtype: float64

Alternatively...

10.3 Creating and modifying dataframes

We return now to the solar cell data and explain how to create a dataframe from a data file. The format of the file with the given data is CSV. Each line in the file contains one data record with a comma or another character string as a data separator. Here, we used a semicolon as a separator because in many countries the comma is used instead of a decimal separator.

10.3.1 Creating a dataframe from imported data

We want to organize the dataframe in such a way that the dates are used as the index of the dataframe. To be better prepared for operating with dates, we also want that the data import process automatically converts date strings to a pandas Timestamp object. Finally, you might have noted that the way the date is written in the data files is in the ISO-format YY-MM-DD format and not in the American MM-DD-YY or the European DD-MM-YY format. We can put on our wishlist that pandas automatically recognizes the date format and performs the correct conversion:

solarWatts = pd.read_csv("solarWatts.dat", 
sep=';',
index_col='Date',
parse_dates=[0], infer_datetime_format=True)

The pandas command read_csv is the central tool. It has many more parameters than we used here and carefully studying their functionalities saves a lot of programming...

10.3.2 Setting the index

The default index of a dataframe is the row numbers. These are generated automatically when a dataframe is created and no index is specified. Here is an example.

We create a dataframe from a list of lists:

towns=[['Stockholm', 'Sweden', 188,975904],
['Malmö', 'Sweden', 322, 316588],
['Oslo', 'Norway', 481, 693491],
['Bergen', 'Norway', 464, 28392]]
town=pd.DataFrame(towns, columns=['City','Country','area','population'])

This produces a dataframe with rows labeled by their row numbers:

        City Country  area  population
0 Stockholm Sweden 188 975904
1 Malmö Sweden 322 316588
2 Oslo Norway 481 693491
3 Bergen Norway 464 28392

We can change this by choosing a column to be the index. The column can be duplicatedone serving as the index and the other belonging to the...

10.3.3 Deleting entries

Entries in a dataframe are deleted by the method drop.

Again, we use the dataframe from the previous section:

town=pd.DataFrame(towns, columns=['City','Country','area','population'])
town.set_index('City', inplace=True)

An entire row is deleted by:

town.drop('Bergen', axis=0)

The parameter axis specifies here that we look for a row. Deleting a row requires the column label and the correct parameter axis:

town.drop('area', axis=1)

10.3.4 Merging dataframes

From the three datafiles we provided for this chapter we used the first one, solarwatts.dat, to set up a dataframe solarWatts; see Section 10.3.1, Creating a dataframe from imported data. In a similar way, we can create dataframes price and rates from the other two files.

We show now how to merge these three dataframes into one and to treat rows with missing data in the resulting dataframe.

First, we merge solarWatts with price. For this, we use the pandas command merge:

solar_all=pd.merge(solarWatts, price, how='outer', sort=True, on='Date')
solar_all=pd.merge(solar_all, rates, how='outer', sort=True, on='Date')

It sets the column Date, which exists in both dataframes as the index of the new frame. The parameter how defines how to set up the new index column. By specifying outer we decided to choose the union of both index columns. Finally, we want to sort the index.

As solarWatts has data for every minute and...

10.3.5 Missing data in a dataframe

We saw in the last section that missing data is often indicated by NaN. The way missing data is indicated depends on the datatype of the column. Missing timestamps are indicated by the pandas object NaT, while missing data with another non-numeric datatype is indicated by None.

The dataframe method isnull returns a Boolean dataframe with the entry True at all places with missing data.

We will study various methods for treating missing data before returning to the solar cell data example.

Let's demonstrate these methods on a small dataframe:

frame = pd.DataFrame(array([[1., -5.,  3., NaN], 
[3., 4., NaN, 17.],
[6., 8., 11., 7.]]),
columns=['a','b','c','d'])

This dataframe is displayed as:

     a    b    c     d
0 1.0 -5.0 3.0 NaN
1 3.0 4.0 NaN 17.0
2 6.0 8.0 11.0 7.0

Dataframes with missing data can be handled in different...

10.4 Working with dataframes

So far, we have seen how to create and modify a dataframe. Now, we turn to data interpretation. We will look at examples of visualization, show how to make simple calculations, and see how to group data. These are all stepping stones into the world of pandas. The strength of this module is in its large range of statistical tools. We leave the presentation of these tools to textbooks on practical statistics, while we focus here on the basic principles of pandas programming. We do not aim for completeness. Again, let's serve an appetizer.

10.4.1 Plotting from dataframes

To demonstrate the plotting functionality, we plot the energy price changes on May 16, 2020. For this, we construct a subframe of the data from that day:

solar_all.loc['2020-05-16'].plot(y='SEK')

You can see that we indexed here with the full day. This is a short form of slicing:

solar_all.loc['2020-05-16 00:00':'2020-05-16 23:59']

The resulting plot, Figure 10.1, shows the hourly variation of electricity prices in Swedish crowns on a typical day of the year.

Figure 10.1: Plotting one column of a dataframe; the hourly price in Swedish crowns (SEK) per kWh on May 16, 2020

pandas' plot command is built upon plot from the module matplotlib.pyplot, which we met in Chapter 6, Plotting.

It accepts the same parameters, for example, linestyle or marker.

The data for the x axis is taken from the dataframe index if not otherwise specified. Alternatively, you can plot one dataframe column versus another.

Line...

10.4.2 Calculations within dataframes

We can do simple calculations on dataframe columns by applying functions on every element of the column, that is, elementwise application of functions. These functions can be built-in Python functions, NumPy functions, or user-defined functions, such as lambda functions (see Section 7.7, Anonymous functions).

The simplest way is to operate on the columns directly. In the following example, we convert watts into kilowatts and Swedish crowns (SEK) into Euros by using the conversion rate, which was the actual rate on the day of the measurement:

solar_converted=pd.DataFrame()
solar_converted['kW']=solar_all['Watt']/1000
solar_converted['Euro']=solar_all['SEK']/solar_all['Euro_SEK']

Tacitly, we also adjusted the column labels to the converted units.

The command solar_converted.loc['2020-07-01 7:00':'2020-07-01 7:04'] then returns the converted data for July...

10.4.3 Grouping data

The ability to group data is one of the essential features for pandas' dataframes. In the solar cell example, you saw that we had a data frequency of one measurement per minute. What if you want to report on an hourly or daily basis instead? We just form groups and aggregate the data in a prescribed way.

The following example forms a new dataframe with the two columns labeled Watt and SEK reporting the peak solar cell power per day and the average price in SEK:

solar_day=solar_all.groupby(solar_all.index.date).agg({'Watt':'max', 
'SEK':'mean'})

Again, we can visualize the results by using the dataframe method plot:

solar_day.index=pd.to_datetime(solar_day.index,format='%Y-%m-%d')
ax=solar_day.loc['2020-06-01':'2020-06-30'].plot.bar('Watt')

Note, we created an axis object, ax, in order to change the tick labels on the ...

10.5 Summary

In this chapter, you got a brief introduction to pandas and saw how the concept of NumPy arrays is extended to dataframes. Instead of a never-complete explanation of the possibilities of dataframes, we guided you using an example of solar cell energy data through the first steps with pandas: setting up a dataframe from a file, merging frames, grouping data, and making calculations.

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Scientific Computing with Python - Second Edition
Published in: Jul 2021Publisher: PacktISBN-13: 9781838822323
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (3)

author image
Claus Führer

Claus Führer is a professor of scientific computations at Lund University, Sweden. He has an extensive teaching record that includes intensive programming courses in numerical analysis and engineering mathematics across various levels in many different countries and teaching environments. Claus also develops numerical software in research collaboration with industry and received Lund University's Faculty of Engineering Best Teacher Award in 2016.
Read more about Claus Führer