Time Series Analysis with Python Cookbook

By Tarek A. Atwan
    What do you get with a Packt Subscription?

  • Instant access to this title and 7,500+ eBooks & Videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Free Chapter
    Chapter 2: Reading Time Series Data from Files
About this book

Time series data is everywhere, available at a high frequency and volume. It is complex and can contain noise, irregularities, and multiple patterns, making it crucial to be well-versed with the techniques covered in this book for data preparation, analysis, and forecasting.

This book covers practical techniques for working with time series data, starting with ingesting time series data from various sources and formats, whether in private cloud storage, relational databases, non-relational databases, or specialized time series databases such as InfluxDB. Next, you’ll learn strategies for handling missing data, dealing with time zones and custom business days, and detecting anomalies using intuitive statistical methods, followed by more advanced unsupervised ML models. The book will also explore forecasting using classical statistical models such as Holt-Winters, SARIMA, and VAR. The recipes will present practical techniques for handling non-stationary data, using power transforms, ACF and PACF plots, and decomposing time series data with multiple seasonal patterns. Later, you’ll work with ML and DL models using TensorFlow and PyTorch.

Finally, you’ll learn how to evaluate, compare, optimize models, and more using the recipes covered in the book.

Publication date:
June 2022
Publisher
Packt
Pages
630
ISBN
9781801075541

 

Chapter 2: Reading Time Series Data from Files

In this chapter, we will use pandas, a popular Python library with a rich set of I/O tools, data wrangling, and date/time functionality to streamline working with time series data. In addition, you will explore several reader functions available in pandas to ingest data from different file types, such as Comma-Separated Value (CSV), Excel, and SAS. You will explore reading from files, whether they are stored locally on your drive or remotely on the cloud, such as an AWS S3 bucket.

Time series data is complex and can be in different shapes and formats. Conveniently, the pandas reader functions offer a vast number of arguments (parameters) to help handle such variety in the data.

The pandas library provides two fundamental data structures, Series and DataFrame, implemented as classes. The DataFrame class is a distinct data structure for working with tabular data (think rows and columns in a spreadsheet). The main difference between the two data structures is that a Series is one-dimensional (single column), and a DataFrame is two-dimensional (multiple columns). The relationship between the two is that you get a Series when you slice out a column from a DataFrame. You can think of a DataFrame as a side-by-side concatenation of two or more Series objects.

A particular feature of the Series and DataFrames data structures is that they both have a labeled axis called index. A specific type of index that you will often see with time series data is the DatetimeIndex which you will explore further in this chapter. Generally, the index makes slicing and dicing operations very intuitive. For example, to make a DataFrame ready for time series analysis, you will learn how to create DataFrames with an index of type DatetimeIndex.

We will cover the following recipes on how to ingest data into a pandas DataFrame:

  • Reading data from CSVs and other delimited files
  • Reading data from an Excel file
  • Reading data from URLs
  • Reading data from a SAS dataset

    Why DatetimeIndex?

    A pandas DataFrame with an index of type DatetimeIndex unlocks a large set of features and useful functions needed when working with time series data. You can think of it as adding a layer of intelligence or awareness to pandas to treat the DataFrame as a time series DataFrame.

 

Technical requirements

In this chapter and forward, we will be using pandas 1.4.2 (released April 2, 2022) extensively.

Throughout our journey, you will be installing additional Python libraries to use in conjunction with pandas. You can download the Jupyter notebooks from the GitHub repository (https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./blob/main/code/Ch2/Chapter%202.ipynb) to follow along.

You can download the datasets used in this chapter from the GitHub repository using this link: https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./tree/main/datasets/Ch2.

 

Reading data from CSVs and other delimited files

In this recipe, you will use the pandas.read_csv() function, which offers a large set of parameters that you will explore to ensure the data is properly read into a time series DataFrame. In addition, you will learn how to specify an index column, parse the index to be of the type DatetimeIndex, and parse string columns that contain dates into datetime objects.

Generally, using Python, data read from a CSV file will be in string format (text). When using the read_csv method in pandas, it will try and infer the appropriate data types (dtype), and, in most cases, it does a great job at that. However, there are situations where you will need to explicitly indicate which columns to cast to a specific data type. For example, you will specify which column(s) to parse as dates using the parse_dates parameter in this recipe.

Getting ready

You will be reading a CSV file that contains hypothetical box office numbers for a movie. The file is provided in the GitHub repository for this book. The data file is in datasets/Ch2/movieboxoffice.csv.

How to do it…

You will ingest our CSV file using pandas and leverage some of the available parameters in read_csv:

  1. First, let's load the libraries:
    import pandas as pd
    from pathlib import Path
  2. Create a Path object for the file location:
    filepath =\
     Path('../../datasets/Ch2/movieboxoffice.csv')
  3. Read the CSV file into a DataFrame using the read_csv function and passing the filepath with additional parameters.

The first column in the CSV file contains movie release dates, and it needs to be set as an index of type DatetimeIndex (index_col=0 and parse_dates=['Date']). Specify which columns you want to include by providing a list of column names to usecols. The default behavior is that the first row includes the header (header=0):

ts = pd.read_csv(filepath,
                 header=0,
                 parse_dates=['Date'],
                 index_col=0,
                 infer_datetime_format=True,
                 usecols=['Date',
                          'DOW',
                          'Daily',
                          'Forecast',
                          'Percent Diff'])
ts.head(5)

This will output the following first five rows:

Figure 2.1 – The first five rows of the ts DataFrame using JupyterLab

Figure 2.1 – The first five rows of the ts DataFrame using JupyterLab

  1. Print a summary of the DataFrame to check the index and column data types:
    ts.info()
    >> <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 128 entries, 2021-04-26 to 2021-08-31
    Data columns (total 4 columns):
     #   Column        Non-Null Count  Dtype
    ---  ------        --------------  -----
     0   DOW           128 non-null    object
     1   Daily         128 non-null    object
     2   Forecast      128 non-null    object
     3   Percent Diff  128 non-null    object
    dtypes: object(4)
    memory usage: 5.0+ KB
  2. Notice that the Date column is now an index (not a column) of type DatetimeIndex. Additionally, both Daily and Forecast columns have the wrong dtype inference. You would expect them to be of type float. The issue is due to the source CSV file containing dollar signs ($) and thousand separators (,) in both columns. The presence of non-numeric characters will cause the columns to be interpreted as strings. A column with the dtype object indicates either a string column or a column with mixed dtypes (not homogeneous).

To fix this, you need to remove both the dollar sign ($) and thousand separators (,) or any non-numeric character. You can accomplish this using str.replace(), which can take a regular expression to remove all non-numeric characters but exclude the period (.) for the decimal place. Removing these characters does not convert the dtype, so you will need to cast those two columns as a float dtype using .astype(float):

clean = lambda x: x.str.replace('[^\d]','', regex=True)
c_df = ts[['Daily', 'Forecast']].apply(clean, axis=1)
ts[['Daily', 'Forecast']] = c_df.astype(float)

Print a summary of the updated DataFrame:

ts.info()
>> <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 128 entries, 2021-04-26 to 2021-08-31
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   DOW           128 non-null    object
 1   Daily         128 non-null    float64
 2   Forecast      128 non-null    float64
 3   Percent Diff  128 non-null    object
dtypes: float64(2), object(2)
memory usage: 5.0+ KB

Now, you have a DataFrame with DatetimeIndex and both Daily and Forecast columns are of dtype float64 (numeric fields).

How it works…

Using pandas for data transformation is fast since it loads the data into memory. For example, the read_csv method reads and loads the entire data into a DataFrame in memory. When requesting a DataFrame summary with the info() method, in addition to column and index data types, the output will display memory usage for the entire DataFrame. To get the exact memory usage for each column, including the index, you can use the memory_usage() method:

ts.memory_usage()
>> 
Index           1024
DOW             1024
Daily           1024
Forecast        1024
Percent Diff    1024
dtype: int64

The total will match what was provided in the DataFrame summary:

ts.memory_usage().sum()
>> 5120

So far, you have used a few of the available parameters when reading a CSV file using read_csv. The more familiar you become with the different options available in any of the pandas reader functions, the more upfront preprocessing you can do during data ingestion (reading).

You leveraged the built-in parse_dates argument, which takes in a list of columns (either specified by name or position).The combination of index_col=0 and parse_dates=[0] produced a DataFrame with an index of type DatetimeIndex.

Let's inspect the parameters used in this recipe as defined in the official pandas.read_csv() documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):

  • filepath_or_buffer: This is the first positional argument and the only required field needed (at a minimum) to read a CSV file. Here, you passed the Python path object named filepath. This can also be a string that represents a valid file path such as '../../datasets/Ch2/movieboxoffice.csv' or a URL that points to a remote file location, such as an AWS S3 bucket (we will examine this later in the Reading data from URLs recipe in this chapter).
  • sep: This takes a string to specify which delimiter to use. The default is a comma delimiter (,) which assumes a CSV file. If the file is separated by another delimiter, such as a pipe (|) or semicolon (;), then the argument can be updated, such as sep="|" or sep=";".

Another alias to sep is delimiter, which can be used as well as a parameter name.

  • header: In this case, you specified that the first row (0) value contains the header information. The default value is infer, which usually works as-is in most cases. If the CSV does not contain a header, then you specify header=None. If the CSV has a header but you prefer to supply custom column names, then you need to specify header=0 and overwrite it by providing a list of new column names to the names argument.

Recall that you specified which columns to include by passing a list of column names to the usecols parameter. These names are based on the file header (the first row of the CSV file).

If you decide to provide custom header names, you cannot reference the original names in the use_cols parameter; this will produce the following error: ValueError: Usecols do not match columns.

  • parse_dates: In the recipe, you provided a list of column positions using [0], which specified only the first column (by position) should be parsed. The parse_dates argument can take a list of column names, such as ["Date"], or a list of column positions, such as [0, 3], indicating the first and the fourth columns. If you only intend to parse the index column(s) specified in the index_col parameter, you only need to pass True (Boolean).
  • index_cols: Here, you specified that the first column by position (index_col=0) will be used as the DataFrame index. Alternatively, you could provide the column name as a string (index_col='Date'). The parameter also takes in a list of integers (positional indices) or strings (column names), which would create a MultiIndex object.
  • usecols: The default value is None, which includes all the columns in the dataset. Limiting the number of columns to only those that are required results in faster parsing and overall lower memory usage, since you only bring in what is needed. The usecols arguments can take a list of column names, such as ['Date', 'DOW', 'Daily', 'Percent Diff', 'Forecast'] or a list of positional indices, such as [0, 1, 3, 7, 6], which would produce the same result.

There's more…

There are situations where parse_dates may not work (it just cannot parse the date). In such cases, the column(s) will be returned unchanged, and no error will be thrown. This is where the date_parser parameter can be useful.

For example, you can pass a lambda function that uses the to_datetime function in pandas to date_parser. You can specify the string representation for the date format inside to_datetime(), as demonstrated in the following code:

date_parser =lambda x: pd.to_datetime(x, format="%d-%b-%y")
ts = pd.read_csv(filepath,
                 parse_dates=[0],
                 index_col=0,
                 date_parser=date_parser,
                 usecols=[0,1,3, 7, 6])
ts.head()

The preceding code will print out the first five rows of the ts DataFrame displaying a properly parsed Date index.

Figure 2.2 – The first five rows of the ts DataFrame using JupyterLab

Figure 2.2 – The first five rows of the ts DataFrame using JupyterLab

Let's break it down. In the preceding code, you passed two arguments to the to_datetime function: the object to convert to datetime and an explicit format string. Since the date is stored as a string in the form 26-Apr-21, you passed "%d-%b-%y" to reflect that:

  • %d represents the day of the month, such as 01 or 02.
  • %b represents the abbreviated month name, such as Apr or May.
  • %y represents a two-digit year, such as 19 or 20.

Other common string codes include the following:

  • %Y represents the year as a four-digit number, such as 2020 or 2021.
  • %B represents the month's full name, such as January or February.
  • %m represents the month as a two-digit number, such as 01 or 02.

For more information on Python's string formats for representing dates, visit https://strftime.org.

See also

According to the pandas documentation, the infer_datetime_format parameter in read_csv() function can speed up the parsing by 5–10x. This is how you can add this to our original script:

ts = pd.read_csv(filepath,
                 header=0,
                 parse_dates=[0],
                 index_col=0,
                 infer_datetime_format= True,
                 usecols=['Date',
                          'DOW',
                          'Daily',
                          'Forecast',
                          'Percent Diff'])

Note that given the dataset is small, the speed improvement may be insignificant.

For more information, please refer to the pandas.read_csv documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html.

 

Reading data from an Excel file

To read data from an Excel file, you will need to use a different reader function from pandas. Generally, working with Excel files can be a challenge since the file can contain formatted multi-line headers, merged header cells, and images. They may also contain multiple worksheets with custom names (labels). Therefore, it is vital that you always inspect the Excel file first. The most common scenario is reading from an Excel file that contains data partitioned into multiple sheets, which is the focus of this recipe.

In this recipe, you will be using the pandas.read_excel() function and examining the various parameters available to ensure the data is read properly as a DataFrame with a DatetimeIndex for time series analysis. In addition, you will explore different options to read Excel files with multiple sheets.

Getting ready

To use pandas.read_excel(), you will need to install an additional library for reading and writing Excel files. In the read_excel() function, you will use the engine parameter to specify which library (engine) to use for processing an Excel file. Depending on the Excel file extension you are working with (for example, .xls or .xlsx), you may need to specify a different engine that may require installing an additional library.

The supported libraries (engines) for reading and writing Excel include xlrd, openpyxl, odf, and pyxlsb. When working with Excel files, the two most common libraries are usually xlrd and openpyxl.

The xlrd library only supports .xls files. So, if you are working with an older Excel format, such as .xls, then xlrd will do just fine. For newer Excel formats, such as .xlsx, we will need a different engine, and in this case, openpyxl would be the recommendation to go with.

To install openpyxl using conda, run the following command in the terminal:

>>> conda install openpyxl

To install using pip, run the following command:

>>> pip install openpyxl

We will be using the sales_trx_data.xlsx file, which you can download from the book's GitHub repository. See the Technical requirements section of this chapter. The file contains sales data split by year into two sheets (2017 and 2018), respectively.

How to do it…

You will ingest the Excel file (.xlsx) using pandas and openpyxl, and leverage some of the available parameters in read_excel():

  1. Import the libraries for this recipe:
    import pandas as pd
    from pathlib import Path
    filepath = \
    Path('../../datasets/Ch2/sales_trx_data.xlsx')
  2. Read the Excel (.xlxs) file using the read_excel()function. By default, pandas will only read from the first sheet. This is specified under the sheet_name parameter, which is set to 0 as the default value. Before passing a new argument, you can use pandas.ExcelFile first to inspect the file and determine the number of sheets available. The ExcelFile class will provide additional methods and properties, such as sheet_name, which returns a list of sheet names:
    excelfile = pd.ExcelFile(filepath)
    excelfile.sheet_name
    >> ['2017', '2018']

If you have multiple sheets, you can specify which sheets you want to ingest by passing a list to the sheet_name parameter in read_excel. The list can either be positional arguments, such as first, second, and fifth sheets with [0, 1, 4], sheet names with ["Sheet1", "Sheet2", "Sheet5"], or a combination of both, such as first sheet, second sheet, and a sheet named "Revenue" [0, 1, "Revenue"].

In the following code, you will use sheet positions to read both the first and second sheets (0 and 1 indexes). This will return a Python dictionary object with two DataFrames. Notet hat the returned dictionary (key-value pair) has numeric keys (0 and 1) representing the first and second sheets (positional index), respectively:

ts = pd.read_excel(filepath,
                   engine='openpyxl',
                   index_col=1,
                   sheet_name=[0,1],
                   parse_dates=True)
ts.keys()
>> dict_keys([0, 1])
  1. Alternatively, you can pass a list of sheet names. Notice that the returned dictionary keys are now strings and represent the sheet names as shown in the following code:
    ts = pd.read_excel(filepath,
                       engine='openpyxl',
                       index_col=1,
                       sheet_name=['2017','2018'],
                       parse_dates=True)
    ts.keys()
    >> dict_keys(['2017', '2018'])
  2. If you want to read from all the available sheets, you will pass None instead. The keys for the dictionary, in this case, will represent sheet names:
    ts = pd.read_excel(filepath,
                       engine='openpyxl',
                       index_col=1,
                       sheet_name=None,
                       parse_dates=True)
    ts.keys()
    >> dict_keys(['2017', '2018'])

The two DataFrames within the dictionary are identical (homogeneous-typed) in terms of their schema (column names and data types). You can inspect each DataFrame with ts['2017'].info() and ts['2018'].info().

They both have a DatetimeIndex object, which you specified in the index_col parameter. The 2017 DataFrame consists of 36,764 rows and the 2018 DataFrame consists of 37,360. In this scenario, you want to stack (combine) the two (think UNION in SQL) into a single DataFrame that contains all 74,124 rows and a DatetimeIndex that spans from 2017-01-01 to 2018-12-31.

To combine the two DataFrames along the index axis (stacked one on top of the other), you will use the pandas.concat() function. The default behavior of the concat() function is to concatenate along the index axis (axis=0). In the following code, you will explicitly specify which DataFrames to concatenate:

ts_combined = pd.concat([ts['2017'],ts['2018']])
ts_combined.info()
>> <class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 74124 entries, 2017-01-01 to 2018-12-31
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Line_Item_ID        74124 non-null  int64
 1   Credit_Card_Number  74124 non-null  int64
 2   Quantity            74124 non-null  int64
 3   Menu_Item           74124 non-null  object
dtypes: int64(3), object(1)
memory usage: 2.8+ MB
  1. When you have multiple DataFrames returned (think multiple sheets), you can use the concat() function on the returned dictionary. In other words, you can combine the concat() and read_excel() functions in one statement. In this case, you will end up with a MultiIndex DataFrame where the first level is the sheet name (or number) and the second level is the DatetimeIndex. For example, using the ts dictionary, you will get a two-level index: MultiIndex([('2017', '2017-01-01'), ..., ('2018', '2018-12-31')], names=[None, 'Date'], length=74124).

To reduce the number of levels, you can use the droplevel(level=0) method to drop the first level after pandas .concat() shown as follows:

ts_combined = pd.concat(ts).droplevel(level=0)
  1. If you are only reading one sheet, the behavior is slightly different. By default, sheet_name is set to 0, which means it reads the first sheet. You can modify this and pass a different value (single value), either the sheet name (string) or sheet position (integer). When passing a single value, the returned object will be a pandas DataFrame and not a dictionary:
    ts = pd.read_excel(filepath,
                       index_col=1,
                       sheet_name='2018',
                       parse_dates=True)
    type(ts)
    >> pandas.core.frame.DataFrame

Do note though that if you pass a single value inside two brackets ([1]), then pandas will interpret this differently and the returned object will be a dictionary that contains one DataFrame.

Lastly, note that you did not need to specify the engine in the last example. The read_csv function will determine which engine to use based on the file extension. So, for example, suppose the library for that engine is not installed. In that case, it will throw an ImportError message, indicating that the library (dependency) is missing.

How it works…

The pandas.read_excel() function has many common parameters with the pandas.read_csv() function that you used earlier. The read_excel function can either return a DataFrame object or a dictionary of DataFrames. The dependency here is whether you are passing a single value (scalar) or a list to sheet_name.

In the sales_trx_data.xlsx file, both sheets had the same schema (homogeneous- typed). The sales data was partitioned (split) by year, where each sheet contained sales for a particular year. In this case, concatenating the two DataFrames was a natural choice. The pandas.concat() function is like the DataFrame.append() function, in which the second DataFrame was added (appended) to the end of the first DataFrame. This should be similar in behavior to the UNION clause for those coming from a SQL background.

There's more…

An alternative method to reading an Excel file is with the pandas.ExcelFile() class, which returns a pandas ExcelFile object. Earlier in this recipe, you used ExcelFile() to inspect the number of sheets in the Excel file through the sheet_name property.

The ExcelFile class has several useful methods, including the parse() method to parse the Excel file into a DataFrame, similar to the pandas.read_excel() function.

In the following example, you will use the ExcelFile class to parse the first sheet, assign the first column as an index, and print the first five rows:

excelfile = pd.ExcelFile(filepath)
excelfile.parse(sheet_name='2017',
                index_col=1,
                parse_dates=True) head()

You should see similar results for the first five rows of the DataFrame:

Figure 2.3 – The first five rows of the ts DataFrame using JupyterLab

Figure 2.3 – The first five rows of the ts DataFrame using JupyterLab

From Figure 2.3, it should become clear that ExcelFile.parse() is equivalent to pandas.read_excel().

See also

For more information on pandas.read_excel() and pandas.ExcelFile(), please refer to the official documentation:

 

Reading data from URLs

Files can be downloaded and stored locally on your machine, or stored on a remote server or cloud location. In the earlier two recipes, Reading from CSVs and other delimited files, and Reading data from an Excel file, both files were stored locally.

Many of the pandas reader functions can read data from remote locations by passing a URL path. For example, both read_csv() and read_excel() can take a URL to read a file that is accessible via the internet. In this recipe, you will read a CSV file using pandas.read_csv() and Excel files using pandas.read_excel() from remote locations, such as GitHub and AWS S3 (private and public buckets). You will also read data directly from an HTML page into a pandas DataFrame.

Getting ready

You will need to install the AWS SDK for Python (Boto3) for reading files from S3 buckets. Additionally, you will learn how to use the storage_options parameter available in many of the reader functions in pandas to read from S3 without the Boto3 library.

To use an S3 URL (for example, s3://bucket_name/path-to-file) in pandas, you will need to install the s3fs library. You will also need to install an HTML parser for when we use read_html(). For example, for the parsing engine (the HTML parser), you can install either lxml or html5lib; pandas will pick whichever is installed (it will first look for lxml, and if that fails, then for html5lib). If you plan to use html5lib you will need to install Beautiful Soup (beautifulsoup4).

To install using pip, you can use the following command:

>>> pip install boto3 s3fs lxml

To install using Conda, you can use:

>>> conda install boto3 s3fs lxml -y

How to do it…

This recipe will present you with different scenarios when reading data from online (remote) sources. Let's import pandas upfront since you will be using it throughout this recipe:

import pandas as pd

Reading data from GitHub

Sometimes, you may find useful public data on GitHub that you want to use and read directly (without downloading). One of the most common file formats on GitHub are CSV files. Let's start with the following steps:

  1. To read a CSV file from GitHub, you will need the URL to the raw content. If you copy the file's GitHub URL from the browser and use it as the file path, you will get a URL that looks like this: https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./blob/main/datasets/Ch2/AirQualityUCI.csv. This URL is a pointer to the web page in GitHub and not the data itself; hence when using pd.read_csv(), it will throw an error:
    url = 'https://github.com/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./blob/main/datasets/Ch2/AirQualityUCI.csv'
    pd.read_csv(url)
    ParserError: Error tokenizing data. C error: Expected 1 fields in line 62, saw 2
  2. Instead, you will need the raw content, which will give you a URL that looks like this: https://media.githubusercontent.com/media/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./main/datasets/Ch2/AirQualityUCI.csv:
Figure 2.4 – The GitHub page for the CSV file. Note the View raw button

Figure 2.4 – The GitHub page for the CSV file. Note the View raw button

  1. In Figure 2.4, notice that the values are not comma-separated (not a comma-delimited file); instead, the file uses semicolon (;) to separate the values.

The first column in the file is the Date column. You will need to parse (parse_date parameter) and convert it to DatetimeIndex (index_col parameter).

Pass the new URL to pandas.read_csv():

url = 'https://media.githubusercontent.com/media/PacktPublishing/Time-Series-Analysis-with-Python-Cookbook./main/datasets/Ch2/AirQualityUCI.csv'
date_parser = lambda x: pd.to_datetime(x, format="%d/%m/%Y")
df = pd.read_csv(url,
                 delimiter=';',
                 index_col='Date',
                 date_parser=date_parser)
df.iloc[:3,1:4]
>>
            CO(GT)  PT08.S1(CO)  NMHC(GT)
Date
2004-03-10     2.6      1360.00       150
2004-03-10     2.0      1292.25       112
2004-03-10     2.2      1402.00        88

We successfully ingested the data from the CSV file in GitHub into a DataFrame and printed the first three rows of select columns.

Reading data from a public S3 bucket

AWS supports virtual-hosted-style URLs such as https://bucket-name.s3.Region.amazonaws.com/keyname, path-style URLs such as https://s3.Region.amazonaws.com/bucket-name/keyname, and using S3://bucket/keyname. Here are examples of how these different URLs may look for our file:

In this example, you will be reading the AirQualityUCI.xlsx file, which has only one sheet. It contains the same data as AirQualityUCI.csv, which we read earlier from GitHub.

Note that in the URL, you do not need to specify the region as us-east-1. us-east-1, which represents US East (North Virginia), is an exception and will not be the case for other regions:

url = 'https://tscookbook.s3.amazonaws.com/AirQualityUCI.xlsx'
df = pd.read_excel(url,
                   index_col='Date',
                   parse_dates=True)

Read the same file using the S3:// URL:

s3uri = 's3://tscookbook/AirQualityUCI.xlsx'
df = pd.read_excel(s3uri,
                   index_col='Date',
                   parse_dates=True)

You may get an error such as the following:

ImportError: Install s3fs to access S3

This indicates that either you do not have the s3fs library installed or possibly you are not using the right Python/Conda environment.

Reading data from a private S3 bucket

When reading files from a private S3 bucket, you will need to pass your credentials to authenticate. A convenient parameter in many of the I/O functions in pandas is storage_options, which allows you to send additional content with the request, such as a custom header or required credentials to a cloud service.

You will need to pass a dictionary (key-value pair) to provide the additional information along with the request, such as username, password, access keys, and secret keys to storage_options as in {"username": username, "password": password}.

Now, you will read the AirQualityUCI.csv file, located in a private S3 bucket:

  1. You will start by storing your AWS credentials in a config .cfg file outside your Python script. Then, use configparser to read the values and store them in Python variables. You do not want your credentials exposed or hardcoded in your code:
    # Example aws.cfg file
    [AWS]
    aws_access_key=your_access_key
    aws_secret_key=you_secret_key

You can load the aws.cfg file using config.read():

import configparser
config = configparser.ConfigParser()
config.read('aws.cfg')
AWS_ACCESS_KEY = config['AWS']['aws_access_key']
AWS_SECRET_KEY = config['AWS']['aws_secret_key']
  1. The AWS Access Key ID and Secret Access Key are now stored in AWS_ACCESS_KEY and AWS_SECRET_KEY. Use pandas.read_csv() to read the CSV file and update the storage_options parameter by passing your credentials, as shown in the following code:
    s3uri = "s3://tscookbook-private/AirQuality.csv"
    df = pd.read_csv(s3uri,
                     index_col='Date',
                     parse_dates=True,
                     storage_options= {
                             'key': AWS_ACCESS_KEY,
                             'secret': AWS_SECRET_KEY
                         })
    df.iloc[:3, 1:4]
    >>
               CO(GT)  PT08.S1(CO)  NMHC(GT)
    Date
    2004-10-03    2,6       1360.0     150.0
    2004-10-03      2       1292.0     112.0
      2,2       1402.0      88.0
  2. Alternatively, you can use the AWS SDK for Python (Boto3) to achieve similar results. The boto3 Python library gives you more control and additional capabilities (beyond just reading and writing to S3). You will pass the same credentials stored earlier in AWS_ACCESS_KEY and AWS_SECRET_KEY and pass them to AWS, using boto3 to authenticate:
    import boto3
    bucket = "tscookbook-private"
    client = boto3.client("s3",
                      aws_access_key_id =AWS_ACCESS_KEY,
                      aws_secret_access_key = AWS_SECRET_KEY)

Now, the client object has access to many methods specific to the AWS S3 service for creating, deleting, and retrieving bucket information, and more. In addition, Boto3 offers two levels of APIs: client and resource. In the preceding example, you used the client API.

The client is a low-level service access interface that gives you more granular control, for example, boto3.client("s3"). The resource is a higher-level object-oriented interface (an abstraction layer), for example, boto3.resource("s3").

In Chapter 4, Persisting Time Series Data to Files, you will explore the resource API interface when writing to S3. For now, you will use the client interface.

  1. You will use the get_object method to retrieve the data. Just provide the bucket name and a key. The key here is the actual filename:
    data = client.get_object(Bucket=bucket, Key='AirQuality.csv')
    df = pd.read_csv(data['Body'],
                     index_col='Date',
                     parse_dates=True)
         
    df.iloc[:3, 1:4]
    >>
               CO(GT)  PT08.S1(CO)  NMHC(GT)
    Date
    2004-10-03    2,6       1360.0     150.0
    2004-10-03      2       1292.0     112.0
    2004-10-03    2,2       1402.0      88.0
  2. When calling the client.get_object() method, a dictionary (key-value pair) is returned, as shown in the following example:
    {'ResponseMetadata': {
    'RequestId':'MM0CR3XX5QFBQTSG',
    'HostId':'vq8iRCJfuA4eWPgHBGhdjir1x52Tdp80ADaSxWrL4Xzsr
    VpebSZ6SnskPeYNKCOd/RZfIRT4xIM=',
    'HTTPStatusCode':200,
    'HTTPHeaders': {'x-amz-id-2': 'vq8iRCJfuA4eWPgHBGhdjir1x52
    Tdp80ADaSxWrL4XzsrVpebSZ6SnskPeYNKCOd/RZfIRT4xIM=',
       'x-amz-request-id': 'MM0CR3XX5QFBQTSG',
       'date': 'Tue, 06 Jul 2021 01:08:36 GMT',
       'last-modified': 'Mon, 14 Jun 2021 01:13:05 GMT',
       'etag': '"2ce337accfeb2dbbc6b76833bc6f84b8"',
       'accept-ranges': 'bytes',
       'content-type': 'binary/octet-stream',
       'server': 'AmazonS3',
       'content-length': '1012427'},
       'RetryAttempts': 0},
       'AcceptRanges': 'bytes',
     'LastModified': datetime.datetime(2021, 6, 14, 1, 13, 5, tzinfo=tzutc()),
     'ContentLength': 1012427,
     'ETag': '"2ce337accfeb2dbbc6b76833bc6f84b8"',
     'ContentType': 'binary/octet-stream',
     'Metadata': {},
     'Body': <botocore.response.StreamingBody at 0x7fe9c16b55b0>}

The content you are interested in is in the response body under the Body key. You passed data['Body'] to the read_csv() function, which loads the response stream (StreamingBody) into a DataFrame.

Reading data from HTML

pandas offers an elegant way to read HTML tables and convert the content into a pandas DataFrame using the pandas.read_html() function:

  1. In the following recipe, we will extract HTML tables from Wikipedia for COVID-19 pandemic tracking cases by country and by territory (https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory):
    url = "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
    results = pd.read_html(url)
    print(len(results))
    >> 71
  2. pandas.read_html() returned a list of DataFrames, one for each HTML table found in the URL. Keep in mind that the website's content is dynamic and gets updated regularly, and the results may vary. In our case, it returned 71 DataFrames. The DataFrame at index 15 contains summary on COVID-19 cases and deaths by region. Grab the DataFrame (at index 15) and assign it to the df variable, and print the returned columns:
    df = results[15]
    df.columns
    >> Index(['Region[28]', 'Total cases', 'Total deaths', 'Cases per million',
           'Deaths per million', 'Current weekly cases', 'Current weekly deaths',
           'Population millions', 'Vaccinated %[29]'],
          dtype='object')
  3. Display the first five rows for Total cases, Total deaths, and the Cases per million columns.
    df[['Total cases', 'Total deaths', 'Cases per million']].head()
    >>
         Total cases    Total deaths    Cases per million
    0    139300788      1083815         311412
    1    85476396       1035884         231765
    2    51507114       477420          220454
    3    56804073       1270477         132141
    4    21971862       417507          92789

How it works…

Most of the pandas reader functions accept a URL as a path. Examples include the following:

  • pandas.read_csv()
  • pandas.read_excel()
  • pandas.read_parquet()
  • pandas.read_table()
  • pandas.read_pickle()
  • pandas.read_orc()
  • pandas.read_stata()
  • pandas.read_sas()
  • pandas.read_json()

The URL needs to be one of the valid URL schemes that pandas supports, which includes http and https, ftp, s3, gs, or the file protocol.

The read_html() function is great for scraping websites that contain data in HTML tables. It inspects the HTML and searches for all the <table> elements within the HTML. In HTML, table rows are defined with the <tr> </tr> tags and headers with the <th></th> tags. The actual data (cell) is contained within the <td> </td> tags. The read_html() function looks for <table>, <tr>, <th>, and <td> tags and converts the content into a DataFrame, and assigns the columns and rows as they were defined in the HTML. If an HTML page contains more than one <table></table> tag, read_html will return them all and you will get a list of DataFrames.

The following code demonstrates how pandas.read_html()works:

import pandas as pd
html = """
 <table>
   <tr>
     <th>Ticker</th>
     <th>Price</th>
   </tr>
   <tr>
     <td>MSFT</td>
     <td>230</td>
   </tr>
   <tr>
     <td>APPL</td>
     <td>300</td>
   </tr>
     <tr>
     <td>MSTR</td>
     <td>120</td>
   </tr>
 </table>
 </body>
 </html>
 """
   
df = pd.read_html(html)
df[0]
>>
  Ticker  Price
0   MSFT    230
1   APPL    300
2   MSTR    120

In the preceding code, the read_html() function parsed the HTML code and converted the HTML table into a pandas DataFrame. The headers between the <th> and </th> tags represent the column names of the DataFrame, and the content between the <tr></td> and </td></tr> tags represent the row data of the DataFrame. Note that if you go ahead and delete the <table> and </table> table tags, you will get the ValueError: No tables found error.

There's more…

The read_html() function has an optional attr argument, which takes a dictionary of valid HTML <table> attributes, such as id or class. For example, you can use the attr parameter to narrow down the tables returned to those that match the class attribute sortable as in <table class="sortable">. The read_html function will inspect the entire HTML page to ensure you target the right set of attributes.

In the previous exercise, you used the read_html function on the COVID-19 Wikipedia page, and it returned 71 tables (DataFrames). The number of tables will probably increase as time goes by as Wikipedia gets updated. You can narrow down the result set and guarantee some consistency by using the attr option. First, start by inspecting the HTML code using your browser. You will see that several of the <table> elements have multiple classes listed, such as sortable. You can look for other unique identifiers.

<table class="wikitable sortable mw-datatable covid19-countrynames jquery-tablesorter" id="thetable" style="text-align:right;">

Note, if you get the error html5lib not found, please install it you will need to install both html5lib and beautifulSoup4.

To install using conda, use the following:

conda install html5lib beautifulSoup4

To install using pip, use the following:

pip install html5lib beautifulSoup4

Now, let's use the sortable class and request the data again:

url = "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
df = pd.read_html(url, attrs={'class': 'sortable'})
len(df)
>>  7
df[3].columns
>>
Index(['Region[28]', 'Total cases', 'Total deaths', 'Cases per million',
       'Deaths per million', 'Current weekly cases', 'Current weekly deaths',
       'Population millions', 'Vaccinated %[29]'],
      dtype='object')

The list returned a smaller subset of tables (from 71 down to 7).

See also

For more information, please refer to the official pandas.read_html documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html.

 

Reading data from a SAS dataset

In this recipe, you will read a SAS data file and, more specifically, a file with the SAS7BDAT extension. SAS is commercial statistical software that provides data mining, business intelligence, and advanced analytics capabilities. Many large organizations in various industries rely on SAS, so it is very common to encounter the need to read from a SAS dataset.

Getting ready

In this recipe, you will be using pandas to read a .sas7bdat file. These files can be extremely large, and you will be introduced to different ways to read such files more efficiently.

To get ready, you can download the SAS sample dataset from http://support.sas.com/kb/61/960.html. You will be reading the DCSKINPRODUCT.sas7bdat file.

The SAS data file is also provided in the GitHub repository for this book.

How to do it…

You will use the pandas.read_sas() function, which can be used to read both SAS XPORT (.xpt) and SAS7BDAT file formats. However, there is no SAS writer function in pandas:

  1. Start by importing pandas and creating the path variable to the file. This file is not large (14.7 MB) compared to a typical SAS file, which can be 100+ GB:
    import pandas as pd
    path = '../../datasets/Ch2/DCSKINPRODUCT.sas7bdat'
  2. One of the advantages of using pandas is that it provides data structures for in-memory analysis, hence the performance advantage when analyzing data. On the other hand, this can also be a constraint when loading large datasets into memory. Generally, the amount of data you can load is limited by the amount of memory available. However, this can be an issue if the dataset is too large and exceeds the amount of memory.

One way to tackle this issue is by using the chunksize parameter. The chunksize parameter is available in many reader and writer functions, including read_sas. The DCSKINPRODUCT.sas7bdat file contains 152130 records, so you will use a chunksize parameter to read 10000 records at a time:

df = pd.read_sas(path, chunksize=10000)
type(df)
>> pandas.io.sas.sas7bdat.SAS7BDATReader
  1. The returned object is not a DataFrame but a SAS7BDATReader object. You can think of this as an iterator object that you can iterate through. At each iteration or chunk, you get a DataFrame of 10,000 rows at a time. You can retrieve the first chunk using the next() method that is, df.next(). Every time you use the next() method, it will retrieve the next batch or chunk (the next 10,000 rows). You can also loop through the chunks, for example, to do some computations. This can be helpful when the dataset is too large to fit in memory, allowing you to iterate through manageable chunks to do some heavy aggregations. The following code demonstrates this concept:
    results = []
    for chunk in df:
        results.append(
            chunk)
    len(results)
    >> 16
    df = pd.concat(results)
    df.shape
    >> (152130, 5)

There were 16 chunks (DataFrames) in total; each chunk or DataFrame contained 10000 records. Using the concat function, you can combine all 16 DataDrames into a large DataFrame of 152130 records.

  1. Reread the data in chunks, and this time group by DATE and aggregate using sum and count, as shown in the following:
    df = pd.read_sas(path, chunksize=10000)
    results = []
    for chunk in df:
        results.append(
            chunk.groupby('DATE')['Revenue']
                 .agg(['sum', 'count']))
  2. The results object is now a list of DataFrames. Now, let's examine the result set:
    results[0].loc['2013-02-10']
    >>
    sum      923903.0
    count        91.0
    Name: 2013-02-10 00:00:00, dtype: float64
    results[1].loc['2013-02-10']
    >>
    sum      8186392.0
    count         91.0
    Name: 2013-02-10 00:00:00, dtype: float64
    results[2].loc['2013-02-10']
    >>
    sum      5881396.0
    count         91.0
    Name: 2013-02-10 00:00:00, dtype: float64
  3. From the preceding output, you can observe that we have another issue to solve. Notice that the observations for 2013-02-10 got split. This is a common issue when chunking since it splits the data disregarding their order or sequence.

You can resolve this by combining the results in a meaningful way. For example, you can use the reduce function in Python. The reduce function allows you to perform a rolling computation (also known as folding or reducing) based on some function you provide. The following code demonstrates how this can be implemented:

from functools import reduce
final = reduce(lambda x1, x2: x1.add(x2, fill_value=0), results)
type(final)
>> pandas.core.frame.DataFrame
final.loc['2013-02-10']
>>
sum      43104420.0
count        1383.0
Name: 2013-02-10 00:00:00, dtype: float64
final.shape
>> (110, 2)

From the preceding output, the 16 chunks or DataFrames were reduced to a single value per row (index). We leveraged the pandas.DataFrame.add() function to add the values and use zero (0) as a fill value when the data is missing.

How it works…

Using the chunksize parameter in the read_sas() function will not return a DataFrame but rather an iterator (a SAS7BDATReader object). The chunksize parameter is available in most reader functions in pandas, such as read_csv, read_hdf, and read_sql, to name a few. Similarly, using the chunkize parameter with those functions will also return an iterator.

If chunksize is not specified, the returned object would be a DataFrame of the entire dataset. This is because the default value is None in all the reader functions.

Chunking is great when the operation or workflow is simple and not sequential. An operation such as groupby can be complex and tricky when chunking, which is why we added two extra steps:

  • Stored the resulting DataFrame to a list.
  • Used Python's reduce() function, which takes two arguments, a function and an iterator. It then applies the function element-wise and does it cumulatively from left to right to reduce down to a one result set. We also leveraged the DataFrame's add() method, which matches DataFrame indices to perform an element-wise addition.

There's more…

There are better options when working with large files than using pandas, especially if you have memory constraints and cannot fit the entire data into the memory. Chunking is a great option, but it still has an overhead and relies on memory. The pandas library is a single-core framework and does not offer parallel computing capabilities. Instead, there are specialized libraries and frameworks for parallel processing designed to work with big data. Such frameworks do not rely on loading everything into memory and instead can utilize multiple CPU cores, disk usage, or expand into multiple worker nodes (think multiple machines). For example, Dask chunks your data, creates a computation graph, and parallelizes the smaller tasks (chunks) behind the scenes, thus speeding the overall processing time and reducing memory overhead.

These frameworks are great but will require you to spend time learning the framework and rewriting your code to leverage these capabilities. So, there is a steep learning curve initially. Luckily, this is where the Modin project comes into play. For example, the Modin library acts as a wrapper or, more specifically, an abstraction on top of Dask or Ray that uses a similar API to pandas. Modin makes optimizing your pandas' code much more straightforward without learning another framework, and all it takes is a single line of code.

Before installing any library, it is highly advised that you create a separate virtual environment, for example, using conda. The concept and purpose behind creating virtual environments were discussed in detail in Chapter 1, Getting Started with Time Series Analysis, with multiple examples.

To install Modin using Conda (with a Dask backend), run the following:

>> conda install -c conda-forge modin-dask

To install with Pip, use the following:

>> pip install modin[dask]

You will measure the time and memory usage using pandas and again using Modin. To measure memory usage, you will need to install the memory_profiler library.

>> pip install memory_profiler

The memory_profiler library provides IPython and Jupyter magics such as %memit and %mprun, similar to known magics such as %timeit and %time.

Start by loading the required libraries:

import memory_profiler 
import pandas as pd
path = '../../datasets/Ch2/large_file.csv'

You will start by using pandas to read the file large_file.csv:

%%time
%memit pd.read_csv(path).groupby('label_source').count()

The preceding code should output something similar to the following:

peak memory: 161.35 MiB, increment: 67.34 MiB
CPU times: user 364 ms, sys: 95.2 ms, total: 459 ms
Wall time: 1.03 s

Now, you will load Modin and specify Dask as the engine:

from modin.config import Engine
Engine.put("dask")  # Modin will use Dask
import modin.pandas as pd
from distributed import Client
client = Client()

Notice that in the preceding code that Modin has a pandas implementation. This way, you can leverage your existing code without modification. You will now rerun the same code:

%%time
%memit pd.read_csv(path).groupby('label_source').count()

The preceding code should produce an output similar to the following:

peak memory: 137.12 MiB, increment: 9.34 MiB
CPU times: user 899 ms, sys: 214 ms, total: 1.11 s
Wall time: 1.91 s

Observe how the peak memory was reduced from 160 MiB to 137.12 MiB using Modin (Dask). Most importantly, notice how the memory increment went down from 67 MiB to 9 MiB with Modin. Overall, with Modin, you got lower memory usage. However, Modin (Dask) will show more significant advantages with more extensive operations on larger datasets.

See also

About the Author
  • Tarek A. Atwan

    Tarek A. Atwan is a data analytics expert with over 16 years of international consulting experience, providing subject matter expertise in data science, machine learning operations, data engineering, and business intelligence. He has taught multiple hands-on coding boot camps, courses, and workshops on various topics, including data science, data visualization, Python programming, time series forecasting, and blockchain at different universities in the United States. He is regarded as an industry mentor and advisor, working with executive leaders in various industries to solve complex problems using a data-driven approach.

    Browse publications by this author
Time Series Analysis with Python Cookbook
Unlock this book and the full library FREE for 7 days
Start now