Python Data Cleaning Cookbook

By Michael Walker

About this book

Getting clean data to reveal insights is essential, as directly jumping into data analysis without proper data cleaning may lead to incorrect results. This book shows you tools and techniques that you can apply to clean and handle data with Python. You'll begin by getting familiar with the shape of data by using practices that can be deployed routinely with most data sources. Then, the book teaches you how to manipulate data to get it into a useful form. You'll also learn how to filter and summarize data to gain insights and better understand what makes sense and what does not, along with discovering how to operate on data to address the issues you've identified. Moving on, you'll perform key tasks, such as handling missing values, validating errors, removing duplicate data, monitoring high volumes of data, and handling outliers and invalid dates. Next, you'll cover recipes on using supervised learning and Naive Bayes analysis to identify unexpected values and classification errors, and generate visualizations for exploratory data analysis (EDA) to visualize unexpected values. Finally, you'll build functions and classes that you can reuse without modification when you have new data.

By the end of this Python book, you'll be equipped with all the key skills that you need to clean data and diagnose problems within it.

Publication date: December 2020
Publisher: Packt
Pages: 436
ISBN: 9781800565661

 

Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas

Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on) provide analysts with an impressive range of data manipulation, exploration, and visualization tools. One important tool is pandas. Developed by Wes McKinney in 2008, but really gaining in popularity after 2012, pandas is now an essential library for data analysis in Python. We work with pandas extensively in this book, along with popular packages such as numpy, matplotlib, and scipy.

A key pandas object is the data frame, which represents data as a tabular structure, with rows and columns. In this way, it is similar to the other data stores we discuss in this chapter. However, a pandas data frame also has indexing functionality that makes selecting, combining, and transforming data relatively straightforward, as the recipes in this book will demonstrate.

Before we can make use of this great functionality, we have to get our data into pandas. Data comes to us in a wide variety of formats: as CSV or Excel files, as tables from SQL databases, from statistical analysis packages such as SPSS, Stata, SAS, or R, from non-tabular sources such as JSON, and from web pages.

We examine tools for importing tabular data in this chapter. Specifically, we cover the following recipes:

  • Importing CSV files
  • Importing Excel files
  • Importing data from SQL databases
  • Importing SPSS, Stata, and SAS data
  • Importing R data
  • Persisting tabular data
 

Technical requirements

The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook

 

Importing CSV files

The read_csv method of the pandas library can be used to read a file of comma-separated values (CSV) and load it into memory as a pandas data frame. In this recipe, we read a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data.

Raw data is often stored as CSV files. These files have a line break at the end of each line of data to demarcate a row, and a comma between each data value to delineate columns. Something other than a comma can be used as the delimiter, such as a tab. Quotation marks may be placed around values, which is helpful when the delimiter occurs naturally within a value, as sometimes happens with commas.

All data in a CSV file are characters, regardless of the logical data type. This is why it is easy to view a CSV file, presuming it is not too large, in a text editor. The pandas read_csv method will make an educated guess about the data type of each column, but you will need to help it along to ensure that these guesses are on the mark.
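
If you already know what some columns should contain, you can make those expectations explicit rather than relying on the guesses. Here is a minimal sketch (the file and column names are hypothetical, not part of this recipe) that uses the dtype and na_values parameters of read_csv so that an identifier column is kept as text and empty strings are treated as missing:

>>> import pandas as pd
>>> # hypothetical file; the id column is read as text so leading zeros survive
>>> temps = pd.read_csv('data/sometemps.csv',
...     dtype={'stationid': str, 'avgtemp': float},
...     na_values=[''])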

Getting ready

Create a folder for this chapter and create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder and place the landtempssample.csv file in that subfolder. Alternatively, you could retrieve all of the files from the GitHub repository. Here is a sample from the beginning of the CSV file:

locationid,year,month,temp,latitude,longitude,stnelev,station,countryid,country
USS0010K01S,2000,4,5.27,39.9,-110.75,2773.7,INDIAN_CANYON,US,United States
CI000085406,1940,5,18.04,-18.35,-70.333,58.0,ARICA,CI,Chile
USC00036376,2013,12,6.22,34.3703,-91.1242,61.0,SAINT_CHARLES,US,United States
ASN00024002,1963,2,22.93,-34.2833,140.6,65.5,BERRI_IRRIGATION,AS,Australia
ASN00028007,2001,11,,-14.7803,143.5036,79.4,MUSGRAVE,AS,Australia

Note

This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-4. This is just a 100,000-row sample of the full dataset, which is also available in the repository.

How to do it…

We will import a CSV file into pandas, taking advantage of some very useful read_csv options:

  1. Import the pandas library and set up the environment to make viewing the output easier:
    >>> import pandas as pd
    >>> pd.options.display.float_format = '{:,.2f}'.format
    >>> pd.set_option('display.width', 85)
    >>> pd.set_option('display.max_columns', 8)
  2. Read the data file, set new names for the headings, and parse the date column.

    Pass an argument of 1 to the skiprows parameter to skip the first row, pass a list of columns to parse_dates to create a pandas datetime column from those columns, and set low_memory to False so that read_csv loads the file in a single pass rather than in chunks, which improves data type inference (at the cost of using more memory during the import):

    >>> landtemps = pd.read_csv('data/landtempssample.csv',
    ...     names=['stationid','year','month','avgtemp','latitude',
    ...       'longitude','elevation','station','countryid','country'],
    ...     skiprows=1,
    ...     parse_dates=[['month','year']],
    ...     low_memory=False)
    >>> type(landtemps)
    <class 'pandas.core.frame.DataFrame'>
  3. Get a quick glimpse of the data.

    View the first few rows. Show the data type for all columns, as well as the number of rows and columns:

    >>> landtemps.head(7)
      month_year    stationid  ...  countryid        country
    0 2000-04-01  USS0010K01S  ...         US  United States
    1 1940-05-01  CI000085406  ...         CI          Chile
    2 2013-12-01  USC00036376  ...         US  United States
    3 1963-02-01  ASN00024002  ...         AS      Australia
    4 2001-11-01  ASN00028007  ...         AS      Australia
    5 1991-04-01  USW00024151  ...         US  United States
    6 1993-12-01  RSM00022641  ...         RS         Russia
    [7 rows x 9 columns]
    >>> landtemps.dtypes
    month_year    datetime64[ns]
    stationid             object
    avgtemp              float64
    latitude             float64
    longitude            float64
    elevation            float64
    station               object
    countryid             object
    country               object
    dtype: object
    >>> landtemps.shape
    (100000, 9)
  4. Give the date column a better name and view the summary statistics for average monthly temperature:
    >>> landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
    >>> landtemps.dtypes
    measuredate    datetime64[ns]
    stationid              object
    avgtemp               float64
    latitude              float64
    longitude             float64
    elevation             float64
    station                object
    countryid              object
    country                object
    dtype: object
    >>> landtemps.avgtemp.describe()
    count   85,554.00
    mean        10.92
    std         11.52
    min        -70.70
    25%          3.46
    50%         12.22
    75%         19.57
    max         39.95
    Name: avgtemp, dtype: float64
  5. Look for missing values for each column.

    Use isnull, which returns True for each missing value in each column and False otherwise. Chain this with sum to count the missing values for each column. (When working with Boolean values, sum treats True as 1 and False as 0. I will discuss method chaining in the There's more... section of this recipe):

    >>> landtemps.isnull().sum()
    measuredate        0
    stationid          0
    avgtemp        14446
    latitude           0
    longitude          0
    elevation          0
    station            0
    countryid          0
    country            5
    dtype: int64
  6. Remove rows with missing data for avgtemp.

    Use the subset parameter to tell dropna to drop rows where avgtemp is missing. Set inplace to True. Leaving inplace at its default value of False would return a new data frame with the rows dropped, but would not change landtemps itself. Use the shape attribute of the data frame to get the number of rows and columns:

    >>> landtemps.dropna(subset=['avgtemp'], inplace=True)
    >>> landtemps.shape
    (85554, 9)

That's it! Importing CSV files into pandas is as simple as that.

How it works...

Almost all of the recipes in this book use the pandas library. We refer to it as pd to make it easier to reference later. This is customary. We also use float_format to display float values in a readable way and set_option to make the terminal output wide enough to accommodate the number of variables.

Much of the work is done by the first line in step 2. We use read_csv to load a pandas data frame in memory and call it landtemps. In addition to passing a filename, we set the names parameter to a list of our preferred column headings. We also tell read_csv to skip the first row, by setting skiprows to 1, since the original column headings are in the first row of the CSV file. If we do not tell it to skip the first row, read_csv will treat the header row in the file as actual data.

read_csv also solves a date conversion issue for us. We use the parse_dates parameter to ask it to convert the month and year columns to a date value.

Step 3 runs through a few standard data checks. We use head(7) to print out all columns for the first 7 rows. We use the dtypes attribute of the data frame to show the data type of all columns. Each column has the expected data type. In pandas, character data has the object data type, a data type that allows for mixed values. shape returns a tuple, whose first element is the number of rows in the data frame (100,000 in this case) and whose second element is the number of columns (9).

When we used read_csv to parse the month and year columns, it gave the resulting column the name month_year. We use the rename method in step 4 to give that column a better name. We need to specify inplace=True to replace the old column name with the new column name in memory. The describe method provides summary statistics on the avgtemp column.

Notice that the count for avgtemp indicates that there are 85,554 rows that have valid values for avgtemp. This is out of 100,000 rows for the whole data frame, as provided by the shape attribute. The listing of missing values for each column in step 5 (landtemps.isnull().sum()) confirms this: 100,000 – 85,554 = 14,446.

Step 6 drops all rows where avgtemp is NaN. (The NaN value, not a number, is the pandas representation of missing values.) subset is used to indicate which column to check for missings. The shape attribute for landtemps now indicates that there are 85,554 rows, which is what we would expect given the previous count from describe.

There's more...

If the file you are reading uses a delimiter other than a comma, such as a tab, this can be specified in the sep parameter of read_csv. When creating the pandas data frame, an index was also created. The numbers to the far left of the head output are index values. Any number of rows can be passed to head (or to sample, which returns a random selection of rows); the default is 5.
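
For example, if the land temperature sample had been saved as a tab-delimited file (the .tsv file name here is hypothetical), only the sep argument would need to change:

>>> landtemps = pd.read_csv('data/landtempssample.tsv', sep='\t',
...     names=['stationid','year','month','avgtemp','latitude',
...       'longitude','elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']])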

Leaving low_memory at its default value of True causes read_csv to parse the data in chunks, which is easier on systems with limited memory when working with larger files, but can lead to inconsistent data type inference. Setting low_memory to False loads the file in a single pass. Either way, the full data frame will be loaded into memory once read_csv completes successfully.

The landtemps.isnull().sum() statement is an example of chaining methods. First, isnull returns a data frame of True and False values, resulting from testing whether each column value is null. sum takes that data frame and sums the True values for each column, interpreting the True values as 1 and the False values as 0. We would have obtained the same result if we had used the following two steps:

>>> checknull = landtemps.isnull()
>>> checknull.sum()

There is no hard and fast rule for when to chain methods and when not to. I find chaining helpful when the operation feels like a single conceptual step, even if, mechanically, it takes two or more steps. Chaining also has the side benefit of not creating extra objects that I might not need.

The dataset used in this recipe is just a sample from the full land temperatures database with almost 17 million records. You can run the larger file if your machine can handle it, with the following code:

>>> landtemps = pd.read_csv('data/landtemps.zip', compression='zip',
...     names=['stationid','year','month','avgtemp','latitude',
...       'longitude','elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)

read_csv can read a compressed ZIP file. We get it to do this by passing the name of the ZIP file and the type of compression.

See also

Subsequent recipes in this chapter, and in other chapters, set indexes to improve navigation over rows and merging.

A significant amount of reshaping of the Global Historical Climatology Network raw data was done before using it in this recipe. We demonstrate this in Chapter 8, Addressing Data Issues when Combining DataFrames. That recipe also shows how to read a text file that is not delimited but has fixed-width fields, by using read_fwf, as sketched below.
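
As a preview, a minimal read_fwf sketch looks like this (the file name, field widths, and column names are hypothetical placeholders, not the values used in Chapter 8):

>>> stations = pd.read_fwf('data/stations.txt',
...     widths=[11, 9, 10, 7],
...     names=['stationid','latitude','longitude','elevation'])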

 

Importing Excel files

The read_excel method of the pandas library can be used to import data from an Excel file and load it into memory as a pandas data frame. In this recipe, we import an Excel file and handle some common issues when working with Excel files: extraneous header and footer information, selecting specific columns, removing rows with no data, and connecting to particular sheets.

Despite the tabular structure of Excel, which invites the organization of data into rows and columns, spreadsheets are not datasets and do not require people to store data in that way. Even when some data conforms to those expectations, there is often additional information in rows or columns before or after the data to be imported. Data types are not always as clear as they are to the person who created the spreadsheet. This will be all too familiar to anyone who has ever battled with importing leading zeros. Moreover, Excel does not insist that all data in a column be of the same type, or that column headings be appropriate for use with a programming language such as Python.

Fortunately, read_excel has a number of options for handling messiness in Excel data. These options make it relatively easy to skip rows and select particular columns, and to pull data from a particular sheet or sheets.
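
For example, if a sheet has an identifier column whose leading zeros matter, read_excel can be told to keep that column as text rather than guessing a numeric type (the workbook, sheet, and column names below are hypothetical):

>>> ids = pd.read_excel('data/someworkbook.xlsx',
...     sheet_name='ids',
...     dtype={'zipcode': str})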

Getting ready

You can download the GDPpercapita.xlsx file, as well as the code for this recipe, from the GitHub repository for this book. The code assumes that the Excel file is in a data subfolder. Here is a view of the beginning of the file:

Figure 1.1 – View of the dataset

And here is a view of the end of the file:

Figure 1.2 – View of the dataset

Note

This dataset, from the Organisation for Economic Co-operation and Development, is available for public use at https://stats.oecd.org/.

How to do it…

We import an Excel file into pandas and do some initial data cleaning:

  1. Import the pandas library:
    >>> import pandas as pd
  2. Read the Excel per capita GDP data.

    Select the sheet with the data we need, but skip the columns and rows that we do not want. Use the sheet_name parameter to specify the sheet. Set skiprows to 4 and skipfooter to 1 to skip the first four rows (the first row is hidden) and the last row. We provide values for usecols to get data from column A and columns C through T (column B is blank). Use head to view the first few rows:

    >>> percapitaGDP = pd.read_excel("data/GDPpercapita.xlsx",
    ...    sheet_name="OECD.Stat export",
    ...    skiprows=4,
    ...    skipfooter=1,
    ...    usecols="A,C:T")
    >>> percapitaGDP.head()
                             Year   2001  ...   2017   2018
    0          Metropolitan areas    NaN  ...    NaN    NaN
    1              AUS: Australia     ..  ...     ..     ..
    2       AUS01: Greater Sydney  43313  ...  50578  49860
    3    AUS02: Greater Melbourne  40125  ...  43025  42674
    4     AUS03: Greater Brisbane  37580  ...  46876  46640
    [5 rows x 19 columns]
  3. Use the info method of the data frame to view data types and the non-null count:
    >>> percapitaGDP.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 702 entries, 0 to 701
    Data columns (total 19 columns):
     #   Column  Non-Null Count  Dtype 
    ---  ------  --------------  ----- 
     0   Year    702 non-null    object
     1   2001    701 non-null    object
     2   2002    701 non-null    object
     3   2003    701 non-null    object
     4   2004    701 non-null    object
     5   2005    701 non-null    object
     6   2006    701 non-null    object
     7   2007    701 non-null    object
     8   2008    701 non-null    object
     9   2009    701 non-null    object
     10  2010    701 non-null    object
     11  2011    701 non-null    object
     12  2012    701 non-null    object
     13  2013    701 non-null    object
     14  2014    701 non-null    object
     15  2015    701 non-null    object
     16  2016    701 non-null    object
     17  2017    701 non-null    object
     18  2018    701 non-null    object
    dtypes: object(19)
    memory usage: 104.3+ KB
  4. Rename the Year column to metro and remove the leading spaces.

    Give an appropriate name to the metropolitan area column. There are extra spaces before the metro values in some cases, and extra spaces after the metro values in others. We can test for leading spaces with startswith(' ') and then use any to establish whether there are one or more occasions when the first character is blank. We can use endswith(' ') to examine trailing spaces. We use strip to remove both leading and trailing spaces:

    >>> percapitaGDP.rename(columns={'Year':'metro'}, inplace=True)
    >>> percapitaGDP.metro.str.startswith(' ').any()
    True
    >>> percapitaGDP.metro.str.endswith(' ').any()
    True
    >>> percapitaGDP.metro = percapitaGDP.metro.str.strip()
  5. Convert the data columns to numeric.

    Iterate over all of the GDP year columns (2001-2018) and convert the data type from object to float. Coerce the conversion even when there is character data – the .. in this example. We want character values in those columns to become missing, which is what happens. Rename the year columns to better reflect the data in those columns:

    >>> for col in percapitaGDP.columns[1:]:
    ...   percapitaGDP[col] = pd.to_numeric(percapitaGDP[col], errors='coerce')
    ...   percapitaGDP.rename(columns={col:'pcGDP'+col}, inplace=True)
    ... 
    >>> percapitaGDP.head()
                          metro  pcGDP2001  ...  pcGDP2017  pcGDP2018
    0        Metropolitan areas        nan  ...        nan        nan
    1            AUS: Australia        nan  ...        nan        nan
    2     AUS01: Greater Sydney      43313  ...      50578      49860
    3  AUS02: Greater Melbourne      40125  ...      43025      42674
    4   AUS03: Greater Brisbane      37580  ...      46876      46640
    >>> percapitaGDP.dtypes
    metro         object
    pcGDP2001    float64
    pcGDP2002    float64
    ...(output abbreviated to save space)...
    pcGDP2017    float64
    pcGDP2018    float64
    dtype: object
  6. Use the describe method to generate summary statistics for all numeric data in the data frame:
    >>> percapitaGDP.describe()
           pcGDP2001  pcGDP2002  ...  pcGDP2017  pcGDP2018
    count        424        440  ...        445        441
    mean       41264      41015  ...      47489      48033
    std        11878      12537  ...      15464      15720
    min        10988      11435  ...       2745       2832
    25%        33139      32636  ...      37316      37908
    50%        39544      39684  ...      45385      46057
    75%        47972      48611  ...      56023      56638
    max        91488      93566  ...     122242     127468
    [8 rows x 18 columns]
  7. Remove rows where all of the per capita GDP values are missing.

    Use the subset parameter of dropna to inspect the columns from the second column (index 1, since column indexing is zero-based) through the last column. Use how to specify that we want to drop rows only if all of the columns specified in subset are missing. Use shape to show the number of rows and columns in the resulting data frame:

    >>> percapitaGDP.dropna(subset=percapitaGDP.columns[1:], how="all", inplace=True)
    >>> percapitaGDP.describe()
           pcGDP2001  pcGDP2002  ...  pcGDP2017  pcGDP2018
    count        424        440  ...        445        441
    mean       41264      41015  ...      47489      48033
    std        11878      12537  ...      15464      15720
    min        10988      11435  ...       2745       2832
    25%        33139      32636  ...      37316      37908
    50%        39544      39684  ...      45385      46057
    75%        47972      48611  ...      56023      56638
    max        91488      93566  ...     122242     127468
    [8 rows x 18 columns]
    >>> percapitaGDP.head()
                          metro  pcGDP2001  ...  pcGDP2017  pcGDP2018
    2     AUS01: Greater Sydney      43313  ...      50578      49860
    3  AUS02: Greater Melbourne      40125  ...      43025      42674
    4   AUS03: Greater Brisbane      37580  ...      46876      46640
    5      AUS04: Greater Perth      45713  ...      66424      70390
    6   AUS05: Greater Adelaide      36505  ...      40115      39924
    [5 rows x 19 columns]
    >>> percapitaGDP.shape
    (480, 19)
  8. Set the index for the data frame using the metropolitan area column.

    Confirm that there are 480 valid values for metro and that there are 480 unique values, before setting the index:

    >>> percapitaGDP.metro.count()
    480
    >>> percapitaGDP.metro.nunique()
    480
    >>> percapitaGDP.set_index('metro', inplace=True)
    >>> percapitaGDP.head()
                         pcGDP2001  pcGDP2002  ...  pcGDP2017  pcGDP2018
    metro                                           ...                      
    AUS01: Greater Sydney    43313      44008  ...      50578      49860
    AUS02: Greater Melbourne 40125      40894  ...      43025      42674
    AUS03: Greater Brisbane  37580      37564  ...      46876      46640
    AUS04: Greater Perth     45713      47371  ...      66424      70390
    AUS05: Greater Adelaide  36505      37194  ...      40115      39924
    [5 rows x 18 columns]
    >>> percapitaGDP.loc['AUS02: Greater Melbourne']
    pcGDP2001   40125
    pcGDP2002   40894
    ...
    pcGDP2017   43025
    pcGDP2018   42674
    Name: AUS02: Greater Melbourne, dtype: float64

We have now imported the Excel data into a pandas data frame and cleaned up some of the messiness in the spreadsheet.

How it works…

We mostly manage to get the data we want in step 2 by skipping rows and columns we do not want, but there are still a number of issues: read_excel interprets all of the GDP data as character data, many rows are loaded with no useful data, and the column names do not represent the data well. In addition, the metropolitan area column might be useful as an index, but there are leading and trailing blanks and there may be missing or duplicated values.

read_excel interprets Year as the column name for the metropolitan area data because it looks for a header above the data for that Excel column and finds Year there. We rename that column metro in step 4. We also use strip to fix the problem with leading and trailing blanks. If there had only been leading blanks, we could have used lstrip, or rstrip if there had only been trailing blanks. It is a good idea to assume that there might be leading or trailing blanks in any character data and clean that data shortly after the initial import.

The spreadsheet authors used .. to represent missing data. Since this is actually valid character data, those columns get the object data type (how pandas treats columns with character or mixed data). We coerce a conversion to numeric in step 5. This also results in the original values of .. being replaced with NaN (not a number), pandas' value for missing numbers. This is what we want.

We can fix all of the per capita GDP columns with just a few lines because pandas makes it easy to iterate over the columns of a data frame. By specifying [1:], we iterate from the second column to the last column. We can then change those columns to numeric and rename them to something more appropriate.

There are several reasons why it is a good idea to clean up the column headings for the annual GDP columns: it helps us to remember what the data actually is; if we merge it with other data by metropolitan area, we will not have to worry about conflicting variable names; and we can use attribute access to work with pandas series based on those columns, which I will discuss in more detail in the There's more… section of this recipe.

describe in step 6 shows us that only between 420 and 480 rows have valid data for per capita GDP. When we drop all rows that have missing values for all per capita GDP columns in step 7, we end up with 480 rows in the data frame, which is what we expected.

There's more…

Once we have a pandas data frame, we have the ability to treat columns as more than just columns. We can use attribute access (such as percapitaGDP.metro) or bracket notation (percapitaGDP['metro']) to get the functionality of a pandas series. Either method makes it possible to use series string-inspecting methods such as str.startswith, and counting methods such as nunique. Note that the original 20## column names did not allow for attribute access, because 2001, for example, is not a valid Python identifier (it starts with a number). percapitaGDP.pcGDP2001.count() works, but percapitaGDP.2001.count() returns a syntax error.
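
For instance, these two lines return the same count of non-missing values for the 2001 column; the series methods mentioned above are available either way:

>>> percapitaGDP['pcGDP2001'].count()
>>> percapitaGDP.pcGDP2001.count()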

Pandas is rich with features for string manipulation and for data series operations. We will try many of them out in subsequent recipes. This recipe showed those I find most useful when importing Excel data.

See also

There are good reasons to consider reshaping this data. Instead of 18 columns of GDP per capita data for each metropolitan area, we should have 18 rows of data for each metropolitan area, with columns for year and GDP per capita. Recipes for reshaping data can be found in Chapter 9, Tidying and Reshaping Data.

 

Importing data from SQL databases

In this recipe, we will use the pymssql and mysql.connector APIs to read data from Microsoft SQL Server and MySQL (now owned by Oracle) databases, respectively. Data from sources such as these tends to be well structured, since it is designed to facilitate simultaneous transactions by members of an organization and those who interact with it. Each transaction is also likely related to some other organizational transaction.

This means that although data tables from enterprise systems are more reliably structured than data from CSV files and Excel files, their logic is less likely to be self-contained. You need to know how the data from one table relates to data from another table to understand its full meaning. These relationships need to be preserved, including the integrity of primary and foreign keys, when pulling data. Moreover, well-structured data tables are not necessarily uncomplicated data tables. There are often sophisticated coding schemes that determine data values, and these coding schemes can change over time. For example, codes for staff ethnicity at a retail store chain might be different in 1998 than they are in 2020. Similarly, frequently there are codes for missing values, such as 99999, that pandas will understand as valid values.

Since much of this logic is business logic, and implemented in stored procedures or other applications, it is lost when pulled out of this larger system. Some of what is lost will eventually have to be reconstructed when preparing data for analysis. This almost always involves combining data from multiple tables, so it is important to preserve the ability to do that. But it also may involve adding some of the coding logic back after loading the SQL table into a pandas data frame. We explore how to do that in this recipe.
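
As a small illustration of what adding that coding logic back can look like (the table, column names, and codes here are hypothetical, not the dataset used below), we might map a coded column to labels and convert a sentinel missing-value code to NaN right after loading:

>>> import pandas as pd
>>> import numpy as np
>>> # hypothetical table with a coded column and a 99999 missing-value code
>>> purchases = pd.DataFrame({'regioncode':[1, 2, 3], 'unitssold':[10, 99999, 25]})
>>> purchases['regioncode'] = purchases['regioncode'].\
...     replace({1:'1:east', 2:'2:central', 3:'3:west'})
>>> purchases['unitssold'] = purchases['unitssold'].replace(99999, np.nan)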

Getting ready

This recipe assumes you have the pymssql and mysql.connector APIs installed. If you do not, it is relatively straightforward to install them with pip. From the terminal, or PowerShell (in Windows), enter pip install pymssql or pip install mysql-connector-python.

Note

The dataset used in this recipe is available for public use at https://archive.ics.uci.edu/ml/machine-learning-databases/00320/.

How to do it...

We import SQL Server and MySQL data tables into a pandas data frame as follows:

  1. Import pandas, numpy, pymssql, and mysql.connector.

    This step assumes that you have installed the pymssql and mysql.connector APIs:

    >>> import pandas as pd
    >>> import numpy as np
    >>> import pymssql
    >>> import mysql.connector
  2. Use the pymssql API and read_sql to retrieve and load data from a SQL Server instance.

    Select the columns we want from the SQL Server data and use SQL aliases to improve column names (for example, fedu AS fathereducation). Create a connection to the SQL Server data by passing database credentials to the pymssql connect function. Create a pandas data frame by passing the select statement and connection object to read_sql. Close the connection to return it to the pool on the server:

    >>> query = "SELECT studentid, school, sex, age, famsize,\
    ...   medu AS mothereducation, fedu AS fathereducation,\
    ...   traveltime, studytime, failures, famrel, freetime,\
    ...   goout, g1 AS gradeperiod1, g2 AS gradeperiod2,\
    ...   g3 AS gradeperiod3 From studentmath"
    >>> 
    >>> server = "pdcc.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
    >>> user = "pdccuser"
    >>> password = "pdccpass"
    >>> database = "pdcctest"
    >>> 
    >>> conn = pymssql.connect(server=server,
    ...   user=user, password=password, database=database)
    >>> 
    >>> studentmath = pd.read_sql(query,conn)
    >>> conn.close()
  3. Check the data types and the first few rows:
    >>> studentmath.dtypes
    studentid          object
    school             object
    sex                object
    age                 int64
    famsize            object
    mothereducation     int64
    fathereducation     int64
    traveltime          int64
    studytime           int64
    failures            int64
    famrel              int64
    freetime            int64
    goout               int64
    gradeperiod1        int64
    gradeperiod2        int64
    gradeperiod3        int64
    dtype: object
    >>> studentmath.head()
      studentid school  ... gradeperiod2  gradeperiod3
    0       001     GP  ...            6             6
    1       002     GP  ...            5             6
    2       003     GP  ...            8            10
    3       004     GP  ...           14            15
    4       005     GP  ...           10            10
    [5 rows x 16 columns]
  4. (Alternative) Use the mysql connector and read_sql to get data from MySQL.

    Create a connection to the mysql data and pass that connection to read_sql to retrieve the data and load it into a pandas data frame. (The same data file on student math scores was uploaded to SQL Server and MySQL, so we can use the same SQL SELECT statement we used in the previous step.):

    >>> host = "pdccmysql.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
    >>> user = "pdccuser"
    >>> password = "pdccpass"
    >>> database = "pdccschema"
    >>> connmysql = mysql.connector.connect(host=host,
    ...   database=database,user=user,password=password)
    >>> studentmath = pd.read_sql(query,connmysql)
    >>> connmysql.close()
  5. Rearrange the columns, set an index, and check for missing values.

    Move the grade data to the left of the data frame, just after studentid. Also move the freetime column so that it follows traveltime and studytime. Confirm that each row has an ID and that the IDs are unique, and set studentid as the index:

    >>> newcolorder = ['studentid', 'gradeperiod1', 'gradeperiod2',
    ...   'gradeperiod3', 'school', 'sex', 'age', 'famsize',
    ...   'mothereducation', 'fathereducation', 'traveltime',
    ...   'studytime', 'freetime', 'failures', 'famrel',
    ...   'goout']
    >>> studentmath = studentmath[newcolorder]
    >>> studentmath.studentid.count()
    395
    >>> studentmath.studentid.nunique()
    395
    >>> studentmath.set_index('studentid', inplace=True)
  6. Use the data frame's count function to check for missing values:
    >>> studentmath.count()
    gradeperiod1       395
    gradeperiod2       395
    gradeperiod3       395
    school             395
    sex                395
    age                395
    famsize            395
    mothereducation    395
    fathereducation    395
    traveltime         395
    studytime          395
    freetime           395
    failures           395
    famrel             395
    goout              395
    dtype: int64
  7. Replace coded data values with more informative values.

    Create a dictionary with the replacement values for the columns, and then use replace to set those values:

    >>> setvalues={"famrel":{1:"1:very bad",2:"2:bad",3:"3:neutral",
    ...     4:"4:good",5:"5:excellent"},
    ...   "freetime":{1:"1:very low",2:"2:low",3:"3:neutral",
    ...     4:"4:high",5:"5:very high"},
    ...   "goout":{1:"1:very low",2:"2:low",3:"3:neutral",
    ...     4:"4:high",5:"5:very high"},
    ...   "mothereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
    ...     3:"3:secondary ed",4:"4:higher ed"},
    ...   "fathereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
    ...     3:"3:secondary ed",4:"4:higher ed"}}
     
    >>> studentmath.replace(setvalues, inplace=True)
    >>> setvalueskeys = [k for k in setvalues]
  8. Change the data type to category for the columns whose values we changed.

    Check for any changes in memory usage:

    >>> studentmath[setvalueskeys].memory_usage(index=False)
    famrel             3160
    freetime           3160
    goout              3160
    mothereducation    3160
    fathereducation    3160
    dtype: int64
    >>> for col in studentmath[setvalueskeys].columns:
    ...     studentmath[col] = studentmath[col].astype('category')
    ... 
    >>> studentmath[setvalueskeys].memory_usage(index=False)
    famrel             595
    freetime           595
    goout              595
    mothereducation    587
    fathereducation    587
    dtype: int64
  9. Calculate percentages for values in the famrel column.

    Run value_counts and set normalize to True to generate percentages:

    >>> studentmath['famrel'].value_counts(sort=False, normalize=True)
    1:very bad    0.02
    2:bad         0.05
    3:neutral     0.17
    4:good        0.49
    5:excellent   0.27
    Name: famrel, dtype: float64
  10. Use apply to calculate percentages for multiple columns:
    >>> studentmath[['freetime','goout']].\
    ...   apply(pd.Series.value_counts, sort=False, normalize=True)
                 freetime  goout
    1:very low       0.05   0.06
    2:low            0.16   0.26
    3:neutral        0.40   0.33
    4:high           0.29   0.22
    5:very high      0.10   0.13
    >>> 
    >>> studentmath[['mothereducation','fathereducation']].\
    ...   apply(pd.Series.value_counts, sort=False, normalize=True)
                    mothereducation  fathereducation
    1:k-4                      0.15             0.21
    2:5-9                      0.26             0.29
    3:secondary ed             0.25             0.25
    4:higher ed                0.33             0.24

The preceding steps retrieved a data table from a SQL database, loaded that data into pandas, and did some initial data checking and cleaning.

How it works…

Since data from enterprise systems is typically better structured than CSV or Excel files, we do not need to do things such as skip rows or deal with different logical data types in a column. But some massaging is still usually required before we can begin exploratory analysis. There are often more columns than we need, and some column names are not intuitive or not ordered in the best way for analysis. The meaningfulness of many data values is not stored in the data table, to avoid entry errors and save on storage space. For example, 3 is stored for mother's education rather than secondary education. It is a good idea to reconstruct that coding as early in the cleaning process as possible.

To pull data from a SQL database server, we need a connection object to authenticate us on the server, and a SQL select string. These can be passed to read_sql to retrieve the data and load it into a pandas data frame. I usually use the SQL SELECT statement to do a bit of cleanup of column names at this point. I sometimes also reorder columns, but I do that later in this recipe.

We set the index in step 5, first confirming that every row has a value for studentid and that it is unique. This is often more important when working with enterprise data because we will almost always need to merge the retrieved data with other data files on the system. Although an index is not required for this merging, the discipline of setting one prepares us for the tricky business of merging data down the road. It will also likely improve the speed of the merge.
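
For example, if we later received a second table keyed by the same student IDs (the studentdemo frame below is hypothetical and just for illustration), having studentid as the index on both sides makes the merge straightforward:

>>> # hypothetical demographic data indexed by the same student IDs
>>> studentdemo = pd.DataFrame({'studentid':['001','002'],
...     'region':['north','south']}).set_index('studentid')
>>> studentall = studentmath.merge(studentdemo,
...     left_index=True, right_index=True, how='left')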

We use the data frame's count function to check for missing values and there are none – the non-missing count is 395 (the number of rows) for every column. This is almost too good to be true. There may be values that are logically missing; that is, valid numbers that nonetheless connote missing values, such as -1, 0, 9, or 99. We address this possibility in the next step.

Step 7 demonstrates a useful technique for replacing data values for multiple columns. We create a dictionary to map original values to new values for each column, and then run it using replace. To reduce the amount of storage space taken up by the new verbose values, we convert the data type of those columns to category. We do this by generating a list of the keys of our setvalues dictionary – setvalueskeys = [k for k in setvalues] generates ['famrel', 'freetime', 'goout', 'mothereducation', 'fathereducation']. We then iterate over those five columns and use the astype method to change the data type to category. Notice that the memory usage for those columns is reduced substantially.

Finally, we check the assignment of new values by using value_counts to view relative frequencies. We use apply because we want to run value_counts on multiple columns. To avoid value_counts sorting by frequency, we set sort to False.

The data frame replace method is also a handy tool for dealing with logical missing values that will not be recognized as missing when retrieved by read_sql. 0 values for mothereducation and fathereducation seem to fall into that category. We fix this problem in the setvalues dictionary by indicating that 0 values for mothereducation and fathereducation should be replaced with NaN. It is important to address these kinds of missing values shortly after the initial import because they are not always obvious and can significantly impact all subsequent work.

Users of packages such as SPSS, SAS, and R will notice the difference between this approach and value labels in SPSS, proc format in SAS, and factors in R. In pandas, we need to change the actual data to get more informative values. However, we reduce how much data is actually stored by giving the column a category data type, which is similar to a factor in R.

There's more…

I moved the grade data to near the beginning of the data frame. I find it helpful to have potential target or dependent variables in the leftmost columns, to keep them at the forefront of my thinking. It is also helpful to keep similar columns together. In this example, personal demographic variables (sex, age) are next to one another, as are family variables (mothereducation, fathereducation), and how students spend their time (traveltime, studytime, and freetime).

You could have used map instead of replace in step 7, as in the sketch that follows. Prior to version 0.19.2 of pandas, map was significantly more efficient. Since then, the difference in efficiency has been much smaller. If you are working with a very large dataset, the difference may still be enough to consider using map.
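
Here is a minimal sketch of the map alternative for a single column. One behavioral difference worth noting: map returns NaN for any value that is not a key in the dictionary, whereas replace leaves unmapped values unchanged, so we fill the unmapped positions back in from the original column:

>>> famrelmap = {1:"1:very bad", 2:"2:bad", 3:"3:neutral",
...     4:"4:good", 5:"5:excellent"}
>>> studentmath['famrel'] = studentmath['famrel'].\
...     map(famrelmap).fillna(studentmath['famrel'])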

See also

The recipes in Chapter 8, Addressing Data Issues when Combining DataFrames, go into detail on merging data. We will take a closer look at bivariate and multivariate relationships between variables in Chapter 4, Identifying Missing Values and Outliers in Subsets of Data. We demonstrate how to use some of these same approaches in packages such as SPSS, SAS, and R in subsequent recipes in this chapter.

 

Importing SPSS, Stata, and SAS data

We will use pyreadstat to read data from three popular statistical packages into pandas. The key advantage of pyreadstat is that it allows data analysts to import data from these packages without losing metadata, such as variable and value labels.

The SPSS, Stata, and SAS data files we receive often come to us with the data issues we see in CSV and Excel files and SQL databases already resolved. We do not typically have the invalid column names, changes in data types, and unclear missing values that we can get with CSV or Excel files, nor do we usually get the detachment of data from business logic, such as the meaning of data codes, that we often get with SQL data. When someone or some organization shares a data file from one of these packages with us, they have often added variable labels and value labels for categorical data. For example, a hypothetical data column called presentsat has the variable label overall satisfaction with presentation and value labels 1-5, with 1 being not at all satisfied and 5 being highly satisfied.

The challenge is retaining that metadata when importing data from those systems into pandas. There is no precise equivalent to variable and value labels in pandas, and built-in tools for importing SPSS, Stata, and SAS data lose the metadata. In this recipe, we will use pyreadstat to load variable and value label information and use a couple of techniques for representing that information in pandas.

Getting ready

This recipe assumes you have installed the pyreadstat package. If it is not installed, you can install it with pip. From the terminal, or PowerShell (in Windows), enter pip install pyreadstat. You will need the SPSS, Stata, and SAS data files for this recipe to run the code.

We will work with data from the United States National Longitudinal Survey of Youth (NLS).

Note

The National Longitudinal Survey of Youth is conducted by the United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997 who were born between 1980 and 1985, with annual follow-ups through 2017. For this recipe, I pulled 42 variables on grades, employment, income, and attitudes toward government from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository. NLS data can be downloaded from https://www.nlsinfo.org/investigator/pages/search.

How to do it...

We will import data from SPSS, Stata, and SAS, retaining metadata such as value labels:

  1. Import pandas, numpy, and pyreadstat.

    This step assumes that you have installed pyreadstat:

    >>> import pandas as pd
    >>> import numpy as np
    >>> import pyreadstat
  2. Retrieve the SPSS data.

    Pass a path and filename to the read_sav method of pyreadstat. Display the first few rows and a frequency distribution. Notice that the column names are non-descriptive and the data values are unlabeled codes, and that read_sav creates both a pandas data frame and a meta object:

    >>> nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')
    >>> nls97spss.dtypes
    R0000100    float64
    R0536300    float64
    R0536401    float64
    ...
    U2962900    float64
    U2963000    float64
    Z9063900    float64
    dtype: object
    >>> nls97spss.head()
       R0000100  R0536300  ...  U2963000  Z9063900
    0         1         2  ...       nan        52
    1         2         1  ...         6         0
    2         3         2  ...         6         0
    3         4         2  ...         6         4
    4         5         1  ...         5        12
    [5 rows x 42 columns]
    >>> nls97spss['R0536300'].value_counts(normalize=True)
    1.00   0.51
    2.00   0.49
    Name: R0536300, dtype: float64
  3. Grab the metadata to improve column labels and value labels.

    The metaspss object created when we called read_sav has the column labels and the value labels from the SPSS file. Use the variable_value_labels dictionary to map values to value labels for one column (R0536300). (This does not change the data. It only improves our display when we run value_counts.) Use the set_value_labels method to actually apply the value labels to the data frame:

    >>> metaspss.variable_value_labels['R0536300']
    {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}
    >>> nls97spss['R0536300'].\
    ...   map(metaspss.variable_value_labels['R0536300']).\
    ...   value_counts(normalize=True)
    Male     0.51
    Female   0.49
    Name: R0536300, dtype: float64
    >>> nls97spss = pyreadstat.set_value_labels(nls97spss, metaspss, formats_as_category=True)
  4. Use column labels in the metadata to rename the columns.

    To use the column labels from metaspss in our data frame, we can simply assign the column labels in metaspss to our data frame's column names. Clean up the column names a bit by changing them to lowercase, changing spaces to underscores, and removing all remaining non-alphanumeric characters:

    >>> nls97spss.columns = metaspss.column_labels
    >>> nls97spss['KEY!SEX (SYMBOL) 1997'].value_counts(normalize=True)
    Male     0.51
    Female   0.49
    Name: KEY!SEX (SYMBOL) 1997, dtype: float64
    >>> nls97spss.dtypes
    PUBID - YTH ID CODE 1997                        float64
    KEY!SEX (SYMBOL) 1997                          category
    KEY!BDATE M/Y (SYMBOL) 1997                     float64
    KEY!BDATE M/Y (SYMBOL) 1997                     float64
    CV_SAMPLE_TYPE 1997                            category
    KEY!RACE_ETHNICITY (SYMBOL) 1997               category
    ...
    HRS/WK R WATCHES TELEVISION 2017               category
    HRS/NIGHT R SLEEPS 2017                         float64
    CVC_WKSWK_YR_ALL L99                            float64
    dtype: object
    >>> nls97spss.columns = nls97spss.columns.\
    ...     str.lower().\
    ...     str.replace(' ','_').\
    ...     str.replace('[^a-z0-9_]', '')
    >>> nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
  5. Simplify the process by applying the value labels from the beginning.

    The value labels can actually be applied in the initial call to read_sav by setting apply_value_formats to True. This eliminates the need to call the set_value_labels function later:

    >>> nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav', apply_value_formats=True, formats_as_category=True)
    >>> nls97spss.columns = metaspss.column_labels
    >>> nls97spss.columns = nls97spss.columns.\
    ...   str.lower().\
    ...   str.replace(' ','_').\
    ...   str.replace('[^a-z0-9_]', '')
  6. Show the columns and a few rows:
    >>> nls97spss.dtypes
    pubid__yth_id_code_1997                        float64
    keysex_symbol_1997                            category
    keybdate_my_symbol_1997                        float64
    keybdate_my_symbol_1997                        float64
    ...
    hrsnight_r_sleeps_2017                         float64
    cvc_wkswk_yr_all_l99                           float64
    dtype: object
    >>> nls97spss.head()
       pubid__yth_id_code_1997 keysex_symbol_1997  ...  \
    0                        1             Female  ...   
    1                        2               Male  ...   
    2                        3             Female  ...   
    3                        4             Female  ...   
    4                        5               Male  ...   
       hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99  
    0                     nan                    52  
    1                       6                     0  
    2                       6                     0  
    3                       6                     4  
    4                       5                    12  
    [5 rows x 42 columns]
  7. Run frequencies on one of the columns and set the index:
    >>> nls97spss.govt_responsibility__provide_jobs_2006.\
    ...   value_counts(sort=False)
    Definitely should be        454
    Definitely should not be    300
    Probably should be          617
    Probably should not be      462
    Name: govt_responsibility__provide_jobs_2006, dtype: int64
    >>> nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
  8. Import the Stata data, apply value labels, and improve the column headings.

    Use the same methods for the Stata data that we use for the SPSS data:

    >>> nls97stata, metastata = pyreadstat.read_dta('data/nls97.dta', apply_value_formats=True, formats_as_category=True)
    >>> nls97stata.columns = metastata.column_labels
    >>> nls97stata.columns = nls97stata.columns.\
    ...     str.lower().\
    ...     str.replace(' ','_').\
    ...     str.replace('[^a-z0-9_]', '')
    >>> nls97stata.dtypes
    pubid__yth_id_code_1997                        float64
    keysex_symbol_1997                            category
    keybdate_my_symbol_1997                        float64
    keybdate_my_symbol_1997                        float64
    ...
    hrsnight_r_sleeps_2017                         float64
    cvc_wkswk_yr_all_l99                           float64
    dtype: object
  9. View a few rows of the data and run a frequency distribution:
    >>> nls97stata.head()
       pubid__yth_id_code_1997 keysex_symbol_1997  ...  \
    0                        1             Female  ...   
    1                        2               Male  ...   
    2                        3             Female  ...   
    3                        4             Female  ...   
    4                        5               Male  ...   
       hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99  
    0                      -5                    52  
    1                       6                     0  
    2                       6                     0  
    3                       6                     4  
    4                       5                    12  
    [5 rows x 42 columns] 
    >>> nls97stata.govt_responsibility__provide_jobs_2006.\
    ...   value_counts(sort=False)
    -5.0                        1425
    -4.0                        5665
    -2.0                          56
    -1.0                           5
    Definitely should be         454
    Definitely should not be     300
    Probably should be           617
    Probably should not be       462
    Name: govt_responsibility__provide_jobs_2006, dtype: int64
  10. Fix the logical missing values that show up with the Stata data and set an index:
    >>> nls97stata.min()
    pubid__yth_id_code_1997                        1
    keysex_symbol_1997                        Female
    keybdate_my_symbol_1997                        1
    keybdate_my_symbol_1997                    1,980
    ...
    cv_bio_child_hh_2017                          -5
    cv_bio_child_nr_2017                          -5
    hrsnight_r_sleeps_2017                        -5
    cvc_wkswk_yr_all_l99                          -4
    dtype: object
    >>> nls97stata.replace(list(range(-9,0)), np.nan, inplace=True)
    >>> nls97stata.min()
    pubid__yth_id_code_1997                        1
    keysex_symbol_1997                        Female
    keybdate_my_symbol_1997                        1
    keybdate_my_symbol_1997                    1,980
    ...
    cv_bio_child_hh_2017                           0
    cv_bio_child_nr_2017                           0
    hrsnight_r_sleeps_2017                         0
    cvc_wkswk_yr_all_l99                           0
    dtype: object
    >>> nls97stata.set_index('pubid__yth_id_code_1997', inplace=True)
  11. Retrieve the SAS data, using the SAS catalog file for value labels.

    The value labels for the SAS data are stored in a catalog file. Passing the catalog file path and filename retrieves the value labels and applies them:

    >>> nls97sas, metasas = pyreadstat.read_sas7bdat('data/nls97.sas7bdat', catalog_file='data/nlsformats3.sas7bcat', formats_as_category=True)
    >>> nls97sas.columns = metasas.column_labels
    >>> 
    >>> nls97sas.columns = nls97sas.columns.\
    ...     str.lower().\
    ...     str.replace(' ','_').\
    ...     str.replace('[^a-z0-9_]', '')
    >>> 
    >>> nls97sas.head()
       pubid__yth_id_code_1997 keysex_symbol_1997  ...  \
    0                        1             Female  ...   
    1                        2               Male  ...   
    2                        3             Female  ...   
    3                        4             Female  ...   
    4                        5               Male  ...   
       hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99  
    0                     nan                    52  
    1                       6                     0  
    2                       6                     0  
    3                       6                     4  
    4                       5                    12  
    [5 rows x 42 columns]
    >>> nls97sas.keysex_symbol_1997.value_counts()
    Male      4599
    Female    4385
    Name: keysex_symbol_1997, dtype: int64
    >>> nls97sas.set_index('pubid__yth_id_code_1997', inplace=True)

This demonstrates how to import SPSS, SAS, and Stata data without losing important metadata.

How it works...

The read_sav, read_dta, and read_sas7bdat methods of pyreadstat, for SPSS, Stata, and SAS data files, respectively, work in a similar manner. Value labels can be applied when reading in the data by setting apply_value_formats to True for SPSS and Stata files (steps 5 and 8), or by providing a catalog file path and filename for SAS (step 11). We can set formats_as_category to True to change the data type to category for those columns where the data values will change. The meta object has the column names and the column labels from the statistical package, so metadata column labels can be assigned to pandas data frame column names at any point (nls97spss.columns = metaspss.column_labels). We can even revert to the original column headings after assigning meta column labels to them by setting pandas column names to the metadata column names (nls97spss.columns = metaspss.column_names).

In step 3, we read the SPSS data without applying value labels. We looked at the dictionary for one variable (metaspss.variable_value_labels['R0536300']), but we could have viewed it for all variables (metaspss.variable_value_labels). When we are satisfied that the labels make sense, we can set them by calling the set_value_labels function. This is a good approach when you do not know the data well and want to inspect the labels before applying them.

The column labels from the meta object are often a better choice than the original column headings. Column headings can be quite cryptic, particularly when the SPSS, Stata, or SAS file is based on a large survey, as in this example. But the labels are not usually ideal for column headings either. They sometimes have spaces, capitalization that is not helpful, and non-alphanumeric characters. We chain some string operations to switch to lowercase, replace spaces with underscores, and remove non-alphanumeric characters.

Handling missing values is not always straightforward with these data files, since there are often many reasons why data is missing. If the file is from a survey, the missing value may be because of a survey skip pattern, or a respondent failed to respond, or the response was invalid, and so on. The NLS has 9 possible values for missing, from -1 to -9. The SPSS import automatically set those values to NaN, while the Stata import retained the original values. (We could have gotten the SPSS import to retain those values by setting user_missing to True.) For the Stata data, we need to tell it to replace all values from -1 to -9 with NaN. We do this by using the data frame's replace function and passing it a list of integers from -9 to -1 (list(range(-9,0))).
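
If we wanted the SPSS import to keep those missing-value codes instead, a one-line sketch (again with the assumed data/nls97.sav path) would be:

>>> # keep the survey's missing-value codes (-9 to -1) rather than converting them to NaN
>>> nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav', user_missing=True)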

There's more…

You may have noticed similarities between this recipe and the previous one in terms of how value labels are set. The set_value_labels function is like the data frame replace operation we used to set value labels in that recipe. We passed a dictionary to replace that mapped columns to value labels. The set_value_labels function in this recipe essentially does the same thing, using the variable_value_labels property of the meta object as the dictionary.
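
To make the parallel concrete, the following sketch, which we do not run in this recipe, would perform the same mapping with replace, although unlike set_value_labels it would not also change the data types:

>>> # variable_value_labels is a nested {column: {value: label}} dictionary,
>>> # which is exactly the form that the data frame replace method accepts
>>> nls97spss.replace(metaspss.variable_value_labels, inplace=True)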

Data from statistical packages is often not as well structured as data from SQL databases in one significant way. Since statistical packages are designed to facilitate analysis, their files often violate database normalization rules. There is often an implied relational structure that might have to be unflattened at some point. For example, the data may combine individual-level and event-level data, such as a person and their hospital visits, or a brown bear and the dates it emerged from hibernation. Often, this data will need to be reshaped for some aspects of the analysis.
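
As a hypothetical illustration of that kind of reshaping, a flattened extract with one weeks-worked column per year can be melted into person-year rows:

>>> import pandas as pd
>>> # hypothetical flattened extract: one row per person, one column per year
>>> flat = pd.DataFrame({'personid':[1,2],
...   'weeksworked2016':[48,52], 'weeksworked2017':[50,52]})
>>> # reshape to one row per person and year
>>> longform = pd.wide_to_long(flat, stubnames='weeksworked',
...   i='personid', j='year').sort_index()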

See also

The pyreadstat package is nicely documented at https://github.com/Roche/pyreadstat. The package has many useful options for selecting columns and handling missing data that space did not permit me to demonstrate in this recipe.

 

Importing R data

We will use pyreadr to read an R data file into pandas. Since pyreadr cannot capture the metadata, we will write code to reconstruct value labels (analogous to R factors) and column headings. This is similar to what we did in the Importing data from SQL databases recipe.

The R statistical package is, in many ways, similar to the combination of Python and pandas, at least in its scope. Both have strong tools across a range of data preparation and data analysis tasks. Some data scientists work with both R and Python, perhaps doing data manipulation in Python and statistical analysis in R, or vice-versa, depending on their preferred packages. But there is currently a scarcity of tools for reading data saved in R, as rds or rdata files, into Python. The analyst often saves the data as a CSV file first, and then loads the CSV file into Python. We will use pyreadr, from the same author as pyreadstat, because it does not require an installation of R.

When we receive an R file, or work with one we have created ourselves, we can count on it being fairly well structured, at least compared to CSV or Excel files. Each column will have only one data type, column headings will have appropriate names for Python variables, and all rows will have the same structure. However, we may need to restore some of the coding logic, as we did when working with SQL data.

Getting ready

This recipe assumes you have installed the pyreadr package. If it is not installed, you can install it with pip. From a terminal, or PowerShell (in Windows), enter pip install pyreadr. You will need the R rds file for this recipe in order to run the code.

We will again work with the National Longitudinal Survey in this recipe.

How to do it…

We will import data from R without losing important metadata:

  1. Load pandas, numpy, pprint, and the pyreadr package:
    >>> import pandas as pd
    >>> import numpy as np
    >>> import pyreadr
    >>> import pprint
  2. Get the R data.

    Pass the path and filename to the read_r method to retrieve the R data and load it into memory as a pandas data frame. read_r can return one or more objects. When reading an rds file (as opposed to an rdata file), it will return one object, having the key None. We indicate None to get the pandas data frame:

    >>> nls97r = pyreadr.read_r('data/nls97.rds')[None]
    >>> nls97r.dtypes
    R0000100    int32
    R0536300    int32
    ...
    U2962800    int32
    U2962900    int32
    U2963000    int32
    Z9063900    int32
    dtype: object
    >>> nls97r.head(10)
       R0000100  R0536300  R0536401  ...  U2962900  U2963000  Z9063900
    0         1         2         9  ...        -5        -5        52
    1         2         1         7  ...         2         6         0
    2         3         2         9  ...         2         6         0
    3         4         2         2  ...         2         6         4
    4         5         1        10  ...         2         5        12
    5         6         2         1  ...         2         6         6
    6         7         1         4  ...        -5        -5         0
    7         8         2         6  ...        -5        -5        39
    8         9         1        10  ...         2         4         0
    9        10         1         3  ...         2         6         0
    [10 rows x 42 columns]
  3. Set up dictionaries for value labels and column headings.

    Load a dictionary that maps columns to the value labels and create a list of preferred column names as follows:

    >>> with open('data/nlscodes.txt', 'r') as reader:
    ...     setvalues = eval(reader.read())
    ... 
    >>> pprint.pprint(setvalues)
    {'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},
     'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'},
     'S8646900': {1.0: '1. Definitely',
                  2.0: '2. Probably ',
                  3.0: '3. Probably not',
                  4.0: '4. Definitely not'}}
    ...
    >>> newcols = ['personid','gender','birthmonth','birthyear',
    ...   'sampletype',  'category','satverbal','satmath',
    ...   'gpaoverall','gpaeng','gpamath','gpascience','govjobs',
    ...   'govprices','govhealth','goveld','govind','govunemp',
    ...   'govinc','govcollege','govhousing','govenvironment',
    ...   'bacredits','coltype1','coltype2','coltype3','coltype4',
    ...   'coltype5','coltype6','highestgrade','maritalstatus',
    ...   'childnumhome','childnumaway','degreecol1',
    ...   'degreecol2','degreecol3','degreecol4','wageincome',
    ...   'weeklyhrscomputer','weeklyhrstv',
    ...   'nightlyhrssleep','weeksworkedlastyear']
  4. Set value labels and missing values, and change selected columns to category data type.

    Use the setvalues dictionary to replace existing values with value labels. Replace all values from -9 to -1 with NaN:

    >>> nls97r.replace(setvalues, inplace=True)
    >>> nls97r.head()
       R0000100 R0536300  ...  U2963000  Z9063900
    0         1   Female  ...        -5        52
    1         2     Male  ...         6         0
    2         3   Female  ...         6         0
    3         4   Female  ...         6         4
    4         5     Male  ...         5        12
    [5 rows x 42 columns]
    >>> nls97r.replace(list(range(-9,0)), np.nan, inplace=True)
    >>> for col in nls97r[[k for k in setvalues]].columns:
    ...     nls97r[col] = nls97r[col].astype('category')
    ... 
    >>> nls97r.dtypes
    R0000100       int64
    R0536300    category
    R0536401       int64
    R0536402       int64
    R1235800    category
                  ...   
    U2857300    category
    U2962800    category
    U2962900    category
    U2963000     float64
    Z9063900     float64
    Length: 42, dtype: object
  5. Set meaningful column headings:
    >>> nls97r.columns = newcols
    >>> nls97r.dtypes
    personid                  int64
    gender                 category
    birthmonth                int64
    birthyear                 int64
    sampletype             category
                             ...   
    wageincome             category
    weeklyhrscomputer      category
    weeklyhrstv            category
    nightlyhrssleep         float64
    weeksworkedlastyear     float64
    Length: 42, dtype: object

This shows how R data files can be imported into pandas and value labels assigned.

How it works…

Reading R data into pandas with pyreadr is fairly straightforward. Passing a filename to the read_r function is all that is required. Since read_r can return multiple objects with one call, we need to specify which object. When reading an rds file (as opposed to an rdata file), only one object is returned. It has the key None.

In step 3, we load a dictionary that maps our variables to value labels, and a list for our preferred column headings. In step 4 we apply the value labels. We also change the data type to category for the columns where we applied the values. We do this by generating a list of the keys of our setvalues dictionary with [k for k in setvalues] and then iterating over those columns.

We change the column headings in step 5 to ones that are more intuitive. Note that the order matters here. We need to set the value labels before changing the column names, since the setvalues dictionary is based on the original column headings.

The main advantage of using pyreadr to read R files directly into pandas is that we do not have to convert the R data into a CSV file first. Once we have written our Python code to read the file, we can just rerun it whenever the R data changes. This is particularly helpful when we do not have R on the machine where we are working.

There's more…

pyreadr is able to return multiple data frames. This is useful when we save several data objects in R as an rdata file. We can return all of them with one call.
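
For example, if several data objects had been saved together in an rdata file (the filename here is hypothetical), read_r would return an ordered dictionary with one pandas data frame per saved R object:

>>> import pyreadr
>>> results = pyreadr.read_r('data/nls97objects.rdata')
>>> # the dictionary is keyed by the name of each R object
>>> for name, df in results.items():
...     print(name, df.shape)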

pprint is a handy tool for improving the display of Python dictionaries.

See also

Clear instructions and examples for pyreadr are available at https://github.com/ofajardo/pyreadr.

Feather files, a relatively new format, can be read by both R and Python. I discuss those files in the next recipe.

We could have used rpy2 instead of pyreadr to import R data. rpy2 requires that R also be installed, but it is more powerful than pyreadr. It will read R factors and automatically set the factor labels as values in the pandas data frame. See the following code:

>>> import rpy2.robjects as robjects
>>> from rpy2.robjects import pandas2ri
>>> pandas2ri.activate()
>>> readRDS = robjects.r['readRDS']
>>> nls97withvalues = readRDS('data/nls97withvalues.rds')
>>> nls97withvalues
      R0000100 R0536300  R0536401  ...                   U2962900    U2963000
1            1   Female         9  ...                        NaN -2147483648
2            2     Male         7  ...       3 to 10 hours a week           6
3            3   Female         9  ...       3 to 10 hours a week           6
4            4   Female         2  ...       3 to 10 hours a week           6
5            5     Male        10  ...       3 to 10 hours a week           5
...        ...      ...       ...  ...                        ...         ...
8980      9018   Female         3  ...       3 to 10 hours a week           4
8981      9019     Male         9  ...       3 to 10 hours a week           6
8982      9020     Male         7  ...                        NaN -2147483648
8983      9021     Male         7  ...       3 to 10 hours a week           7
8984      9022   Female         1  ... Less than 2 hours per week           7
[8984 rows x 42 columns]

This generates unusual -2147483648 values. That is what happens when readRDS encounters missing data in numeric columns. A global replace of that number with NaN, after confirming that it is not a valid value, would be a good next step.
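
A minimal sketch of that cleanup step, treating nls97withvalues as the pandas data frame returned above:

>>> import numpy as np
>>> # treat the integer sentinel as missing, having confirmed it is not a valid value
>>> nls97withvalues.replace(-2147483648, np.nan, inplace=True)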

 

Persisting tabular data

We persist data, copying it from memory to local or remote storage, for several reasons: to be able to access the data without having to repeat the steps we used to generate it; to share the data with others; or to make it available for use with different software. In this recipe, we save data that we have loaded into a pandas data frame as different file types (CSV, Excel, pickle, and feather).

Another important, but sometimes overlooked, reason to persist data is to preserve some segment of our data that needs to be examined more closely; perhaps it needs to be scrutinized by others before our analysis can be completed. For analysts who work with operational data in medium- to large-sized organizations, this process is part of the daily data cleaning workflow.

In addition to these reasons for persisting data, our decisions about when and how to serialize data are shaped by several other factors: where we are in terms of our data analysis projects, the hardware and software resources of the machine(s) saving and reloading the data, and the size of our dataset. Analysts end up having to be much more intentional when saving data than they are when pressing Ctrl + S in their word processing applications.

Once we persist data, it is stored separately from the logic that we used to create it. I find this to be one of the most important threats to the integrity of our analysis. Often, we end up loading data that we saved some time in the past (a week ago? a month ago? a year ago?) and forget how a variable was defined and how it relates to other variables. If we are in the middle of a data cleaning task, it is best not to persist our data, so long as our workstation and network can easily handle the burden of regenerating the data. It is a good idea to persist data only once we have reached milestones in our work.

Beyond the question of when to persist data, there is the question of how. If we are persisting it for our own reuse with the same software, it is best to save it in a binary format native to that software. That is pretty straightforward for tools such as SPSS, SAS, Stata, and R, but not so much for pandas. But that is good news in a way. We have lots of choices, from CSV and Excel to pickle and feather. We save to all these file types in this recipe.

Getting ready

You will need to install pyarrow if you do not have it on your system, since pandas uses it to write feather files. You can do that by entering pip install pyarrow in a terminal window or PowerShell (in Windows). If you do not already have a subfolder named views in your chapter 1 folder, you will need to create it in order to run the code for this recipe.

Note

This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-4. This is just a 100,000-row sample of the full dataset, which is also available in the repository.

How to do it…

We will load a CSV file into pandas and then save it as a pickle file and as a feather file. We will also save subsets of the data in CSV and Excel formats:

  1. Import pandas and pyarrow.

    The pyarrow package needs to be installed for pandas to save data frames to feather files (we import it here to confirm that it is available):

    >>> import pandas as pd
    >>> import pyarrow
  2. Load the land temperatures CSV file into pandas, drop rows with missing data, and set an index:
    >>> landtemps = pd.read_csv('data/landtempssample.csv',
    ...     names=['stationid','year','month','avgtemp','latitude',
    ...       'longitude','elevation','station','countryid','country'],
    ...     skiprows=1,
    ...     parse_dates=[['month','year']],
    ...     low_memory=False)
    >>> 
    >>> landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
    >>> landtemps.dropna(subset=['avgtemp'], inplace=True)
    >>> landtemps.dtypes
    measuredate    datetime64[ns]
    stationid              object
    avgtemp               float64
    latitude              float64
    longitude             float64
    elevation             float64
    station                object
    countryid              object
    country                object
    dtype: object
    >>> landtemps.set_index(['measuredate','stationid'], inplace=True)
  3. Write extreme values for temperature to CSV and Excel files.

    Use the quantile method to select outlier rows, those at the 1-in-1,000 level at each end of the distribution:

    >>> extremevals = landtemps[(landtemps.avgtemp < landtemps.avgtemp.quantile(.001)) | (landtemps.avgtemp > landtemps.avgtemp.quantile(.999))]
    >>> extremevals.shape
    (171, 7)
    >>> extremevals.sample(7)
                             avgtemp  ...  country
    measuredate stationid             ...         
    2013-08-01  QAM00041170    35.30  ...    Qatar
    2005-01-01  RSM00024966   -40.09  ...   Russia
    1973-03-01  CA002401200   -40.26  ...   Canada
    2007-06-01  KU000405820    37.35  ...   Kuwait
    1987-07-01  SUM00062700    35.50  ...    Sudan
    1998-02-01  RSM00025325   -35.71  ...   Russia
    1968-12-01  RSM00024329   -43.20  ...   Russia
    [7 rows x 7 columns]
    >>> extremevals.to_excel('views/tempext.xlsx')
    >>> extremevals.to_csv('views/tempext.csv')
  4. Save to pickle and feather files.

    The index needs to be reset in order to save a feather file:

    >>> landtemps.to_pickle('data/landtemps.pkl')
    >>> landtemps.reset_index(inplace=True)
    >>> landtemps.to_feather("data/landtemps.ftr")
  5. Load the pickle and feather files we just saved.

    Notice that our index was preserved when saving and loading the pickle file:

    >>> landtemps = pd.read_pickle('data/landtemps.pkl')
    >>> landtemps.head(2).T
    measuredate     2000-04-01  1940-05-01
    stationid      USS0010K01S CI000085406
    avgtemp               5.27       18.04
    latitude             39.90      -18.35
    longitude          -110.75      -70.33
    elevation         2,773.70       58.00
    station      INDIAN_CANYON       ARICA
    countryid               US          CI
    country      United States       Chile
    >>> landtemps = pd.read_feather("data/landtemps.ftr")
    >>> landtemps.head(2).T
                                   0                    1
    measuredate  2000-04-01 00:00:00  1940-05-01 00:00:00
    stationid            USS0010K01S          CI000085406
    avgtemp                     5.27                18.04
    latitude                   39.90               -18.35
    longitude                -110.75               -70.33
    elevation               2,773.70                58.00
    station            INDIAN_CANYON                ARICA
    countryid                     US                   CI
    country            United States                Chile

The previous steps demonstrate how to save subsets of a data frame to CSV and Excel files, and how to serialize full data frames with two different formats, pickle and feather.

How it works...

Persisting pandas data is fairly straightforward. Data frames have to_csv, to_excel, to_pickle, and to_feather methods. Pickling preserves our index.

There's more...

The advantage of storing data in CSV files is that saving it uses up very little additional memory. The disadvantage is that writing CSV files is quite slow and we lose important metadata, such as data types. (read_csv can often figure out the data type when we reload the file, but not always.) Pickle files keep that data, but can burden a system that is low on resources when serializing. Feather is easier on resources, and can be easily loaded in R as well as Python, but we have to sacrifice our index in order to serialize. Also, the authors of feather make no promises regarding long-term support.
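
To see the data type point concretely, reloading the CSV subset we saved in step 3, without any parse_dates help, brings measuredate back as an ordinary object column, while the pickle round trip in step 5 preserved both the datetime type and the index:

>>> import pandas as pd
>>> tempext = pd.read_csv('views/tempext.csv')
>>> # measuredate returns as object; the old MultiIndex is now two ordinary columns
>>> tempext.dtypes.head(2)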

You may have noticed that I do not make a recommendation about what to use for data serialization – other than to limit your persistence of full datasets to project milestones. This is definitely one of those "right tools for the right job" kind of situations. I use CSV or Excel files when I want to share a segment of a file with colleagues for discussion. I use feather for ongoing Python projects, particularly when I am using a machine with sub-par RAM and an outdated chip, and I am also using R. When I am wrapping up a project, I pickle the data frames.

About the Author

  • Michael Walker

    Michael Walker has worked as a data analyst for over 30 years at a variety of educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. He generates public sector and foundation reports and conducts analyses for publication in academic journals.

