Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas
Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on) provide analysts with an impressive range of data manipulation, exploration, and visualization tools. One important tool is pandas. Developed by Wes McKinney in 2008, but really gaining in popularity after 2012, pandas is now an essential library for data analysis in Python. We work with pandas extensively in this book, along with popular packages such as numpy, matplotlib, and scipy.
A key pandas object is the data frame, which represents data as a tabular structure, with rows and columns. In this way, it is similar to the other data stores we discuss in this chapter. However, a pandas data frame also has indexing functionality that makes selecting, combining, and transforming data relatively straightforward, as the recipes in this book will demonstrate.
Before we can make use of this great functionality, we have to get our data into pandas. Data comes to us in a wide variety of formats: as CSV or Excel files, as tables from SQL databases, from statistical analysis packages such as SPSS, Stata, SAS, or R, from non-tabular sources such as JSON, and from web pages.
We examine tools for importing tabular data in this recipe. Specifically, we cover the following topics:
- Importing CSV files
- Importing Excel files
- Importing data from SQL databases
- Importing SPSS, Stata, and SAS data
- Importing R data
- Persisting tabular data
Technical requirements
The code and notebooks for this chapter are available on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook
Importing CSV files
The read_csv method of the pandas library can be used to read a file with comma-separated values (CSV) and load it into memory as a pandas data frame. In this recipe, we read a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data.
Raw data is often stored as CSV files. These files have a line break at the end of each line of data to demarcate a row, and a comma between each data value to delineate columns. Something other than a comma can be used as the delimiter, such as a tab. Quotation marks may be placed around values, which is helpful when the delimiter occurs naturally within a value, as sometimes happens with commas.
All data in a CSV file is stored as characters, regardless of the logical data type. This is why it is easy to view a CSV file, presuming it is not too large, in a text editor. The pandas read_csv method will make an educated guess about the data type of each column, but you will need to help it along to ensure that these guesses are on the mark.
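For example, here is a minimal sketch of helping read_csv along with the dtype parameter; the choice of columns to override is just for illustration:

```python
import pandas as pd

# Keep the identifier columns as strings rather than letting pandas guess,
# and read the temperature column explicitly as a float.
landtemps = pd.read_csv('data/landtempssample.csv',
                        dtype={'locationid': str, 'countryid': str,
                               'temp': float})
print(landtemps.dtypes)
```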
Getting ready
Create a folder for this chapter and create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder and place the landtempssample.csv file in that subfolder. Alternatively, you could retrieve all of the files from the GitHub repository. Here is a code sample from the beginning of the CSV file:
locationid,year,month,temp,latitude,longitude,stnelev,station,countryid,country
USS0010K01S,2000,4,5.27,39.9,-110.75,2773.7,INDIAN_CANYON,US,United States
CI000085406,1940,5,18.04,-18.35,-70.333,58.0,ARICA,CI,Chile
USC00036376,2013,12,6.22,34.3703,-91.1242,61.0,SAINT_CHARLES,US,United States
ASN00024002,1963,2,22.93,-34.2833,140.6,65.5,BERRI_IRRIGATION,AS,Australia
ASN00028007,2001,11,,-14.7803,143.5036,79.4,MUSGRAVE,AS,Australia
Note
This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-4. This is just a 100,000-row sample of the full dataset, which is also available in the repository.
How to do it…
We will import a CSV file into pandas, taking advantage of some very useful read_csv options:
- Import the pandas library and set up the environment to make viewing the output easier:

>>> import pandas as pd
>>> pd.options.display.float_format = '{:,.2f}'.format
>>> pd.set_option('display.width', 85)
>>> pd.set_option('display.max_columns', 8)
- Read the data file, set new names for the headings, and parse the date column.
Pass an argument of 1 to the skiprows parameter to skip the first row, pass a list of columns to parse_dates to create a pandas datetime column from those columns, and set low_memory to False so that read_csv reads the file all at once when inferring column data types, rather than in chunks:

>>> landtemps = pd.read_csv('data/landtempssample.csv',
...     names=['stationid','year','month','avgtemp','latitude',
...       'longitude','elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)
>>> type(landtemps)
<class 'pandas.core.frame.DataFrame'>
- Get a quick glimpse of the data.
View the first few rows. Show the data type for all columns, as well as the number of rows and columns:
>>> landtemps.head(7)
  month_year    stationid  ...  countryid        country
0 2000-04-01  USS0010K01S  ...         US  United States
1 1940-05-01  CI000085406  ...         CI          Chile
2 2013-12-01  USC00036376  ...         US  United States
3 1963-02-01  ASN00024002  ...         AS      Australia
4 2001-11-01  ASN00028007  ...         AS      Australia
5 1991-04-01  USW00024151  ...         US  United States
6 1993-12-01  RSM00022641  ...         RS         Russia
[7 rows x 9 columns]
>>> landtemps.dtypes
month_year    datetime64[ns]
stationid             object
avgtemp              float64
latitude             float64
longitude            float64
elevation            float64
station               object
countryid             object
country               object
dtype: object
>>> landtemps.shape
(100000, 9)
- Give the date column a better name and view the summary statistics for average monthly temperature:
>>> landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
>>> landtemps.dtypes
measuredate    datetime64[ns]
stationid              object
avgtemp               float64
latitude              float64
longitude             float64
elevation             float64
station                object
countryid              object
country                object
dtype: object
>>> landtemps.avgtemp.describe()
count   85,554.00
mean        10.92
std         11.52
min        -70.70
25%          3.46
50%         12.22
75%         19.57
max         39.95
Name: avgtemp, dtype: float64
- Look for missing values for each column.
Use isnull, which returns True for each value that is missing for each column, and False when not missing. Chain this with sum to count the missings for each column. (When working with Boolean values, sum treats True as 1 and False as 0. I will discuss method chaining in the There's more... section of this recipe):

>>> landtemps.isnull().sum()
measuredate        0
stationid          0
avgtemp        14446
latitude           0
longitude          0
elevation          0
station            0
countryid          0
country            5
dtype: int64
- Remove rows with missing data for avgtemp.

Use the subset parameter to tell dropna to drop rows where avgtemp is missing. Set inplace to True. Leaving inplace at its default value of False would return a new data frame with those rows dropped, but the changes would not be retained in landtemps. Use the shape attribute of the data frame to get the number of rows and columns:

>>> landtemps.dropna(subset=['avgtemp'], inplace=True)
>>> landtemps.shape
(85554, 9)
That's it! Importing CSV files into pandas is as simple as that.
How it works...
Almost all of the recipes in this book use the pandas library. We refer to it as pd to make it easier to reference later. This is customary. We also use float_format to display float values in a readable way and set_option to make the terminal output wide enough to accommodate the number of variables.
Much of the work is done by the first line in step 2. We use read_csv to load a pandas data frame in memory and call it landtemps. In addition to passing a filename, we set the names parameter to a list of our preferred column headings. We also tell read_csv to skip the first row, by setting skiprows to 1, since the original column headings are in the first row of the CSV file. If we do not tell it to skip the first row, read_csv will treat the header row in the file as actual data.
read_csv also solves a date conversion issue for us. We use the parse_dates parameter to ask it to convert the month and year columns to a date value.
Step 3 runs through a few standard data checks. We use head(7) to print out all columns for the first 7 rows. We use the dtypes attribute of the data frame to show the data type of all columns. Each column has the expected data type. In pandas, character data has the object data type, a data type that allows for mixed values. shape returns a tuple, whose first element is the number of rows in the data frame (100,000 in this case) and whose second element is the number of columns (9).
When we used read_csv to parse the month and year columns, it gave the resulting column the name month_year. We use the rename method in step 4 to give that column a better name. We need to specify inplace=True to replace the old column name with the new column name in memory. The describe method provides summary statistics on the avgtemp column.
Notice that the count for avgtemp indicates that there are 85,554 rows that have valid values for avgtemp. This is out of 100,000 rows for the whole data frame, as provided by the shape attribute. The listing of missing values for each column in step 5 (landtemps.isnull().sum()) confirms this: 100,000 – 85,554 = 14,446.
Step 6 drops all rows where avgtemp is NaN. (The NaN value, not a number, is the pandas representation of missing values.) subset is used to indicate which column to check for missings. The shape attribute for landtemps now indicates that there are 85,554 rows, which is what we would expect given the previous count from describe.
There's more...
If the file you are reading uses a delimiter other than a comma, such as a tab, this can be specified in the sep parameter of read_csv. When creating the pandas data frame, an index was also created. The numbers to the far left of the output when head or sample is run are index values. Any number of rows can be specified for head or sample. The default value is 5.
By default, read_csv parses data in chunks, which is easier on systems with lower memory when working with larger files, although it can occasionally lead to inconsistent data type inference within a column. Setting low_memory to False turns off that chunked parsing when data types are inferred. Either way, the full data frame will still be loaded into memory once read_csv completes successfully.
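If memory is a genuine constraint, a related option is to process the file explicitly in pieces with the chunksize parameter. A minimal sketch, with an arbitrary chunk size of 10,000 rows and keeping only two columns:

```python
import pandas as pd

# Read 10,000 rows at a time and keep only the columns we need from each chunk.
chunks = []
for chunk in pd.read_csv('data/landtempssample.csv', chunksize=10000):
    chunks.append(chunk[['temp', 'country']])

landtemps_slim = pd.concat(chunks, ignore_index=True)
print(landtemps_slim.shape)
```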
The landtemps.isnull().sum() statement is an example of chaining methods. First, isnull returns a data frame of True and False values, resulting from testing whether each column value is null. sum takes that data frame and sums the True values for each column, interpreting the True values as 1 and the False values as 0. We would have obtained the same result if we had used the following two steps:

>>> checknull = landtemps.isnull()
>>> checknull.sum()
There is no hard and fast rule for when to chain methods and when not to. I find it helpful to chain when what I am doing is conceptually a single step, even if it takes two or more steps mechanically. Chaining also has the side benefit of not creating extra objects that I might not need.
The dataset used in this recipe is just a sample from the full land temperatures database with almost 17 million records. You can run the larger file if your machine can handle it, with the following code:
>>> landtemps = pd.read_csv('data/landtemps.zip', compression='zip',
...     names=['stationid','year','month','avgtemp','latitude',
...       'longitude','elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)
read_csv can read a compressed ZIP file. We get it to do this by passing the name of the ZIP file and the type of compression.
See also
Subsequent recipes in this chapter, and in other chapters, set indexes to improve navigation over rows and merging.
A significant amount of reshaping of the Global Historical Climatology Network raw data was done before using it in this recipe. We demonstrate this in Chapter 8, Addressing Data Issues when Combining DataFrames. That recipe also shows how to read a text file that is not delimited, one that is fixed width, by using read_fwf.
Importing Excel files
The read_excel method of the pandas library can be used to import data from an Excel file and load it into memory as a pandas data frame. In this recipe, we import an Excel file and handle some common issues when working with Excel files: extraneous header and footer information, selecting specific columns, removing rows with no data, and connecting to particular sheets.
Despite the tabular structure of Excel, which invites the organization of data into rows and columns, spreadsheets are not datasets and do not require people to store data in that way. Even when some data conforms to those expectations, there is often additional information in rows or columns before or after the data to be imported. Data types are not always as clear as they are to the person who created the spreadsheet. This will be all too familiar to anyone who has ever battled with importing leading zeros. Moreover, Excel does not insist that all data in a column be of the same type, or that column headings be appropriate for use with a programming language such as Python.
Fortunately, read_excel has a number of options for handling messiness in Excel data. These options make it relatively easy to skip rows and select particular columns, and to pull data from a particular sheet or sheets.
Getting ready
You can download the GDPpercapita.xlsx file, as well as the code for this recipe, from the GitHub repository for this book. The code assumes that the Excel file is in a data subfolder. Here is a view of the beginning of the file:

Figure 1.1 – View of the dataset
And here is a view of the end of the file:

Figure 1.2 – View of the dataset
Note
This dataset, from the Organisation for Economic Co-operation and Development, is available for public use at https://stats.oecd.org/.
How to do it…
We import an Excel file into pandas and do some initial data cleaning:
- Import the pandas library:

>>> import pandas as pd
- Read the Excel per capita GDP data.
Select the sheet with the data we need, but skip the columns and rows that we do not want. Use the sheet_name parameter to specify the sheet. Set skiprows to 4 and skipfooter to 1 to skip the first four rows (the first row is hidden) and the last row. We provide values for usecols to get data from column A and columns C through T (column B is blank). Use head to view the first few rows:

>>> percapitaGDP = pd.read_excel("data/GDPpercapita.xlsx",
...    sheet_name="OECD.Stat export",
...    skiprows=4,
...    skipfooter=1,
...    usecols="A,C:T")
>>> percapitaGDP.head()
                       Year   2001  ...   2017   2018
0        Metropolitan areas    NaN  ...    NaN    NaN
1            AUS: Australia     ..  ...     ..     ..
2     AUS01: Greater Sydney  43313  ...  50578  49860
3  AUS02: Greater Melbourne  40125  ...  43025  42674
4   AUS03: Greater Brisbane  37580  ...  46876  46640
[5 rows x 19 columns]
- Use the info method of the data frame to view data types and the non-null count:

>>> percapitaGDP.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 19 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Year    702 non-null    object
 1   2001    701 non-null    object
 2   2002    701 non-null    object
 3   2003    701 non-null    object
 4   2004    701 non-null    object
 5   2005    701 non-null    object
 6   2006    701 non-null    object
 7   2007    701 non-null    object
 8   2008    701 non-null    object
 9   2009    701 non-null    object
 10  2010    701 non-null    object
 11  2011    701 non-null    object
 12  2012    701 non-null    object
 13  2013    701 non-null    object
 14  2014    701 non-null    object
 15  2015    701 non-null    object
 16  2016    701 non-null    object
 17  2017    701 non-null    object
 18  2018    701 non-null    object
dtypes: object(19)
memory usage: 104.3+ KB
- Rename the Year column to metro and remove the leading and trailing spaces.

Give an appropriate name to the metropolitan area column. There are extra spaces before the metro values in some cases, and extra spaces after the metro values in others. We can test for leading spaces with startswith(' ') and then use any to establish whether there are one or more occasions when the first character is blank. We can use endswith(' ') to examine trailing spaces. We use strip to remove both leading and trailing spaces:

>>> percapitaGDP.rename(columns={'Year':'metro'}, inplace=True)
>>> percapitaGDP.metro.str.startswith(' ').any()
True
>>> percapitaGDP.metro.str.endswith(' ').any()
True
>>> percapitaGDP.metro = percapitaGDP.metro.str.strip()
- Convert the data columns to numeric.
Iterate over all of the GDP year columns (2001-2018) and convert the data type from object to float. Coerce the conversion even when there is character data – the .. in this example. We want character values in those columns to become missing, which is what happens. Rename the year columns to better reflect the data in those columns:

>>> for col in percapitaGDP.columns[1:]:
...   percapitaGDP[col] = pd.to_numeric(percapitaGDP[col], errors='coerce')
...   percapitaGDP.rename(columns={col:'pcGDP'+col}, inplace=True)
...
>>> percapitaGDP.head()
                      metro  pcGDP2001  ...  pcGDP2017  pcGDP2018
0        Metropolitan areas        nan  ...        nan        nan
1            AUS: Australia        nan  ...        nan        nan
2     AUS01: Greater Sydney      43313  ...      50578      49860
3  AUS02: Greater Melbourne      40125  ...      43025      42674
4   AUS03: Greater Brisbane      37580  ...      46876      46640
>>> percapitaGDP.dtypes
metro         object
pcGDP2001    float64
pcGDP2002    float64
... abbreviated to save space ...
pcGDP2017    float64
pcGDP2018    float64
dtype: object
- Use the describe method to generate summary statistics for all numeric data in the data frame:

>>> percapitaGDP.describe()
       pcGDP2001  pcGDP2002  ...  pcGDP2017  pcGDP2018
count        424        440  ...        445        441
mean       41264      41015  ...      47489      48033
std        11878      12537  ...      15464      15720
min        10988      11435  ...       2745       2832
25%        33139      32636  ...      37316      37908
50%        39544      39684  ...      45385      46057
75%        47972      48611  ...      56023      56638
max        91488      93566  ...     122242     127468
[8 rows x 18 columns]
- Remove rows where all of the per capita GDP values are missing.
Use the subset parameter of dropna to inspect all columns, starting with the second column (it is zero-based) through the last column. Use how to specify that we want to drop rows only if all of the columns specified in subset are missing. Use shape to show the number of rows and columns in the resulting data frame:

>>> percapitaGDP.dropna(subset=percapitaGDP.columns[1:], how="all", inplace=True)
>>> percapitaGDP.describe()
       pcGDP2001  pcGDP2002  ...  pcGDP2017  pcGDP2018
count        424        440  ...        445        441
mean       41264      41015  ...      47489      48033
std        11878      12537  ...      15464      15720
min        10988      11435  ...       2745       2832
25%        33139      32636  ...      37316      37908
50%        39544      39684  ...      45385      46057
75%        47972      48611  ...      56023      56638
max        91488      93566  ...     122242     127468
[8 rows x 18 columns]
>>> percapitaGDP.head()
                      metro  pcGDP2001  ...  pcGDP2017  pcGDP2018
2     AUS01: Greater Sydney      43313  ...      50578      49860
3  AUS02: Greater Melbourne      40125  ...      43025      42674
4   AUS03: Greater Brisbane      37580  ...      46876      46640
5      AUS04: Greater Perth      45713  ...      66424      70390
6   AUS05: Greater Adelaide      36505  ...      40115      39924
[5 rows x 19 columns]
>>> percapitaGDP.shape
(480, 19)
- Set the index for the data frame using the metropolitan area column.
Confirm that there are 480 valid values for metro and that there are 480 unique values, before setting the index:

>>> percapitaGDP.metro.count()
480
>>> percapitaGDP.metro.nunique()
480
>>> percapitaGDP.set_index('metro', inplace=True)
>>> percapitaGDP.head()
                          pcGDP2001  pcGDP2002  ...  pcGDP2017  pcGDP2018
metro                                           ...
AUS01: Greater Sydney         43313      44008  ...      50578      49860
AUS02: Greater Melbourne      40125      40894  ...      43025      42674
AUS03: Greater Brisbane       37580      37564  ...      46876      46640
AUS04: Greater Perth          45713      47371  ...      66424      70390
AUS05: Greater Adelaide       36505      37194  ...      40115      39924
[5 rows x 18 columns]
>>> percapitaGDP.loc['AUS02: Greater Melbourne']
pcGDP2001   40125
pcGDP2002   40894
...
pcGDP2017   43025
pcGDP2018   42674
Name: AUS02: Greater Melbourne, dtype: float64
We have now imported the Excel data into a pandas data frame and cleaned up some of the messiness in the spreadsheet.
How it works…
We mostly manage to get the data we want in step 2 by skipping rows and columns we do not want, but there are still a number of issues: read_excel interprets all of the GDP data as character data, many rows are loaded with no useful data, and the column names do not represent the data well. In addition, the metropolitan area column might be useful as an index, but there are leading and trailing blanks and there may be missing or duplicated values.
read_excel interprets Year as the column name for the metropolitan area data because it looks for a header above the data for that Excel column and finds Year there. We rename that column metro in step 4. We also use strip to fix the problem with leading and trailing blanks. If there had only been leading blanks, we could have used lstrip, or rstrip if there had only been trailing blanks. It is a good idea to assume that there might be leading or trailing blanks in any character data and clean that data shortly after the initial import.
The spreadsheet authors used .. to represent missing data. Since this is actually valid character data, those columns get the object data type (how pandas treats columns with character or mixed data). We coerce a conversion to numeric in step 5. This also results in the original values of .. being replaced with NaN (not a number), pandas' value for missing numbers. This is what we want.
We can fix all of the per capita GDP columns with just a few lines because pandas makes it easy to iterate over the columns of a data frame. By specifying [1:], we iterate from the second column to the last column. We can then change those columns to numeric and rename them to something more appropriate.
There are several reasons why it is a good idea to clean up the column headings for the annual GDP columns: it helps us to remember what the data actually is; if we merge it with other data by metropolitan area, we will not have to worry about conflicting variable names; and we can use attribute access to work with pandas series based on those columns, which I will discuss in more detail in the There's more… section of this recipe.
describe in step 6 shows us that only between 420 and 480 rows have valid data for per capita GDP. When we drop all rows that have missing values for all per capita GDP columns in step 7, we end up with 480 rows in the data frame, which is what we expected.
There's more…
Once we have a pandas data frame, we have the ability to treat columns as more than just columns. We can use attribute access (such as percapitaGDP.metro) or bracket notation (percapitaGDP['metro']) to get the functionality of a pandas data series. Either method makes it possible to use data series string-inspecting methods such as str.startswith, and counting methods such as nunique. Note that the original column names of 20## did not allow for attribute access because they started with a number, so percapitaGDP.pcGDP2001.count() works, but percapitaGDP.2001.count() returns a syntax error because 2001 is not a valid Python identifier (since it starts with a number).
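As a quick illustration, assuming the percapitaGDP data frame from this recipe, both of the following return the same count; bracket notation is the safer habit when a column name might clash with a data frame attribute or is not a valid identifier:

```python
# Attribute access and bracket notation return the same pandas series.
print(percapitaGDP.pcGDP2001.count())
print(percapitaGDP['pcGDP2001'].count())
```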
Pandas is rich with features for string manipulation and for data series operations. We will try many of them out in subsequent recipes. This recipe showed those I find most useful when importing Excel data.
See also
There are good reasons to consider reshaping this data. Instead of 18 columns of GDP per capita data for each metropolitan area, we should have 18 rows of data for each metropolitan area, with columns for year and GDP per capita. Recipes for reshaping data can be found in Chapter 9, Tidying and Reshaping Data.
Importing data from SQL databases
In this recipe, we will use the pymssql and mysql APIs to read data from Microsoft SQL Server and MySQL (now owned by Oracle) databases, respectively. Data from sources such as these tends to be well structured since it is designed to facilitate simultaneous transactions by members of organizations, and those who interact with them. Each transaction is also likely related to some other organizational transaction.
This means that although data tables from enterprise systems are more reliably structured than data from CSV files and Excel files, their logic is less likely to be self-contained. You need to know how the data from one table relates to data from another table to understand its full meaning. These relationships need to be preserved, including the integrity of primary and foreign keys, when pulling data. Moreover, well-structured data tables are not necessarily uncomplicated data tables. There are often sophisticated coding schemes that determine data values, and these coding schemes can change over time. For example, codes for staff ethnicity at a retail store chain might be different in 1998 than they are in 2020. Similarly, frequently there are codes for missing values, such as 99999, that pandas will understand as valid values.
Since much of this logic is business logic, and implemented in stored procedures or other applications, it is lost when pulled out of this larger system. Some of what is lost will eventually have to be reconstructed when preparing data for analysis. This almost always involves combining data from multiple tables, so it is important to preserve the ability to do that. But it also may involve adding some of the coding logic back after loading the SQL table into a pandas data frame. We explore how to do that in this recipe.
Getting ready
This recipe assumes you have the pymssql and mysql APIs installed. If you do not, it is relatively straightforward to install them with pip. From the terminal, or PowerShell (in Windows), enter pip install pymssql or pip install mysql-connector-python.
Note
The dataset used in this recipe is available for public use at https://archive.ics.uci.edu/ml/machine-learning-databases/00320/.
How to do it...
We import SQL Server and MySQL data tables into a pandas data frame as follows:
- Import pandas, numpy, pymssql, and mysql.connector.

This step assumes that you have installed the pymssql and mysql APIs:

>>> import pandas as pd
>>> import numpy as np
>>> import pymssql
>>> import mysql.connector
- Use the pymssql API and read_sql to retrieve and load data from a SQL Server instance.

Select the columns we want from the SQL Server data and use SQL aliases to improve column names (for example, fedu AS fathereducation). Create a connection to the SQL Server data by passing database credentials to the pymssql connect function. Create a pandas data frame by passing the SELECT statement and connection object to read_sql. Close the connection to return it to the pool on the server:

>>> query = "SELECT studentid, school, sex, age, famsize,\
...   medu AS mothereducation, fedu AS fathereducation,\
...   traveltime, studytime, failures, famrel, freetime,\
...   goout, g1 AS gradeperiod1, g2 AS gradeperiod2,\
...   g3 AS gradeperiod3 From studentmath"
>>>
>>> server = "pdcc.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
>>> user = "pdccuser"
>>> password = "pdccpass"
>>> database = "pdcctest"
>>>
>>> conn = pymssql.connect(server=server,
...   user=user, password=password, database=database)
>>>
>>> studentmath = pd.read_sql(query,conn)
>>> conn.close()
- Check the data types and the first few rows:
>>> studentmath.dtypes
studentid          object
school             object
sex                object
age                 int64
famsize            object
mothereducation     int64
fathereducation     int64
traveltime          int64
studytime           int64
failures            int64
famrel              int64
freetime            int64
goout               int64
gradeperiod1        int64
gradeperiod2        int64
gradeperiod3        int64
dtype: object
>>> studentmath.head()
  studentid school  ...  gradeperiod2  gradeperiod3
0       001     GP  ...             6             6
1       002     GP  ...             5             6
2       003     GP  ...             8            10
3       004     GP  ...            14            15
4       005     GP  ...            10            10
[5 rows x 16 columns]
- (Alternative) Use the mysql connector and read_sql to get data from MySQL.

Create a connection to the mysql data and pass that connection to read_sql to retrieve the data and load it into a pandas data frame. (The same data file on student math scores was uploaded to SQL Server and MySQL, so we can reuse the query string we built in the previous step.):

>>> host = "pdccmysql.c9sqqzd5fulv.us-west-2.rds.amazonaws.com"
>>> user = "pdccuser"
>>> password = "pdccpass"
>>> database = "pdccschema"
>>> connmysql = mysql.connector.connect(host=host,
...   database=database,user=user,password=password)
>>> studentmath = pd.read_sql(query,connmysql)
>>> connmysql.close()
- Rearrange the columns, set an index, and check for missing values.
Move the grade data to the left of the data frame, just after studentid. Also move the freetime column to the right after traveltime and studytime. Confirm that each row has an ID and that the IDs are unique, and set studentid as the index:

>>> newcolorder = ['studentid', 'gradeperiod1', 'gradeperiod2',
...   'gradeperiod3', 'school', 'sex', 'age', 'famsize',
...   'mothereducation', 'fathereducation', 'traveltime',
...   'studytime', 'freetime', 'failures', 'famrel',
...   'goout']
>>> studentmath = studentmath[newcolorder]
>>> studentmath.studentid.count()
395
>>> studentmath.studentid.nunique()
395
>>> studentmath.set_index('studentid', inplace=True)
- Use the data frame's count function to check for missing values:

>>> studentmath.count()
gradeperiod1       395
gradeperiod2       395
gradeperiod3       395
school             395
sex                395
age                395
famsize            395
mothereducation    395
fathereducation    395
traveltime         395
studytime          395
freetime           395
failures           395
famrel             395
goout              395
dtype: int64
- Replace coded data values with more informative values.
Create a dictionary with the replacement values for the columns, and then use replace to set those values:

>>> setvalues={"famrel":{1:"1:very bad",2:"2:bad",3:"3:neutral",
...   4:"4:good",5:"5:excellent"},
...   "freetime":{1:"1:very low",2:"2:low",3:"3:neutral",
...   4:"4:high",5:"5:very high"},
...   "goout":{1:"1:very low",2:"2:low",3:"3:neutral",
...   4:"4:high",5:"5:very high"},
...   "mothereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
...   3:"3:secondary ed",4:"4:higher ed"},
...   "fathereducation":{0:np.nan,1:"1:k-4",2:"2:5-9",
...   3:"3:secondary ed",4:"4:higher ed"}}
>>> studentmath.replace(setvalues, inplace=True)
>>> setvalueskeys = [k for k in setvalues]
- Change the type for columns with the changed data to category.

Check for any changes in memory usage:

>>> studentmath[setvalueskeys].memory_usage(index=False)
famrel             3160
freetime           3160
goout              3160
mothereducation    3160
fathereducation    3160
dtype: int64
>>> for col in studentmath[setvalueskeys].columns:
...   studentmath[col] = studentmath[col].astype('category')
...
>>> studentmath[setvalueskeys].memory_usage(index=False)
famrel             595
freetime           595
goout              595
mothereducation    587
fathereducation    587
dtype: int64
- Calculate percentages for values in the famrel column.

Run value_counts and set normalize to True to generate percentages:

>>> studentmath['famrel'].value_counts(sort=False, normalize=True)
1:very bad     0.02
2:bad          0.05
3:neutral      0.17
4:good         0.49
5:excellent    0.27
Name: famrel, dtype: float64
- Use apply to calculate percentages for multiple columns:

>>> studentmath[['freetime','goout']].\
...   apply(pd.Series.value_counts, sort=False, normalize=True)
             freetime  goout
1:very low       0.05   0.06
2:low            0.16   0.26
3:neutral        0.40   0.33
4:high           0.29   0.22
5:very high      0.10   0.13
>>>
>>> studentmath[['mothereducation','fathereducation']].\
...   apply(pd.Series.value_counts, sort=False, normalize=True)
                mothereducation  fathereducation
1:k-4                      0.15             0.21
2:5-9                      0.26             0.29
3:secondary ed             0.25             0.25
4:higher ed                0.33             0.24
The preceding steps retrieved a data table from a SQL database, loaded that data into pandas, and did some initial data checking and cleaning.
How it works…
Since data from enterprise systems is typically better structured than CSV or Excel files, we do not need to do things such as skip rows or deal with different logical data types in a column. But some massaging is still usually required before we can begin exploratory analysis. There are often more columns than we need, and some column names are not intuitive or not ordered in the best way for analysis. The meaningfulness of many data values is not stored in the data table, to avoid entry errors and save on storage space. For example, 3 is stored for mother's education rather than secondary education. It is a good idea to reconstruct that coding as early in the cleaning process as possible.
To pull data from a SQL database server, we need a connection object to authenticate us on the server, and a SQL select string. These can be passed to read_sql to retrieve the data and load it into a pandas data frame. I usually use the SQL SELECT statement to do a bit of cleanup of column names at this point. I sometimes also reorder columns, but I do that later in this recipe.
We set the index in step 5, first confirming that every row has a value for studentid and that it is unique. This is often more important when working with enterprise data because we will almost always need to merge the retrieved data with other data files on the system. Although an index is not required for this merging, the discipline of setting one prepares us for the tricky business of merging data down the road. It will also likely improve the speed of the merge.
We use the data frame's count function to check for missing values and find none: the count of non-missing values is 395 (the number of rows) for every column. This is almost too good to be true. There may be values that are logically missing; that is, valid numbers that nonetheless connote missing values, such as -1, 0, 9, or 99. We address this possibility in the next step.
Step 7 demonstrates a useful technique for replacing data values for multiple columns. We create a dictionary to map original values to new values for each column, and then run it using replace. To reduce the amount of storage space taken up by the new verbose values, we convert the data type of those columns to category. We do this by generating a list of the keys of our setvalues dictionary – setvalueskeys = [k for k in setvalues] generates ['famrel', 'freetime', 'goout', 'mothereducation', 'fathereducation']. We then iterate over those five columns and use the astype method to change the data type to category. Notice that the memory usage for those columns is reduced substantially.
Finally, we check the assignment of new values by using value_counts to view relative frequencies. We use apply because we want to run value_counts on multiple columns. To avoid value_counts sorting by frequency, we set sort to False.
The data frame replace method is also a handy tool for dealing with logical missing values that will not be recognized as missing when retrieved by read_sql. 0 values for mothereducation and fathereducation seem to fall into that category. We fix this problem in the setvalues dictionary by indicating that 0 values for mothereducation and fathereducation should be replaced with NaN. It is important to address these kinds of missing values shortly after the initial import because they are not always obvious and can significantly impact all subsequent work.
Users of packages such as SPSS, SAS, and R will notice the difference between this approach and value labels in SPSS and R, and proc format in SAS. In pandas, we need to change the actual data to get more informative values. However, we reduce how much data is actually stored by giving the column a category data type, similar to factors in R.
There's more…
I moved the grade data to near the beginning of the data frame. I find it helpful to have potential target or dependent variables in the leftmost columns, to keep them at the forefront of my thinking. It is also helpful to keep similar columns together. In this example, personal demographic variables (sex, age) are next to one another, as are family variables (mothereducation, fathereducation), and how students spend their time (traveltime, studytime, and freetime).
You could have used map instead of replace in step 7. Prior to version 0.19.2 of pandas, map was significantly more efficient. Since then, the difference in efficiency has been much smaller. If you are working with a very large dataset, the difference may still be enough to consider using map.
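As an illustration, here is a minimal sketch of the map alternative for a single column, assuming studentmath as loaded in step 2 (that is, before the replace in step 7). Note that, unlike replace, map returns NaN for any value not covered by the dictionary, so the mapping needs to include every valid code:

```python
# Map the famrel codes to labels; codes missing from the dictionary become NaN.
famrelmap = {1: "1:very bad", 2: "2:bad", 3: "3:neutral",
             4: "4:good", 5: "5:excellent"}
studentmath['famrel'] = studentmath['famrel'].map(famrelmap)
print(studentmath['famrel'].value_counts(sort=False, normalize=True))
```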
See also
The recipes in Chapter 8, Addressing Data Issues when Combining DataFrames, go into detail on merging data. We will take a closer look at bivariate and multivariate relationships between variables in Chapter 4, Identifying Missing Values and Outliers in Subsets of Data. We demonstrate how to use some of these same approaches in packages such as SPSS, SAS, and R in subsequent recipes in this chapter.
Importing SPSS, Stata, and SAS data
We will use pyreadstat to read data from three popular statistical packages into pandas. The key advantage of pyreadstat is that it allows data analysts to import data from these packages without losing metadata, such as variable and value labels.
The SPSS, Stata, and SAS data files we receive often come to us with the data issues of CSV and Excel files and SQL databases having been resolved. We do not typically have the invalid column names, changes in data types, and unclear missing values that we can get with CSV or Excel files, nor do we usually get the detachment of data from business logic, such as the meaning of data codes, that we often get with SQL data. When someone or some organization shares a data file from one of these packages with us, they have often added variable labels and value labels for categorical data. For example, a hypothetical data column called presentsat has the variable label overall satisfaction with presentation and value labels 1-5, with 1 being not at all satisfied and 5 being highly satisfied.
The challenge is retaining that metadata when importing data from those systems into pandas. There is no precise equivalent to variable and value labels in pandas, and built-in tools for importing SPSS, Stata, and SAS data lose the metadata. In this recipe, we will use pyreadstat to load variable and value label information and use a couple of techniques for representing that information in pandas.
Getting ready
This recipe assumes you have installed the pyreadstat package. If it is not installed, you can install it with pip. From the terminal, or PowerShell (in Windows), enter pip install pyreadstat. You will need the SPSS, Stata, and SAS data files for this recipe to run the code.
We will work with data from the United States National Longitudinal Survey of Youth (NLS).
Note
The National Longitudinal Survey of Youth is conducted by the United States Bureau of Labor Statistics. This survey started in 1997 with a cohort of individuals born between 1980 and 1985, with annual follow-ups through 2017. For this recipe, I pulled 42 variables on grades, employment, income, and attitudes toward government from the hundreds of data items on the survey. Separate files for SPSS, Stata, and SAS can be downloaded from the repository. NLS data can be downloaded from https://www.nlsinfo.org/investigator/pages/search.
How to do it...
We will import data from SPSS, Stata, and SAS, retaining metadata such as value labels:
- Import pandas, numpy, and pyreadstat.

This step assumes that you have installed pyreadstat:

>>> import pandas as pd
>>> import numpy as np
>>> import pyreadstat
- Retrieve the SPSS data.
Pass a path and filename to the read_sav method of pyreadstat. Display the first few rows and a frequency distribution. Notice that the column names and value labels are non-descriptive, and that read_sav creates both a pandas data frame and a meta object:

>>> nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')
>>> nls97spss.dtypes
R0000100    float64
R0536300    float64
R0536401    float64
...
U2962900    float64
U2963000    float64
Z9063900    float64
dtype: object
>>> nls97spss.head()
   R0000100  R0536300  ...  U2963000  Z9063900
0         1         2  ...       nan        52
1         2         1  ...         6         0
2         3         2  ...         6         0
3         4         2  ...         6         4
4         5         1  ...         5        12
[5 rows x 42 columns]
>>> nls97spss['R0536300'].value_counts(normalize=True)
1.00   0.51
2.00   0.49
Name: R0536300, dtype: float64
- Grab the metadata to improve column labels and value labels.
The metaspss object created when we called read_sav has the column labels and the value labels from the SPSS file. Use the variable_value_labels dictionary to map values to value labels for one column (R0536300). (This does not change the data. It only improves our display when we run value_counts.) Use the set_value_labels method to actually apply the value labels to the data frame:

>>> metaspss.variable_value_labels['R0536300']
{0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}
>>> nls97spss['R0536300'].\
...   map(metaspss.variable_value_labels['R0536300']).\
...   value_counts(normalize=True)
Male     0.51
Female   0.49
Name: R0536300, dtype: float64
>>> nls97spss = pyreadstat.set_value_labels(nls97spss, metaspss, formats_as_category=True)
- Use column labels in the metadata to rename the columns.
To use the column labels from metaspss in our data frame, we can simply assign the column labels in metaspss to our data frame's column names. Clean up the column names a bit by changing them to lowercase, changing spaces to underscores, and removing all remaining non-alphanumeric characters:

>>> nls97spss.columns = metaspss.column_labels
>>> nls97spss['KEY!SEX (SYMBOL) 1997'].value_counts(normalize=True)
Male     0.51
Female   0.49
Name: KEY!SEX (SYMBOL) 1997, dtype: float64
>>> nls97spss.dtypes
PUBID - YTH ID CODE 1997             float64
KEY!SEX (SYMBOL) 1997               category
KEY!BDATE M/Y (SYMBOL) 1997          float64
KEY!BDATE M/Y (SYMBOL) 1997          float64
CV_SAMPLE_TYPE 1997                 category
KEY!RACE_ETHNICITY (SYMBOL) 1997    category
...
HRS/WK R WATCHES TELEVISION 2017    category
HRS/NIGHT R SLEEPS 2017              float64
CVC_WKSWK_YR_ALL L99                 float64
dtype: object
>>> nls97spss.columns = nls97spss.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '')
>>> nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
- Simplify the process by applying the value labels from the beginning.
The value labels can actually be applied in the initial call to read_sav by setting apply_value_formats to True. This eliminates the need to call the set_value_labels function later:

>>> nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav', apply_value_formats=True, formats_as_category=True)
>>> nls97spss.columns = metaspss.column_labels
>>> nls97spss.columns = nls97spss.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '')
- Show the columns and a few rows:
>>> nls97spss.dtypes
pubid__yth_id_code_1997     float64
keysex_symbol_1997         category
keybdate_my_symbol_1997     float64
keybdate_my_symbol_1997     float64
...
hrsnight_r_sleeps_2017      float64
cvc_wkswk_yr_all_l99        float64
dtype: object
>>> nls97spss.head()
   pubid__yth_id_code_1997 keysex_symbol_1997  ...  \
0                        1             Female  ...
1                        2               Male  ...
2                        3             Female  ...
3                        4             Female  ...
4                        5               Male  ...
   hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99
0                     nan                    52
1                       6                     0
2                       6                     0
3                       6                     4
4                       5                    12
[5 rows x 42 columns]
- Run frequencies on one of the columns and set the index:

>>> nls97spss.govt_responsibility__provide_jobs_2006.\
...   value_counts(sort=False)
Definitely should be        454
Definitely should not be    300
Probably should be          617
Probably should not be      462
Name: govt_responsibility__provide_jobs_2006, dtype: int64
>>> nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
- Import the Stata data, apply value labels, and improve the column headings.
Use the same methods for the Stata data that we use for the SPSS data:
>>> nls97stata, metastata = pyreadstat.read_dta('data/nls97.dta', apply_value_formats=True, formats_as_category=True)
>>> nls97stata.columns = metastata.column_labels
>>> nls97stata.columns = nls97stata.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '')
>>> nls97stata.dtypes
pubid__yth_id_code_1997     float64
keysex_symbol_1997         category
keybdate_my_symbol_1997     float64
keybdate_my_symbol_1997     float64
...
hrsnight_r_sleeps_2017      float64
cvc_wkswk_yr_all_l99        float64
dtype: object
- View a few rows of the data and run a frequency:

>>> nls97stata.head()
   pubid__yth_id_code_1997 keysex_symbol_1997  ...  \
0                        1             Female  ...
1                        2               Male  ...
2                        3             Female  ...
3                        4             Female  ...
4                        5               Male  ...
   hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99
0                      -5                    52
1                       6                     0
2                       6                     0
3                       6                     4
4                       5                    12
[5 rows x 42 columns]
>>> nls97stata.govt_responsibility__provide_jobs_2006.\
...   value_counts(sort=False)
-5.0                        1425
-4.0                        5665
-2.0                          56
-1.0                           5
Definitely should be         454
Definitely should not be     300
Probably should be           617
Probably should not be       462
Name: govt_responsibility__provide_jobs_2006, dtype: int64
- Fix the logical missing values that show up with the Stata data and set an index:
>>> nls97stata.min()
pubid__yth_id_code_1997          1
keysex_symbol_1997          Female
keybdate_my_symbol_1997          1
keybdate_my_symbol_1997      1,980
...
cv_bio_child_hh_2017            -5
cv_bio_child_nr_2017            -5
hrsnight_r_sleeps_2017          -5
cvc_wkswk_yr_all_l99            -4
dtype: object
>>> nls97stata.replace(list(range(-9,0)), np.nan, inplace=True)
>>> nls97stata.min()
pubid__yth_id_code_1997          1
keysex_symbol_1997          Female
keybdate_my_symbol_1997          1
keybdate_my_symbol_1997      1,980
...
cv_bio_child_hh_2017             0
cv_bio_child_nr_2017             0
hrsnight_r_sleeps_2017           0
cvc_wkswk_yr_all_l99             0
dtype: object
>>> nls97stata.set_index('pubid__yth_id_code_1997', inplace=True)
- Retrieve the SAS data, using the SAS catalog file for value labels:
The value labels for the SAS data are stored in a catalog file. Setting the catalog file path and filename retrieves the value labels and applies them:

>>> nls97sas, metasas = pyreadstat.read_sas7bdat('data/nls97.sas7bdat', catalog_file='data/nlsformats3.sas7bcat', formats_as_category=True)
>>> nls97sas.columns = metasas.column_labels
>>>
>>> nls97sas.columns = nls97sas.columns.\
...     str.lower().\
...     str.replace(' ','_').\
...     str.replace('[^a-z0-9_]', '')
>>>
>>> nls97sas.head()
   pubid__yth_id_code_1997 keysex_symbol_1997  ...  \
0                        1             Female  ...
1                        2               Male  ...
2                        3             Female  ...
3                        4             Female  ...
4                        5               Male  ...
   hrsnight_r_sleeps_2017  cvc_wkswk_yr_all_l99
0                     nan                    52
1                       6                     0
2                       6                     0
3                       6                     4
4                       5                    12
[5 rows x 42 columns]
>>> nls97sas.keysex_symbol_1997.value_counts()
Male      4599
Female    4385
Name: keysex_symbol_1997, dtype: int64
>>> nls97sas.set_index('pubid__yth_id_code_1997', inplace=True)
This demonstrates how to import SPSS, SAS, and Stata data without losing important metadata.
How it works...
The read_sav, read_dta, and read_sas7bdat methods of pyreadstat, for SPSS, Stata, and SAS data files, respectively, work in a similar manner. Value labels can be applied when reading in the data by setting apply_value_formats to True for SPSS and Stata files (steps 5 and 8), or by providing a catalog file path and filename for SAS (step 11). We can set formats_as_category to True to change the data type to category for those columns where the data values will change. The meta object has the column names and the column labels from the statistical package, so metadata column labels can be assigned to pandas data frame column names at any point (nls97spss.columns = metaspss.column_labels). We can even revert to the original column headings after assigning meta column labels to them by setting pandas column names to the metadata column names (nls97spss.columns = metaspss.column_names).
In step 3, we read the SPSS data without applying value labels. We looked at the dictionary for one variable (metaspss.variable_value_labels['R0536300']), but we could have viewed it for all variables (metaspss.variable_value_labels). When we are satisfied that the labels make sense, we can set them by calling the set_value_labels function. This is a good approach when you do not know the data well and want to inspect the labels before applying them.
The column labels from the meta object are often a better choice than the original column headings. Column headings can be quite cryptic, particularly when the SPSS, Stata, or SAS file is based on a large survey, as in this example. But the labels are not usually ideal for column headings either. They sometimes have spaces, capitalization that is not helpful, and non-alphanumeric characters. We chain some string operations to switch to lowercase, replace spaces with underscores, and remove non-alphanumeric characters.
Handling missing values is not always straightforward with these data files, since there are often many reasons why data is missing. If the file is from a survey, the missing value may be because of a survey skip pattern, or a respondent failed to respond, or the response was invalid, and so on. The NLS has 9 possible values for missing, from -1 to -9. The SPSS import automatically set those values to NaN, while the Stata import retained the original values. (We could have gotten the SPSS import to retain those values by setting user_missing to True.) For the Stata data, we need to tell it to replace all values from -1 to -9 with NaN. We do this by using the data frame's replace function and passing it a list of integers from -9 to -1 (list(range(-9,0))).
There's more…
You may have noticed similarities between this recipe and the previous one in terms of how value labels are set. The set_value_labels function is like the data frame replace operation we used to set value labels in that recipe. We passed a dictionary to replace that mapped columns to value labels. The set_value_labels function in this recipe essentially does the same thing, using the variable_value_labels property of the meta object as the dictionary.
Data from statistical packages is often not as well structured as SQL databases tend to be in one significant way. Since they are designed to facilitate analysis, they often violate database normalization rules. There is often an implied relational structure that might have to be unflattened at some point. For example, the data combines individual and event level data – person and hospital visits, brown bear and date emerged from hibernation. Often, this data will need to be reshaped for some aspects of the analysis.
See also
The pyreadstat package is nicely documented at https://github.com/Roche/pyreadstat. The package has many useful options for selecting columns and handling missing data that space did not permit me to demonstrate in this recipe.
Importing R data
We will use pyreadr to read an R data file into pandas. Since pyreadr cannot capture the metadata, we will write code to reconstruct value labels (analogous to R factors) and column headings. This is similar to what we did in the Importing data from SQL databases recipe.
The R statistical package is, in many ways, similar to the combination of Python and pandas, at least in its scope. Both have strong tools across a range of data preparation and data analysis tasks. Some data scientists work with both R and Python, perhaps doing data manipulation in Python and statistical analysis in R, or vice versa, depending on their preferred packages. But there is currently a scarcity of tools for reading data saved in R, as rds or rdata files, into Python. The analyst often saves the data as a CSV file first, and then loads the CSV file into Python. We will use pyreadr, from the same author as pyreadstat, because it does not require an installation of R.
When we receive an R file, or work with one we have created ourselves, we can count on it being fairly well structured, at least compared to CSV or Excel files. Each column will have only one data type, column headings will have appropriate names for Python variables, and all rows will have the same structure. However, we may need to restore some of the coding logic, as we did when working with SQL data.
Getting ready
This recipe assumes you have installed the pyreadr package. If it is not installed, you can install it with pip. From the terminal, or PowerShell (in Windows), enter pip install pyreadr. You will need the R rds file for this recipe in order to run the code.
We will again work with the National Longitudinal Survey in this recipe.
How to do it…
We will import data from R without losing important metadata:
- Load pandas, numpy, pprint, and the pyreadr package:

>>> import pandas as pd
>>> import numpy as np
>>> import pyreadr
>>> import pprint
- Get the R data.
Pass the path and filename to the read_r method to retrieve the R data and load it into memory as a pandas data frame. read_r can return one or more objects. When reading an rds file (as opposed to an rdata file), it will return one object, having the key None. We indicate None to get the pandas data frame:

>>> nls97r = pyreadr.read_r('data/nls97.rds')[None]
>>> nls97r.dtypes
R0000100    int32
R0536300    int32
...
U2962800    int32
U2962900    int32
U2963000    int32
Z9063900    int32
dtype: object
>>> nls97r.head(10)
   R0000100  R0536300  R0536401  ...  U2962900  U2963000  Z9063900
0         1         2         9  ...        -5        -5        52
1         2         1         7  ...         2         6         0
2         3         2         9  ...         2         6         0
3         4         2         2  ...         2         6         4
4         5         1        10  ...         2         5        12
5         6         2         1  ...         2         6         6
6         7         1         4  ...        -5        -5         0
7         8         2         6  ...        -5        -5        39
8         9         1        10  ...         2         4         0
9        10         1         3  ...         2         6         0
[10 rows x 42 columns]
- Set up dictionaries for value labels and column headings.
Load a dictionary that maps columns to the value labels and create a list of preferred column names as follows:
>>> with open('data/nlscodes.txt', 'r') as reader:
...     setvalues = eval(reader.read())
...
>>> pprint.pprint(setvalues)
{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},
 'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'},
 'S8646900': {1.0: '1. Definitely', 2.0: '2. Probably ',
              3.0: '3. Probably not', 4.0: '4. Definitely not'}}
...
>>> newcols = ['personid','gender','birthmonth','birthyear',
...   'sampletype',  'category','satverbal','satmath',
...   'gpaoverall','gpaeng','gpamath','gpascience','govjobs',
...   'govprices','govhealth','goveld','govind','govunemp',
...   'govinc','govcollege','govhousing','govenvironment',
...   'bacredits','coltype1','coltype2','coltype3','coltype4',
...   'coltype5','coltype6','highestgrade','maritalstatus',
...   'childnumhome','childnumaway','degreecol1',
...   'degreecol2','degreecol3','degreecol4','wageincome',
...   'weeklyhrscomputer','weeklyhrstv',
...   'nightlyhrssleep','weeksworkedlastyear']
- Set value labels and missing values, and change selected columns to category data type.
Use the setvalues dictionary to replace existing values with value labels. Replace all values from -9 to -1 with NaN:

>>> nls97r.replace(setvalues, inplace=True)
>>> nls97r.head()
   R0000100 R0536300  ...  U2963000  Z9063900
0         1   Female  ...        -5        52
1         2     Male  ...         6         0
2         3   Female  ...         6         0
3         4   Female  ...         6         4
4         5     Male  ...         5        12
[5 rows x 42 columns]
>>> nls97r.replace(list(range(-9,0)), np.nan, inplace=True)
>>> for col in nls97r[[k for k in setvalues]].columns:
...     nls97r[col] = nls97r[col].astype('category')
...
>>> nls97r.dtypes
R0000100       int64
R0536300    category
R0536401       int64
R0536402       int64
R1235800    category
...
U2857300    category
U2962800    category
U2962900    category
U2963000     float64
Z9063900     float64
Length: 42, dtype: object
- Set meaningful column headings:
>>> nls97r.columns = newcols
>>> nls97r.dtypes
personid                  int64
gender                 category
birthmonth                int64
birthyear                 int64
sampletype             category
...
wageincome             category
weeklyhrscomputer      category
weeklyhrstv            category
nightlyhrssleep         float64
weeksworkedlastyear     float64
Length: 42, dtype: object
This shows how R data files can be imported into pandas and value labels assigned.
How it works…
Reading R data into pandas with pyreadr is fairly straightforward. Passing a filename to the read_r function is all that is required. Since read_r can return multiple objects with one call, we need to specify which object. When reading an rds file (as opposed to an rdata file), only one object is returned. It has the key None.
In step 3, we load a dictionary that maps our variables to value labels, and a list for our preferred column headings. In step 4 we apply the value labels. We also change the data type to category for the columns where we applied the values. We do this by generating a list of the keys of our setvalues dictionary with [k for k in setvalues] and then iterating over those columns.
We change the column headings in step 5 to ones that are more intuitive. Note that the order matters here. We need to set the value labels before changing the column names, since the setvalues dictionary is based on the original column headings.
The main advantage of using pyreadr to read R files directly into pandas is that we do not have to convert the R data into a CSV file first. Once we have written our Python code to read the file, we can just rerun it whenever the R data changes. This is particularly helpful when we do not have R on the machine where we are working.
There's more…
pyreadr is able to return multiple data frames. This is useful when we save several data objects in R as an rdata file. We can return all of them with one call.
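A minimal sketch of that pattern, assuming a hypothetical nls97objects.rdata file containing more than one R data frame:

```python
import pyreadr

# read_r returns an ordered dictionary keyed by the names of the R objects.
results = pyreadr.read_r('data/nls97objects.rdata')

for name, df in results.items():
    print(name, df.shape)
```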
pprint is a handy tool for improving the display of Python dictionaries.
See also
Clear instructions and examples for pyreadr are available at https://github.com/ofajardo/pyreadr.
Feather files, a relatively new format, can be read by both R and Python. I discuss those files in the next recipe.
We could have used rpy2 instead of pyreadr to import R data. rpy2 requires that R also be installed, but it is more powerful than pyreadr. It will read R factors and automatically set them to pandas data frame values. See the following code:
>>> import rpy2.robjects as robjects
>>> from rpy2.robjects import pandas2ri
>>> pandas2ri.activate()
>>> readRDS = robjects.r['readRDS']
>>> nls97withvalues = readRDS('data/nls97withvalues.rds')
>>> nls97withvalues
      R0000100 R0536300 R0536401  ...                     U2962900     U2963000
1            1   Female        9  ...                          NaN  -2147483648
2            2     Male        7  ...         3 to 10 hours a week            6
3            3   Female        9  ...         3 to 10 hours a week            6
4            4   Female        2  ...         3 to 10 hours a week            6
5            5     Male       10  ...         3 to 10 hours a week            5
...        ...      ...      ...  ...                          ...          ...
8980      9018   Female        3  ...         3 to 10 hours a week            4
8981      9019     Male        9  ...         3 to 10 hours a week            6
8982      9020     Male        7  ...                          NaN  -2147483648
8983      9021     Male        7  ...         3 to 10 hours a week            7
8984      9022   Female        1  ...  Less than 2 hours per week             7

[8984 rows x 42 columns]
This generates unusual -2147483648 values. This is what happens when readRDS interprets missing data in numeric columns. A global replace of that number with NaN, after confirming that it is not a valid value, would be a good next step.
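A minimal sketch of that replacement (assuming, as noted, that -2147483648 never occurs as a legitimate value in this data):
>>> import numpy as np
>>> # replace the sentinel that readRDS used for missing numeric data with NaN
>>> nls97withvalues = nls97withvalues.replace(-2147483648, np.nan)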
Persisting tabular data
We persist data, copying it from memory to local or remote storage, for several reasons: to be able to access the data without having to repeat the steps we used to generate it; to share the data with others; or to make it available for use with different software. In this recipe, we save data that we have loaded into a pandas data frame as different file types (CSV, Excel, pickle, and feather).
Another important, but sometimes overlooked, reason to persist data is to preserve some segment of our data that needs to be examined more closely; perhaps it needs to be scrutinized by others before our analysis can be completed. For analysts who work with operational data in medium- to large-sized organizations, this process is part of the daily data cleaning workflow.
In addition to these reasons for persisting data, our decisions about when and how to serialize data are shaped by several other factors: where we are in terms of our data analysis projects, the hardware and software resources of the machine(s) saving and reloading the data, and the size of our dataset. Analysts end up having to be much more intentional when saving data than they are when pressing Ctrl + S in their word processing applications.
Once we persist data, it is stored separately from the logic that we used to create it. I find this to be one of the most important threats to the integrity of our analysis. Often, we end up loading data that we saved some time in the past (a week ago? a month ago? a year ago?) and forget how a variable was defined and how it relates to other variables. If we are in the middle of a data cleaning task, it is best not to persist our data, so long as our workstation and network can easily handle the burden of regenerating the data. It is a good idea to persist data only once we have reached milestones in our work.
Beyond the question of when to persist data, there is the question of how. If we are persisting it for our own reuse with the same software, it is best to save it in a binary format native to that software. That is pretty straightforward for tools such as SPSS, SAS, Stata, and R, but not so much for pandas. But that is good news in a way. We have lots of choices, from CSV and Excel to pickle and feather. We save to all these file types in this recipe.
Getting ready
You will need to install feather if you do not have it on your system. You can do that by entering pip install pyarrow in a terminal window or PowerShell (in Windows). If you do not already have a subfolder named views in your chapter 1 folder, you will need to create it in order to run the code for this recipe.
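If you would rather create that subfolder from Python than by hand, a minimal sketch (run from your chapter 1 folder) looks like this:
>>> import os
>>> # create the views subfolder used by this recipe if it is not already there
>>> os.makedirs('views', exist_ok=True)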
Note
This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-4. This is just a 100,000-row sample of the full dataset, which is also available in the repository.
How to do it…
We will load a CSV file into pandas and then save it as a pickle file and as a feather file. We will also save subsets of the data in CSV and Excel formats:
- Import pandas and pyarrow and adjust the display.
Pyarrow needs to be imported in order to save pandas data frames to feather:
>>> import pandas as pd
>>> import pyarrow
>>> # format floats with a thousands separator and two decimal places, matching the output in step 5
>>> pd.options.display.float_format = '{:,.2f}'.format
- Load the land temperatures CSV file into pandas, drop rows with missing data, and set an index:
>>> landtemps = pd.read_csv('data/landtempssample.csv',
...     names=['stationid','year','month','avgtemp','latitude',
...       'longitude','elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)
>>>
>>> landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
>>> landtemps.dropna(subset=['avgtemp'], inplace=True)
>>> landtemps.dtypes
measuredate    datetime64[ns]
stationid              object
avgtemp               float64
latitude              float64
longitude             float64
elevation             float64
station                object
countryid              object
country                object
dtype: object
>>> landtemps.set_index(['measuredate','stationid'], inplace=True)
- Write extreme values for temperature to CSV and Excel files.
Use the quantile method to select outlier rows, those at the 1-in-1,000 level at each end of the distribution:
>>> extremevals = landtemps[(landtemps.avgtemp < landtemps.avgtemp.quantile(.001)) |
...   (landtemps.avgtemp > landtemps.avgtemp.quantile(.999))]
>>> extremevals.shape
(171, 7)
>>> extremevals.sample(7)
                         avgtemp  ... country
measuredate stationid             ...
2013-08-01  QAM00041170    35.30  ...   Qatar
2005-01-01  RSM00024966   -40.09  ...  Russia
1973-03-01  CA002401200   -40.26  ...  Canada
2007-06-01  KU000405820    37.35  ...  Kuwait
1987-07-01  SUM00062700    35.50  ...   Sudan
1998-02-01  RSM00025325   -35.71  ...  Russia
1968-12-01  RSM00024329   -43.20  ...  Russia

[7 rows x 7 columns]
>>> extremevals.to_excel('views/tempext.xlsx')
>>> extremevals.to_csv('views/tempext.csv')
- Save to pickle and feather files.
The index needs to be reset in order to save a feather file:
>>> landtemps.to_pickle('data/landtemps.pkl')
>>> landtemps.reset_index(inplace=True)
>>> landtemps.to_feather("data/landtemps.ftr")
- Load the pickle and feather files we just saved.
Notice that our index was preserved when saving and loading the pickle file:
>>> landtemps = pd.read_pickle('data/landtemps.pkl')
>>> landtemps.head(2).T
measuredate    2000-04-01   1940-05-01
stationid     USS0010K01S  CI000085406
avgtemp              5.27        18.04
latitude            39.90       -18.35
longitude         -110.75       -70.33
elevation        2,773.70        58.00
station     INDIAN_CANYON        ARICA
countryid              US           CI
country     United States        Chile
>>> landtemps = pd.read_feather("data/landtemps.ftr")
>>> landtemps.head(2).T
                               0                    1
measuredate  2000-04-01 00:00:00  1940-05-01 00:00:00
stationid            USS0010K01S          CI000085406
avgtemp                     5.27                18.04
latitude                   39.90               -18.35
longitude                -110.75               -70.33
elevation               2,773.70                58.00
station            INDIAN_CANYON                ARICA
countryid                     US                   CI
country            United States                Chile
The previous steps demonstrate how to serialize pandas data frames using two different formats, pickle and feather.
How it works…
Persisting pandas data is fairly straightforward. Data frames have to_csv, to_excel, to_pickle, and to_feather methods. Pickling preserves our index.
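A quick way to confirm that last point, sketched here with a hypothetical data/landtempscheck.pkl path, is to round-trip the data frame through pickle and compare the indexes (this assumes landtemps still has the measuredate and stationid index set in step 2, that is, before the reset_index call in step 4):
>>> import pandas as pd
>>> # write and immediately reload the data frame, then compare the indexes
>>> landtemps.to_pickle('data/landtempscheck.pkl')
>>> reloaded = pd.read_pickle('data/landtempscheck.pkl')
>>> reloaded.index.equals(landtemps.index)   # should be True: the MultiIndex survived the round trip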
There's more…
The advantage of storing data in CSV files is that saving it uses up very little additional memory. The disadvantage is that writing CSV files is quite slow and we lose important metadata, such as data types. (read_csv can often figure out the data type when we reload the file, but not always.) Pickle files keep that data, but can burden a system that is low on resources when serializing. Feather is easier on resources, and can be easily loaded in R as well as Python, but we have to sacrifice our index in order to serialize. Also, the authors of feather make no promises regarding long-term support.
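One way to see that metadata loss, using the views/tempext.csv file written in step 3: the dates come back as plain strings unless we ask read_csv to parse them again (a sketch, not output from the recipe):
>>> import pandas as pd
>>> # without parse_dates, measuredate is read back as object (strings)
>>> tempext = pd.read_csv('views/tempext.csv')
>>> tempext.measuredate.dtype
>>> # asking read_csv to parse the column restores the datetime64 type
>>> tempext = pd.read_csv('views/tempext.csv', parse_dates=['measuredate'])
>>> tempext.measuredate.dtype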
You may have noticed that I do not make a recommendation about what to use for data serialization – other than to limit your persistence of full datasets to project milestones. This is definitely one of those "right tools for the right job" kind of situations. I use CSV or Excel files when I want to share a segment of a file with colleagues for discussion. I use feather for ongoing Python projects, particularly when I am using a machine with sub-par RAM and an outdated chip, and I am also using R. When I am wrapping up a project, I pickle the data frames.