Importing R data
We will use pyreadr to read an R data file into pandas. Since pyreadr cannot capture the metadata, we will write code to reconstruct value labels (analogous to R factors) and column headings. This is similar to what we did in the Importing data from SQL databases recipe.
The R statistical package is, in many ways, similar to the combination of Python and pandas, at least in its scope. Both have strong tools across a range of data preparation and data analysis tasks. Some data scientists work with both R and Python, perhaps doing data manipulation in Python and statistical analysis in R, or vice versa, depending on their preferred packages. However, there is currently a scarcity of tools for reading data saved in R, as rds or rdata files, into Python. The analyst often saves the data as a CSV file first and then loads it into Python. We will use pyreadr, from the same author as pyreadstat, because it does not require an installation of R.
When we receive an R file, or work with one we have created ourselves, we can count on it being fairly well structured, at least compared to CSV or Excel files. Each column will have only one data type, column headings will have appropriate names for Python variables, and all rows will have the same structure. However, we may need to restore some of the coding logic, as we did when working with SQL data.
Getting ready
This recipe assumes you have installed the pyreadr package. If it is not installed, you can install it with pip. From the terminal, or PowerShell (in Windows), enter pip install pyreadr.
We will again work with the NLS in this recipe. You will need to download the rds file used in this recipe from the GitHub repository in order to run the code.
How to do it…
We will import data from R without losing important metadata:
- Load pandas, numpy, pprint, and the pyreadr package:
import pandas as pd
import numpy as np
import pyreadr
import pprint
- Get the R data.
Pass the path and filename to the read_r function to retrieve the R data and load it into memory as a pandas DataFrame. read_r can return one or more objects. When reading an rds file (as opposed to an rdata file), it returns one object, stored under the key None. We index the result with None to get the pandas DataFrame:
nls97r = pyreadr.read_r('data/nls97.rds')[None]
nls97r.dtypes
R0000100 int32
R0536300 int32
...
U2962800 int32
U2962900 int32
U2963000 int32
Z9063900 int32
dtype: object
nls97r.head(10)
R0000100 R0536300 ... U2963000 Z9063900
0 1 2 ... -5 52
1 2 1 ... 6 0
2 3 2 ... 6 0
3 4 2 ... 6 4
4 5 1 ... 5 12
5 6 2 ... 6 6
6 7 1 ... -5 0
7 8 2 ... -5 39
8 9 1 ... 4 0
9 10 1 ... 6 0
[10 rows x 42 columns]
- Set up dictionaries for value labels and column headings.
Load a dictionary that maps columns to the value labels and create a list of preferred column names as follows:
with open('data/nlscodes.txt', 'r') as reader:
    setvalues = eval(reader.read())
pprint.pprint(setvalues)
{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},
'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'},
'S8646900': {1.0: '1. Definitely',
2.0: '2. Probably ',
3.0: '3. Probably not',
4.0: '4. Definitely not'}}
...abbreviated to save space
newcols = ['personid','gender','birthmonth',
  'birthyear','sampletype','category',
  'satverbal','satmath','gpaoverall',
  'gpaeng','gpamath','gpascience','govjobs',
  'govprices','govhealth','goveld','govind',
  'govunemp','govinc','govcollege',
  'govhousing','govenvironment','bacredits',
  'coltype1','coltype2','coltype3','coltype4',
  'coltype5','coltype6','highestgrade',
  'maritalstatus','childnumhome','childnumaway',
  'degreecol1','degreecol2','degreecol3',
  'degreecol4','wageincome','weeklyhrscomputer',
  'weeklyhrstv','nightlyhrssleep',
  'weeksworkedlastyear']
- Set value labels and missing values, and change selected columns to the category data type.
Use the setvalues dictionary to replace existing values with value labels. Replace all values from –9 to –1 with NaN:
nls97r.replace(setvalues, inplace=True)
nls97r.head()
R0000100 R0536300 ... U2963000 Z9063900
0 1 Female ... -5 52
1 2 Male ... 6 0
2 3 Female ... 6 0
3 4 Female ... 6 4
4 5 Male ... 5 12
[5 rows x 42 columns]
nls97r.replace(list(range(-9,0)), np.nan, inplace=True)
for col in nls97r[[k for k in setvalues]].columns:
    nls97r[col] = nls97r[col].astype('category')
nls97r.dtypes
R0000100 int64
R0536300 category
R0536401 int64
R0536402 int64
R1235800 category
...
U2857300 category
U2962800 category
U2962900 category
U2963000 float64
Z9063900 float64
Length: 42, dtype: object
- Set meaningful column headings:
nls97r.columns = newcols
nls97r.dtypes
personid int64
gender category
birthmonth int64
birthyear int64
sampletype category
...
wageincome category
weeklyhrscomputer category
weeklyhrstv category
nightlyhrssleep float64
weeksworkedlastyear float64
Length: 42, dtype: object
This shows how R data files can be imported into pandas and value labels assigned.
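Step 3 reads the dictionary with eval, which executes whatever the file contains. When the file holds only a Python literal, as the pprint output above suggests, ast.literal_eval from the standard library may be a safer way to parse it. A minimal sketch, assuming the file contents look like the abbreviated output shown earlier; the inline string is a stand-in for reader.read(), not the actual file:

```python
import ast

# Stand-in for reader.read(); in the recipe this text comes from data/nlscodes.txt.
codes_text = """
{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},
 'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'}}
"""

# literal_eval accepts only literals (dicts, lists, numbers, strings, ...),
# so anything else in the file raises an error instead of being executed.
setvalues = ast.literal_eval(codes_text.strip())
print(setvalues['R0536300'][2.0])
```

The resulting dictionary can be passed to replace in the same way as the one produced by eval.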
How it works…
Reading R data into pandas with pyreadr is fairly straightforward. Passing a filename to the read_r function is all that is required. Since read_r can return multiple objects with one call, we need to specify which object. When reading an rds file (as opposed to an rdata file), only one object is returned. It has the key None.
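To make the key lookup concrete, here is a small sketch of picking one object out of a read_r result. It does not call pyreadr itself: the two OrderedDict objects are hypothetical stand-ins for what read_r returns for an rds file and an rdata file, and select_frame is an illustrative helper, not part of pyreadr.

```python
from collections import OrderedDict

import pandas as pd


def select_frame(result, name=None):
    """Return one DataFrame from the dict-like object that read_r returns."""
    if name not in result:
        raise KeyError(f"{name!r} not found; available keys: {list(result.keys())}")
    return result[name]


# Stand-ins for read_r results (hypothetical data, not the NLS file).
# In the recipe: result = pyreadr.read_r('data/nls97.rds')
rds_result = OrderedDict({None: pd.DataFrame({"R0000100": [1, 2]})})
rdata_result = OrderedDict(
    [("df1", pd.DataFrame({"a": [1]})), ("df2", pd.DataFrame({"b": [2, 3]}))]
)

print(select_frame(rds_result).shape)           # rds: the single object under None
print(select_frame(rdata_result, "df2").shape)  # rdata: pick an object by name
```

Listing the available keys in the error message is a design choice that helps when an rdata file turns out to hold objects under names you did not expect.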
In Step 3, we loaded a dictionary that maps our variables to value labels, and a list of our preferred column headings. In Step 4, we applied the value labels. We also changed the data type to category for the columns where we applied the labels. We did this by generating a list of the keys in our setvalues dictionary with [k for k in setvalues] and then iterating over those columns.
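The same sequence can be tried on a small, made-up DataFrame. The sketch below uses three of the NLS column codes with invented values, and returns new objects rather than using inplace=True so that each intermediate result can be inspected; it is an illustration, not the recipe's data.

```python
import numpy as np
import pandas as pd

# Toy data with invented values; the negative numbers stand for missing codes.
df = pd.DataFrame({
    "R0000100": [1, 2, 3, 4],
    "R0536300": [2, 1, 2, -4],
    "Z9063900": [52, 0, -5, 4],
})
setvalues = {"R0536300": {0: "No Information", 1: "Male", 2: "Female"}}

# Apply value labels column by column, then turn -9..-1 into NaN everywhere.
labeled = df.replace(setvalues)
labeled = labeled.replace(list(range(-9, 0)), np.nan)

# Convert only the labeled columns to the category data type.
for col in setvalues:
    labeled[col] = labeled[col].astype("category")

print(labeled)
print(labeled.dtypes)
```

Columns that receive a NaN become float64, which is why the dtypes output in Step 4 shows a mix of int64, float64, and category.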
We change the column headings in Step 5 to ones that are more intuitive. Note that the order matters here. We need to set the value labels before changing the column names, since the setvalues dictionary is based on the original column headings.
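If renaming first is ever more convenient, one option is to re-key the value-label dictionary so that it follows the new names. A minimal sketch in plain Python, using a three-column subset; oldcols, colmap, and setvalues_renamed are names introduced here for illustration:

```python
# Subset of the original and preferred column names, in matching order.
oldcols = ["R0000100", "R0536300", "R1235800"]
newcols = ["personid", "gender", "sampletype"]

setvalues = {
    "R0536300": {0.0: "No Information", 1.0: "Male", 2.0: "Female"},
    "R1235800": {0.0: "Oversample", 1.0: "Cross-sectional"},
}

# Guard against a mismatch before pairing the two lists by position.
assert len(oldcols) == len(newcols)
colmap = dict(zip(oldcols, newcols))

# Re-key the labels so they can be applied after the columns are renamed.
setvalues_renamed = {colmap[old]: labels for old, labels in setvalues.items()}
print(setvalues_renamed)
```

The length check matters because assigning a list to the columns attribute pairs names purely by position.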
The main advantage of using pyreadr to read R files directly into pandas is that we do not have to convert the R data into a CSV file first. Once we have written our Python code to read the file, we can just rerun it whenever the R data changes. This is particularly helpful when we do not have R on the machine where we work.
There’s more…
pyreadr is able to return multiple DataFrames. This is useful when we save several data objects in R as an rdata file. We can return all of them with one call.
pprint is a handy tool for improving the display of Python dictionaries.
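As a small illustration of what pprint adds, the sketch below formats a nested dictionary shaped like setvalues. The width argument is a standard pprint parameter; the value 40 is an arbitrary choice for this example.

```python
import pprint

setvalues = {
    'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},
    'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'},
}

print(setvalues)  # plain print: the whole dictionary on one line

# pformat returns the string that pprint.pprint would print.
formatted = pprint.pformat(setvalues, width=40)
print(formatted)
```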
We could have used rpy2 instead of pyreadr to import R data. rpy2 requires that R also be installed, but it is more powerful than pyreadr. It will read R factors and automatically set them to pandas DataFrame values. See the following code:
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
readRDS = robjects.r['readRDS']
nls97withvalues = readRDS('data/nls97withvalues.rds')
nls97withvalues
R0000100 R0536300 ... U2963000 Z9063900
1 1 Female ... -2147483648 52
2 2 Male ... 6 0
3 3 Female ... 6 0
4 4 Female ... 6 4
5 5 Male ... 5 12
... ... ... ... ... ...
8980 9018 Female ... 4 49
8981 9019 Male ... 6 0
8982 9020 Male ... -2147483648 15
8983 9021 Male ... 7 50
8984 9022 Female ... 7 20
[8984 rows x 42 columns]
This generates unusual –2147483648 values. They appear where readRDS encountered missing data in integer columns: R stores an integer NA as the smallest 32-bit integer, and that sentinel comes through as a literal number. A global replacement of that number with NaN, after confirming that it is not a valid value in any column, would be a good next step.
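That next step might look like the following sketch. It uses a tiny made-up DataFrame rather than nls97withvalues, and R_NA_INTEGER is a name introduced here for readability.

```python
import numpy as np
import pandas as pd

R_NA_INTEGER = -2147483648  # smallest 32-bit integer, R's integer NA sentinel

# Toy data with invented values, mimicking two columns from the output above.
df = pd.DataFrame({
    "U2963000": [R_NA_INTEGER, 6, 6],
    "Z9063900": [52, 0, 4],
})

# First check where the sentinel occurs, to confirm it only marks missing data.
sentinel_counts = (df == R_NA_INTEGER).sum()
print(sentinel_counts)

# Then replace it with NaN across the whole DataFrame.
cleaned = df.replace(R_NA_INTEGER, np.nan)
print(cleaned)
```

Counting the sentinel per column before replacing it gives a quick sanity check that the value is confined to columns where missing data is expected.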
See also
Clear instructions and examples for pyreadr are available at https://github.com/ofajardo/pyreadr.
Feather files, a relatively new format, can be read by both R and Python. I discuss those files in the next recipe.