Importing R data
We will use pyreadr to read an R data file into pandas. Since pyreadr cannot capture the metadata, we will write code to reconstruct value labels (analogous to R factors) and column headings. This is similar to what we did in the Importing data from SQL databases recipe.
The R statistical package is, in many ways, similar to the combination of Python and pandas, at least in its scope. Both have strong tools across a range of data preparation and data analysis tasks. Some data scientists work with both R and Python, perhaps doing data manipulation in Python and statistical analysis in R, or vice versa, depending on their preferred packages. However, there is currently a scarcity of tools for reading data saved in R, as rds or rdata files, into Python. The analyst often saves the data as a CSV file first and then loads it into Python. We will use pyreadr, from the same author as pyreadstat, because it does not require an installation of R.
When we receive an R file, or work with one we have created ourselves, we can count on it being fairly well structured, at least compared to CSV or Excel files. Each column will have only one data type, column headings will have appropriate names for Python variables, and all rows will have the same structure. However, we may need to restore some of the coding logic, as we did when working with SQL data.
Getting ready
This recipe assumes you have installed the pyreadr package. If it is not installed, you can install it with pip. From the terminal, or PowerShell (in Windows), enter pip install pyreadr.
We will again work with the NLS in this recipe. You will need to download the rds file used in this recipe from the GitHub repository in order to run the code.
How to do it…
We will import data from R without losing important metadata:
- Load pandas, numpy, pprint, and the pyreadr package:
import pandas as pd
import numpy as np
import pyreadr
import pprint
- Get the R data.
Pass the path and filename to the read_r function to retrieve the R data, and load it into memory as a pandas DataFrame. read_r can return one or more objects. When reading an rds file (as opposed to an rdata file), it returns a single object stored under the key None, so we index the result with None to get the pandas DataFrame:
nls97r = pyreadr.read_r('data/nls97.rds')[None]
nls97r.dtypes
R0000100    int32
R0536300    int32
...
U2962800    int32
U2962900    int32
U2963000    int32
Z9063900    int32
dtype: object
nls97r.head(10)
   R0000100  R0536300  ...  U2963000  Z9063900
0         1         2  ...        -5        52
1         2         1  ...         6         0
2         3         2  ...         6         0
3         4         2  ...         6         4
4         5         1  ...         5        12
5         6         2  ...         6         6
6         7         1  ...        -5         0
7         8         2  ...        -5        39
8         9         1  ...         4         0
9        10         1  ...         6         0
[10 rows x 42 columns]
- Set up dictionaries for value labels and column headings.
Load a dictionary that maps columns to the value labels and create a list of preferred column names as follows:
# eval executes whatever the file contains, so use it only on files you trust
with open('data/nlscodes.txt', 'r') as reader:
    setvalues = eval(reader.read())

pprint.pprint(setvalues)
{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},
 'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'},
 'S8646900': {1.0: '1. Definitely',
              2.0: '2. Probably ',
              3.0: '3. Probably not',
              4.0: '4. Definitely not'}}
...abbreviated to save space
newcols = ['personid','gender','birthmonth',
    'birthyear','sampletype','category',
    'satverbal','satmath','gpaoverall',
    'gpaeng','gpamath','gpascience','govjobs',
    'govprices','govhealth','goveld','govind',
    'govunemp','govinc','govcollege',
    'govhousing','govenvironment','bacredits',
    'coltype1','coltype2','coltype3','coltype4',
    'coltype5','coltype6','highestgrade',
    'maritalstatus','childnumhome','childnumaway',
    'degreecol1','degreecol2','degreecol3',
    'degreecol4','wageincome','weeklyhrscomputer',
    'weeklyhrstv','nightlyhrssleep',
    'weeksworkedlastyear']
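Because nlscodes.txt holds a plain Python dictionary literal, ast.literal_eval from the standard library is a possible alternative to eval. It parses literals only and refuses to execute code. The sketch below is not the recipe's own method. It parses an in-memory string (codes_text, a made-up stand-in for the file contents) so that it runs without the data file:

```python
import ast
import pprint

# Hypothetical stand-in for the text stored in data/nlscodes.txt;
# the real file maps many more columns to their value labels.
codes_text = (
    "{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},\n"
    " 'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'}}"
)

# literal_eval accepts only Python literals (dicts, lists, strings,
# numbers, and so on), so it cannot run code embedded in the text.
setvalues = ast.literal_eval(codes_text)
pprint.pprint(setvalues)

# Anything that is not a plain literal is rejected with an exception.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```

With the real file, the same call would take reader.read() as its argument, assuming the file contains nothing but the dictionary literal.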
- Set value labels and missing values, and change selected columns to the category data type.
Use the setvalues dictionary to replace existing values with value labels. Replace all values from –9 to –1 with NaN:
nls97r.replace(setvalues, inplace=True)
nls97r.head()
   R0000100  R0536300  ...  U2963000  Z9063900
0         1    Female  ...        -5        52
1         2      Male  ...         6         0
2         3    Female  ...         6         0
3         4    Female  ...         6         4
4         5      Male  ...         5        12
[5 rows x 42 columns]
nls97r.replace(list(range(-9,0)), np.nan, inplace=True)
for col in nls97r[[k for k in setvalues]].columns:
    nls97r[col] = nls97r[col].astype('category')

nls97r.dtypes
R0000100       int64
R0536300    category
R0536401       int64
R0536402       int64
R1235800    category
              ...
U2857300    category
U2962800    category
U2962900    category
U2963000     float64
Z9063900     float64
Length: 42, dtype: object
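The effect of the missing-value replacement is easier to see on a toy DataFrame. The sketch below uses made-up values rather than the NLS data, and the names toy and cleaned are illustrative. It suggests why a column such as U2963000 is reported as float64 in the output above: NaN is a floating-point value, so an integer column that receives it is converted to float.

```python
import numpy as np
import pandas as pd

# Made-up values; the negative code stands for a missing response
toy = pd.DataFrame({
    "U2963000": [-5, 6, 6, 5],
    "Z9063900": [52, 0, 4, 12],
})

# Replace every code from -9 through -1 with NaN
cleaned = toy.replace(list(range(-9, 0)), np.nan)

# Inspect the resulting types and the number of missing values
print(cleaned.dtypes)
print(cleaned["U2963000"].isna().sum())
```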
- Set meaningful column headings:
nls97r.columns = newcols
nls97r.dtypes
personid                  int64
gender                 category
birthmonth                int64
birthyear                 int64
sampletype             category
                         ...
wageincome             category
weeklyhrscomputer      category
weeklyhrstv            category
nightlyhrssleep         float64
weeksworkedlastyear     float64
Length: 42, dtype: object
This shows how R data files can be imported into pandas and value labels assigned.
How it works…
Reading R data into pandas with pyreadr is fairly straightforward. Passing a filename to the read_r function is all that is required. Since read_r can return multiple objects with one call, we need to specify which object. When reading an rds file (as opposed to an rdata file), only one object is returned. It has the key None.
In Step 3, we loaded a dictionary that maps our variables to value labels, and a list for our preferred column headings. In Step 4 we applied the value labels. We also changed the data type to category for the columns where we applied the values. We did this by generating a list of the keys in our setvalues dictionary with [k for k in setvalues] and then iterating over those columns.
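The labeling logic can be tried on a small scale. The following is a minimal sketch with made-up rows that reuse two of the original column names. The names toy and labeled are illustrative, and the loop iterates over the dictionary directly, which yields the same keys as [k for k in setvalues]:

```python
import pandas as pd

# Made-up rows that reuse two of the original NLS column names
toy = pd.DataFrame({
    "R0000100": [1, 2, 3],
    "R0536300": [2, 1, 2],
})

# Same nested structure as setvalues: {column: {code: label}}
setvalues = {
    "R0536300": {0.0: "No Information", 1.0: "Male", 2.0: "Female"},
}

# A nested dictionary limits each replacement to the named column,
# so the 1 and 2 in R0000100 are left untouched.
labeled = toy.replace(setvalues)

# Iterating over a dictionary yields its keys, i.e. the labeled columns
for col in setvalues:
    labeled[col] = labeled[col].astype("category")

print(labeled)
print(labeled.dtypes)
```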
We change the column headings in Step 5 to ones that are more intuitive. Note that the order matters here. We need to set the value labels before changing the column names, since the setvalues dictionary is based on the original column headings.
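Assigning a list to columns pairs names purely by position. One possible variation, sketched below on a made-up three-column DataFrame, is to build an explicit mapping from old to new headings and rename with it, once the value labels are in place. This is not the recipe's method, and the names toy, colmap, and renamed are illustrative. The mapping can be printed and checked before it is applied.

```python
import pandas as pd

# Made-up rows that reuse three of the original NLS column names
toy = pd.DataFrame({
    "R0000100": [1, 2, 3],
    "R0536300": [2, 1, 2],
    "R1235800": [1, 1, 0],
})
setvalues = {
    "R0536300": {0.0: "No Information", 1.0: "Male", 2.0: "Female"},
    "R1235800": {0.0: "Oversample", 1.0: "Cross-sectional"},
}
newcols = ["personid", "gender", "sampletype"]

# Step 1: apply labels while the original headings are still in place
labeled = toy.replace(setvalues)

# Step 2: pair each original heading with its new name, then rename
if len(newcols) != len(labeled.columns):
    raise ValueError("newcols must have one name per column")
colmap = dict(zip(labeled.columns, newcols))
renamed = labeled.rename(columns=colmap)

print(colmap)
print(renamed.columns.tolist())
```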
The main advantage of using pyreadr to read R files directly into pandas is that we do not have to convert the R data into a CSV file first. Once we have written our Python code to read the file, we can just rerun it whenever the R data changes. This is particularly helpful when we do not have R on the machine where we work.
There’s more…
The pyreadr package can return multiple DataFrames. This is useful when we save several data objects in R as an rdata file. We can return all of them with one call.
The pprint module is a handy tool for improving the display of Python dictionaries.
We could have used rpy2 instead of pyreadr to import R data. rpy2 requires that R also be installed, but it is more powerful than pyreadr. It will read R factors and automatically set them to pandas DataFrame values. See the following code:
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
readRDS = robjects.r['readRDS']
nls97withvalues = readRDS('data/nls97withvalues.rds')
nls97withvalues
              R0000100      R0536300     ...    U2963000         Z9063900
1         1             Female       ...    -2147483648      52
2         2             Male         ...    6                0
3         3             Female       ...    6                0
4         4             Female       ...    6                4
5         5             Male         ...    5                12
...       ...           ...          ...    ...             ...
8980     9018           Female       ...    4                49
8981     9019           Male         ...    6                0
8982     9020           Male         ...    -2147483648      15
8983     9021           Male         ...    7                50
8984     9022           Female       ...    7                20
[8984 rows x 42 columns]
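This generates unusual -2147483648 values. That number is the smallest 32-bit integer, which R uses internally to mark missing values in integer columns, so it is how missing data came through when readRDS passed those columns to pandas. A global replacement of that number with NaN, after confirming that it is not a valid value in any column, would be a good next step.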
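That next step might look like the following. This is a minimal sketch on made-up rows shaped like the output above, not on the real nls97withvalues data. The names R_NA_INTEGER, toy, sentinel_counts, and cleaned are illustrative. Counting the sentinel per column first gives a chance to confirm that it only marks missing data.

```python
import numpy as np
import pandas as pd

# The smallest 32-bit integer, which R uses for NA in integer vectors
R_NA_INTEGER = -2147483648

# Made-up rows shaped like the rpy2 output above
toy = pd.DataFrame({
    "U2963000": [R_NA_INTEGER, 6, 6, 5],
    "Z9063900": [52, 0, 0, 12],
})

# First see where the sentinel occurs, to confirm it only marks
# missing data and is not a legitimate value in any column
sentinel_counts = (toy == R_NA_INTEGER).sum()
print(sentinel_counts)

# Then replace it everywhere with NaN
cleaned = toy.replace(R_NA_INTEGER, np.nan)
print(cleaned)
```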
See also
Clear instructions and examples for pyreadr are available at https://github.com/ofajardo/pyreadr.
Feather files, a relatively new format, can be read by both R and Python. I discuss those files in the next recipe.