R vs Pandas

Femi Anthony

February 2016

This article focuses on comparing pandas with R, the statistical package on which much of pandas' functionality is modeled. It is intended as a guide for R users who wish to use pandas, and for users who wish to replicate in pandas functionality that they have seen in R code. It focuses on some key features available to R users and shows how to achieve similar functionality in pandas using some illustrative examples. This article assumes that you have the R statistical package installed. If not, it can be downloaded and installed from here: http://www.r-project.org/.

By the end of the article, data analysis users should have a good grasp of how pandas' data analysis capabilities compare with R's, enabling them to transition to or use pandas should they need to. The various topics addressed in this article include the following:

  • R data types and their pandas equivalents
  • Slicing and selection
  • Arithmetic operations on datatype columns
  • Aggregation and GroupBy
  • Matching
  • Split-apply-combine
  • Melting and reshaping
  • Factors and categorical data


R data types

R has five primitive or atomic types:

  • Character
  • Numeric
  • Integer
  • Complex
  • Logical/Boolean

It also has the following, more complex, container types:

  • Vector: This is similar to numpy.array and can only contain objects of the same type.
  • List: This is a heterogeneous container. Its closest equivalent in pandas is the Series.
  • DataFrame: This is a heterogeneous 2D container, equivalent to a pandas DataFrame.
  • Matrix: This is a homogeneous 2D version of a vector, similar to numpy.matrix.

For this article, we will focus on the list and the DataFrame, whose pandas equivalents are the Series and the DataFrame, respectively.

For more information on R data types, refer to the following document at: http://www.statmethods.net/input/datatypes.html.

For Numpy data types, refer to the following documents at: http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html and http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html.
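
As a quick orientation, here is a minimal Python sketch (assuming only that numpy and pandas are installed; the variable names are illustrative) mapping each R container type to its closest counterpart:

import numpy as np
import pandas as pd

# R vector -> numpy array: homogeneous, one-dimensional
r_vector_equiv = np.array([1.5, 2.0, 3.5])

# R list -> pandas Series: heterogeneous, labeled, one-dimensional
r_list_equiv = pd.Series([23, 'donkey', 5.6, True])

# R matrix -> 2D numpy array (or numpy.matrix): homogeneous, two-dimensional
r_matrix_equiv = np.arange(2, 14).reshape(4, 3)

# R data.frame -> pandas DataFrame: heterogeneous columns, two-dimensional
r_dataframe_equiv = pd.DataFrame({'Symbol': ['GOOG', 'AMZN'],
                                  'Price': [518.7, 307.82]})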

R lists

An R list can be created explicitly using the list() constructor, as shown here:

>h_lst<- list(23,'donkey',5.6,1+4i,TRUE)

>h_lst

[[1]]

[1] 23

 

[[2]]

[1] "donkey"

 

[[3]]

[1] 5.6

 

[[4]]

[1] 1+4i

 

[[5]]

[1] TRUE

 

>typeof(h_lst)

[1] "list"

Here is its pandas equivalent: we create a Python list and then construct a Series from it:

In [8]: h_list=[23, 'donkey', 5.6,1+4j, True]

In [9]: import pandas as pd

        h_ser=pd.Series(h_list)

In [10]: h_ser

Out[10]: 0        23

         1    donkey

         2       5.6

         3    (1+4j)

         4      True

dtype: object

Array indexing starts from 0 in pandas, as opposed to R, where it starts at 1. We can confirm the type of the resulting object as follows:

In [11]: type(h_ser)

Out[11]: pandas.core.series.Series
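
To see the zero-based indexing in action, here is a small sketch that continues from the h_ser series defined above (the prompt numbers are illustrative); position 0 in pandas corresponds to position 1 in R:

In [12]: h_ser[0]   # first element; the R equivalent is h_lst[[1]]
Out[12]: 23

In [13]: h_ser[4]   # fifth and last element; the R equivalent is h_lst[[5]]
Out[13]: True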

R DataFrames

We can construct an R DataFrame by calling the data.frame() constructor and then display it as follows:

>stocks_table<- data.frame(Symbol=c('GOOG','AMZN','FB','AAPL',
                                    'TWTR','NFLX','LINKD'),
                           Price=c(518.7,307.82,74.9,109.7,37.1,
                                   334.48,219.9),
                           MarketCap=c(352.8,142.29,216.98,643.55,23.54,20.15,27.31))

 

>stocks_table

  Symbol  Price MarketCap

1   GOOG 518.70    352.80

2   AMZN 307.82    142.29

3     FB  74.90    216.98

4   AAPL 109.70    643.55

5   TWTR  37.10     23.54

6   NFLX 334.48     20.15

7  LINKD 219.90     27.31

Here, we construct a pandas DataFrame and display it:

In [29]: stocks_df=pd.DataFrame({'Symbol':['GOOG','AMZN','FB','AAPL',
                                           'TWTR','NFLX','LNKD'],
                                 'Price':[518.7,307.82,74.9,109.7,37.1,
                                          334.48,219.9],
                                 'MarketCap($B)':[352.8,142.29,216.98,643.55,
                                                  23.54,20.15,27.31]
                                })
         stocks_df=stocks_df.reindex_axis(sorted(stocks_df.columns,reverse=True),axis=1)
         stocks_df

Out[29]:

        Symbol  Price   MarketCap($B)

0       GOOG    518.70  352.80

1       AMZN    307.82  142.29

2       FB      74.90   216.98

3       AAPL    109.70  643.55

4       TWTR    37.10   23.54

5       NFLX    334.48  20.15

6       LNKD    219.90  27.31
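
Note that reindex_axis has been deprecated and later removed in more recent pandas releases; a sketch of the same column reordering that works across versions is to simply index with the sorted column list:

stocks_df = stocks_df[sorted(stocks_df.columns, reverse=True)]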

Slicing and selection

In R, we slice objects in the following three ways:

  • [: This always returns an object of the same type as the original and can be used to select more than one element.
  • [[: This is used to extract elements of a list or DataFrame. It can only be used to extract a single element, and the type of the returned element will not necessarily be a list or DataFrame.
  • $: This is used to extract elements of a list or DataFrame by name and is similar to [[.

Here are some slicing examples in R and their equivalents in pandas:

R-matrix and Numpy array compared

Let's see matrix creation and selection in R:

>r_mat<- matrix(2:13,4,3)

>r_mat

     [,1] [,2] [,3]

[1,]    2    6   10

[2,]    3    7   11

[3,]    4    8   12

[4,]    5    9   13

To select the first row, we write:

>r_mat[1,]

[1]  2  6 10

To select the second column, we use the following command:

>r_mat[,2]

[1] 6 7 8 9

Let's now see Numpy array creation and selection:

In [60]: a=np.array(range(2,6))

         b=np.array(range(6,10))

         c=np.array(range(10,14))

In [66]: np_ar=np.column_stack([a,b,c])

np_ar

Out[66]: array([[ 2,  6, 10],

[ 3,  7, 11],

[ 4,  8, 12],

[ 5,  9, 13]])

To select the first row, write the following command:

In [79]: np_ar[0,]

Out[79]: array([ 2,  6, 10])

Indexing is different in R and pandas/Numpy.

In R, indexing starts at 1, while in pandas/Numpy, it starts at 0. Hence, we have to subtract 1 from all indexes when making the translation from R to pandas/Numpy.
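
As a quick sketch of this off-by-one translation, the element in row 4, column 3 of the matrix is addressed as follows in each environment (using the r_mat and np_ar objects created above; the prompt numbers are illustrative):

>r_mat[4,3]
[1] 13

In [82]: np_ar[3,2]   # subtract 1 from each R index
Out[82]: 13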

To select the second column, write the following command:

In [81]: np_ar[:,1]

Out[81]: array([6, 7, 8, 9])

Another option is to transpose the array first and then select the column, as follows:

In [80]: np_ar.T[1,]

Out[80]: array([6, 7, 8, 9])

R lists and pandas series compared

Here is an example of list creation and selection in R:

>cal_lst<- list(weekdays=1:8, mth='jan')

>cal_lst

$weekdays

[1] 1 2 3 4 5 6 7 8

 

$mth

[1] "jan"

 

>cal_lst[1]

$weekdays

[1] 1 2 3 4 5 6 7 8

 

>cal_lst[[1]]

[1] 1 2 3 4 5 6 7 8

 

>cal_lst[2]

$mth

[1] "jan"

Series creation and selection in pandas is done as follows:

In [92]: cal_df= pd.Series({'weekdays':range(1,8), 'mth':'jan'})

In [93]: cal_df

Out[93]: mth                          jan

weekdays    [1, 2, 3, 4, 5, 6, 7]

dtype: object

 

In [97]: cal_df[0]

Out[97]: 'jan'

 

In [95]: cal_df[1]

Out[95]: [1, 2, 3, 4, 5, 6, 7]

 

In [96]: cal_df[[1]]

Out[96]: weekdays    [1, 2, 3, 4, 5, 6, 7]

dtype: object

Here, we see a difference between an R list and a pandas Series with respect to the [] and [[]] operators. The difference becomes clear when we consider the second item, which is a character string.

In the case of R, the [] operator produces a container type, that is, a list containing the string, while [[]] produces an atomic type, in this case a character, as follows:

>typeof(cal_lst[2])

[1] "list"

>typeof(cal_lst[[2]])

[1] "character"

In the case of pandas, the opposite is true: [] produces the atomic type, while [[]] results in a complex type, that is, a series as follows:

In [99]: type(cal_df[0])

Out[99]: str

 

In [101]: type(cal_df[[0]])

Out[101]: pandas.core.series.Series

In both R and pandas, the column name can be specified in order to obtain an element.

Specifying column name in R

In R, this can be done with the column name preceded by the $ operator as follows:

>cal_lst$mth

[1] "jan"

> cal_lst$'mth'

[1] "jan"

Specifying column name in pandas

In pandas, we subset elements in the usual way with the column name in square brackets:

In [111]: cal_df['mth']

Out[111]: 'jan'

One area where R and pandas differ is in the subsetting of nested elements. For example, to obtain day 4 from weekdays, we have to use the [[]] operator in R:

>cal_lst[[1]][[4]]

[1] 4

 

>cal_lst[[c(1,4)]]

[1] 4

However, in the case of pandas, we can just use a double []:

In [132]: cal_df[1][3]

Out[132]: 4
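
Chained positional indexing like this works, but it relies on the alphabetical ordering of the series index. A slightly more explicit sketch is to select the weekdays entry by label first and then index into it by position:

In [133]: cal_df['weekdays'][3]
Out[133]: 4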

R DataFrames versus pandas DataFrames

Selecting data in R DataFrames and pandas DataFrames follows a similar script. The following section explains how we perform multi-column selection in both.

Multi-column selection in R

In R, we specify the multiple columns to select by stating them in a vector within square brackets:

>stocks_table[c('Symbol','Price')]

Symbol  Price

1   GOOG 518.70

2   AMZN 307.82

3     FB  74.90

4   AAPL 109.70

5   TWTR  37.10

6   NFLX 334.48

7  LINKD 219.90

 

>stocks_table[,c('Symbol','Price')]

Symbol  Price

1   GOOG 518.70

2   AMZN 307.82

3     FB  74.90

4   AAPL 109.70

5   TWTR  37.10

6   NFLX 334.48

7  LINKD 219.90

Multi-column selection in pandas

In pandas, we subset elements in the usual way with the column names in square brackets:

In [140]: stocks_df[['Symbol','Price']]

Out[140]: Symbol  Price

0        GOOG   518.70

1        AMZN   307.82

2        FB     74.90

3        AAPL   109.70

4        TWTR   37.10

5        NFLX   334.48

6        LNKD   219.90

 

In [145]: stocks_df.loc[:,['Symbol','Price']]

Out[145]: Symbol  Price

0         GOOG    518.70

1         AMZN    307.82

2         FB      74.90

3         AAPL    109.70

4         TWTR    37.10

5         NFLX    334.48

6         LNKD    219.90

Arithmetic operations on columns

In R and pandas, we can apply arithmetic operations to data columns in a similar manner. Hence, we can perform arithmetic operations such as addition or subtraction on elements in corresponding positions in two or more DataFrames.

Here, we construct a DataFrame in R with columns labeled x and y, and subtract column y from column x:

>norm_df<- data.frame(x=rnorm(7,0,1), y=rnorm(7,0,1))

>norm_df$x - norm_df$y

[1] -1.3870730  2.4681458 -4.6991395  0.2978311 -0.8492245  1.5851009 -1.4620324

The with() function in R achieves the same effect:

>with(norm_df,x-y)

[1] -1.3870730  2.4681458 -4.6991395  0.2978311 -0.8492245  1.5851009 -1.4620324

In pandas, the same arithmetic operations on columns can be performed directly; the equivalent of with() is the eval() method:

In [10]: import pandas as pd
         import numpy as np
         df = pd.DataFrame({'x': np.random.normal(0,1,size=7),
                            'y': np.random.normal(0,1,size=7)})

 

In [11]: df.x-df.y

Out[11]: 0   -0.107313

         1    0.617513

         2   -1.517827

         3    0.565804

         4   -1.630534

         5    0.101900

         6    0.775186

dtype: float64

 

In [12]: df.eval('x-y')

Out[12]: 0   -0.107313

         1    0.617513

         2   -1.517827

         3    0.565804

         4   -1.630534

         5    0.101900

         6    0.775186

dtype: float64
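
If we want to keep the result, either expression can be assigned back as a new column; here is a minimal sketch (the column name x_minus_y is illustrative):

In [13]: df['x_minus_y'] = df.eval('x-y')   # same result as df['x'] - df['y']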

Aggregation and GroupBy

Sometimes, we may wish to split data into subsets and apply a function such as the mean, max, or min to each subset. In R, we can do this via the aggregate or tapply functions.

Here, we will use the example of a dataset of statistics on the top five strikers of the four clubs that made it to the semi-final of the European Champions League Football tournament in 2014. We will use it to illustrate aggregation in R and its equivalent GroupBy functionality in pandas.

Aggregation in R

In R, aggregation is done using the following commands:

> goal_stats=read.csv('champ_league_stats_semifinalists.csv')

>goal_stats

              Club                 Player Goals GamesPlayed

1  Atletico Madrid            Diego Costa     8           9

2  Atletico Madrid             Arda Turan     4           9

3  Atletico Madrid            Raúl García     4          12

4  Atletico Madrid           Adrián López     2           9

5  Atletico Madrid            Diego Godín     2          10

6      Real Madrid      Cristiano Ronaldo    17          11

7      Real Madrid            Gareth Bale     6          12

8      Real Madrid          Karim Benzema     5          11

9      Real Madrid                   Isco     3          12

10     Real Madrid         Ángel Di María     3          11

11   Bayern Munich          Thomas Müller     5          12

12   Bayern Munich           Arjen Robben     4          10

13   Bayern Munich            Mario Götze     3          11

14   Bayern Munich Bastian Schweinsteiger     3           8

15   Bayern Munich        Mario Mandžukić     3          10

16         Chelsea        Fernando Torres     4           9

17         Chelsea               Demba Ba     3           6

18         Chelsea           Samuel Eto'o     3           9

19         Chelsea            Eden Hazard     2           9

20         Chelsea                Ramires     2          10

We can now compute the goals per game ratio for each striker, to measure their deadliness in front of a goal:

>goal_stats$GoalsPerGame<- goal_stats$Goals/goal_stats$GamesPlayed

>goal_stats

              Club   Player         Goals GamesPlayed GoalsPerGame

1  Atletico Madrid  Diego Costa     8           9    0.8888889

2  Atletico Madrid  Arda Turan     4           9    0.4444444

3  Atletico Madrid  Raúl García    4          12    0.3333333

4  Atletico Madrid  Adrián López   2           9    0.2222222

5  Atletico Madrid  Diego Godín     2          10    0.2000000

6  Real Madrid  Cristiano Ronaldo  17          11    1.5454545

7  Real Madrid  Gareth Bale         6          12    0.5000000

8  Real Madrid    Karim Benzema     5          11    0.4545455

9  Real Madrid       Isco           3          12    0.2500000

10 Real Madrid  Ángel Di María     3          11    0.2727273

11 Bayern Munich Thomas Müller     5          12    0.4166667

12 Bayern Munich  Arjen Robben    4          10    0.4000000

13 Bayern Munich  Mario Götze     3          11    0.2727273

14 Bayern Munich Bastian Schweinsteiger 3      8    0.3750000

15 Bayern Munich  Mario Mandžukić 3          10    0.3000000

16 Chelsea       Fernando Torres   4           9    0.4444444

17 Chelsea           Demba Ba      3           6    0.5000000

18 Chelsea           Samuel Eto'o  3           9    0.3333333

19 Chelsea            Eden Hazard  2           9    0.2222222

20 Chelsea                Ramires  2          10    0.2000000

Let's suppose that we wanted to know the highest goals per game ratio for each team. We would calculate this as follows:

>aggregate(x=goal_stats[,c('GoalsPerGame')], by=list(goal_stats$Club),FUN=max)

          Group.1         x

1 Atletico Madrid 0.8888889

2   Bayern Munich 0.4166667

3         Chelsea 0.5000000

4     Real Madrid 1.5454545

The tapply function applies a function to subsets of an array or vector defined by one or more grouping columns. It can be used as follows:

>tapply(goal_stats$GoalsPerGame,goal_stats$Club,max)

Atletico Madrid   Bayern Munich         Chelsea     Real Madrid

      0.8888889       0.4166667       0.5000000       1.5454545

The pandas GroupBy operator

In pandas, we can achieve the same result by using the GroupBy function:

In [6]: import pandas as pd

        import numpy as np

In [7]: goal_stats_df=pd.read_csv('champ_league_stats_semifinalists.csv')

 

In [27]: goal_stats_df['GoalsPerGame']= goal_stats_df['Goals']/goal_stats_df['GamesPlayed']

 

In [28]: goal_stats_df

Out[28]: Club           Player      Goals GamesPlayed GoalsPerGame

0       Atletico Madrid Diego Costa   8       9        0.888889

1       Atletico Madrid Arda Turan    4       9         0.444444

2       Atletico Madrid Raúl García   4       12        0.333333

3       Atletico Madrid Adrián López  2       9         0.222222

4       Atletico Madrid Diego Godín   2       10        0.200000

5       Real Madrid  Cristiano Ronaldo 17      11        1.545455

6       Real Madrid     Gareth Bale   6       12        0.500000

7       Real Madrid     Karim Benzema 5       11        0.454545

8       Real Madrid     Isco          3       12        0.250000

9       Real Madrid     Ángel Di María 3      11        0.272727

10      Bayern Munich   Thomas Müller  5       12        0.416667

11      Bayern Munich   Arjen Robben   4       10        0.400000

12      Bayern Munich   Mario Götze    3       11        0.272727

13      Bayern Munich  Bastian Schweinsteiger 3   8     0.375000

14      Bayern Munich  Mario Mandžukić 3       10        0.300000

15      Chelsea        Fernando Torres  4       9         0.444444

16      Chelsea        Demba Ba         3       6         0.500000

17      Chelsea        Samuel Eto'o     3       9         0.333333

18      Chelsea        Eden Hazard      2       9         0.222222

19      Chelsea        Ramires          2       10        0.200000

 

In [30]: grouped = goal_stats_df.groupby('Club')

 

In [17]: grouped['GoalsPerGame'].aggregate(np.max)

Out[17]: Club

         Atletico Madrid    0.888889

         Bayern Munich      0.416667

         Chelsea            0.500000

         Real Madrid        1.545455

         Name: GoalsPerGame, dtype: float64

 

In [22]: grouped['GoalsPerGame'].apply(np.max)

 

Out[22]: Club

         Atletico Madrid    0.888889

         Bayern Munich      0.416667

         Chelsea            0.500000

         Real Madrid        1.545455

         Name: GoalsPerGame, dtype: float64
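
As a shorthand sketch, the grouping and aggregation above can also be collapsed into a single chained expression, which gives the same Series:

In [23]: goal_stats_df.groupby('Club')['GoalsPerGame'].max()
Out[23]: Club
         Atletico Madrid    0.888889
         Bayern Munich      0.416667
         Chelsea            0.500000
         Real Madrid        1.545455
         Name: GoalsPerGame, dtype: float64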

Comparing matching operators in R and pandas

Here, we will demonstrate the equivalence of matching operators between R (%in%) and pandas (isin()). In both cases, a logical vector or Series (pandas) is produced, indicating the positions at which matches were found.

R %in% operator

Here, we will demonstrate the use of the %in% operator in R:

>stock_symbols=stocks_table$Symbol

>stock_symbols

[1] GOOG  AMZN  FB  AAPL  TWTR  NFLX  LINKD

Levels: AAPL AMZN FB GOOG LINKD NFLX TWTR

 

>stock_symbols %in% c('GOOG','NFLX')

[1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE

The pandas isin() function

Here is an example of using the pandas isin() function:

In [11]: stock_symbols=stocks_df.Symbol

stock_symbols

Out[11]: 0    GOOG

         1    AMZN

         2      FB

         3    AAPL

         4    TWTR

         5    NFLX

         6    LNKD

         Name: Symbol, dtype: object

In [10]: stock_symbols.isin(['GOOG','NFLX'])

Out[10]: 0     True

         1    False

         2    False

         3    False

         4    False

         5     True

         6    False

         Name: Symbol, dtype: bool
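
In both languages, the usual next step is to use the resulting logical vector as a mask to filter the rows themselves; in R this would be stocks_table[stock_symbols %in% c('GOOG','NFLX'),], and here is a sketch of the pandas equivalent:

In [12]: stocks_df[stocks_df.Symbol.isin(['GOOG','NFLX'])]
Out[12]:   Symbol   Price  MarketCap($B)
         0   GOOG  518.70         352.80
         5   NFLX  334.48          20.15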

Logical subsetting

In R as well as in pandas, there is more than one way to perform logical subsetting. Suppose that we wish to display all players with a goals-per-game ratio of at least 0.5; that is, players who average at least one goal every two games.

Logical subsetting in R

Here's how we can do this in R:

  • Via a logical slice:
    >goal_stats[goal_stats$GoalsPerGame>=0.5,]
    
       Club            Player        Goals GamesPlayed GoalsPerGame
    
    1  Atletico Madrid Diego Costa     8           9    0.8888889
    
    6  Real Madrid Cristiano Ronaldo  17          11    1.5454545
    
    7  Real Madrid       Gareth Bale   6          12    0.5000000
    
    17 Chelsea          Demba Ba     3           6    0.5000000
  • Via the subset() function:
    >subset(goal_stats,GoalsPerGame>=0.5)
    
       Club            Player      Goals GamesPlayed GoalsPerGame
    
    1  Atletico Madrid Diego Costa    8           9    0.8888889
    
    6  Real Madrid Cristiano Ronaldo 17          11    1.5454545
    
    7  Real Madrid     Gareth Bale    6          12    0.5000000
    
    17 Chelsea          Demba Ba     3           6    0.5000000

Logical subsetting in pandas

In pandas, we also do something similar:

  • Logical slicing:
    In [33]: goal_stats_df[goal_stats_df['GoalsPerGame']>=0.5]
    
    Out[33]:     Club        Player            Goals GamesPlayed GoalsPerGame
    
    0    Atletico Madrid Diego Costa     8     9          0.888889
    
    5    Real Madrid   Cristiano Ronaldo 17    11         1.545455
    
    6    Real Madrid     Gareth Bale     6     12         0.500000
    
    16   Chelsea         Demba Ba        3     6           0.500000
  • DataFrame.query() operator:
    In [36]:  goal_stats_df.query('GoalsPerGame>= 0.5')
    
    Out[36]:
    
    Club              Player   Goals GamesPlayed GoalsPerGame
    
    0    Atletico Madrid Diego Costa   8     9            0.888889
    
    5    Real Madrid  Cristiano Ronaldo 17    11           1.545455
    
    6    Real Madrid     Gareth Bale    6     12           0.500000

    16   Chelsea         Demba Ba       3     6            0.500000
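
When more than one condition is involved, the boolean expressions can be combined: in R with & inside the slice or subset(), and in pandas with a parenthesized & between the masks (or a compound expression inside query()). Here is a minimal sketch:

In [37]: goal_stats_df[(goal_stats_df['GoalsPerGame']>=0.5) &
                       (goal_stats_df['Club']=='Real Madrid')]

Out[37]:        Club             Player  Goals GamesPlayed GoalsPerGame
         5  Real Madrid  Cristiano Ronaldo   17          11     1.545455
         6  Real Madrid        Gareth Bale    6          12     0.500000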

Split-apply-combine

R has a library called plyr for split-apply-combine data analysis. The plyr library has a function called ddply, which can be used to apply a function to a subset of a DataFrame and then combine the results into another DataFrame.

For more information on ddply, you can refer to the following: http://www.inside-r.org/packages/cran/plyr/docs/ddply

To illustrate, let us consider a subset of a recently created dataset in R, which contains data on flights departing NYC in 2013: http://cran.r-project.org/web/packages/nycflights13/index.html.

Implementation in R

Here, we will install the package in R and load the library:

>install.packages('nycflights13')

...

 

>library('nycflights13')

>dim(flights)

[1] 336776     16

 

>head(flights,3)

year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight

1 2013     1   1      517         2      830        11      UA  N14228   1545

2 2013     1   1      533         4      850        20      UA  N24211   1714

3 2013     1   1      542         2      923        33      AA  N619AA   1141

origin dest air_time distance hour minute

1    EWR  IAH      227     1400    5     17

2    LGA  IAH      227     1416    5     33

3    JFK  MIA      160     1089    5     42

 

> flights.data=na.omit(flights[,c('year','month','dep_delay','arr_delay','distance')])

>flights.sample<- flights.data[sample(1:nrow(flights.data),100,replace=FALSE),]

 

>head(flights.sample,5)

year month dep_delay arr_delay distance

155501 2013     3         2         5      184

2410   2013     1         0         4      762

64158  2013    11        -7       -27      509

221447 2013     5        -5       -12      184

281887 2013     8        -1       -10      937

The ddply function enables us to summarize the departure delays (mean, standard deviation) by year and month:

>ddply(flights.sample,.(year,month),summarize, mean_dep_delay=round(mean(dep_delay),2), s_dep_delay=round(sd(dep_delay),2))

year month mean_dep_delay sd_dep_delay

1  2013     1          -0.20         2.28

2  2013     2          23.85        61.63

3  2013     3          10.00        34.72

4  2013     4           0.88        12.56

5  2013     5           8.56        32.42

6  2013     6          58.14       145.78

7  2013     7          25.29        58.88

8  2013     8          25.86        59.38

9  2013     9          -0.38        10.25

10 2013    10           9.31        15.27

11 2013    11          -1.09         7.73

12 2013    12           0.00         8.58

Let us save the flights.sample dataset to a CSV file so that we can use the data to show us how to do the same thing in pandas:

>write.csv(flights.sample,file='nycflights13_sample.csv', quote=FALSE,row.names=FALSE)

Implementation in pandas

In order to do the same thing in pandas, we read the CSV file saved in the preceding section:

In [40]: flights_sample_df=pd.read_csv('nycflights13_sample.csv')

 

In [41]: flights_sample_df.head()

Out[41]: year   month   dep_delay   arr_delay   distance

0        2013   3       2       5       184

1        2013   1       0       4       762

2        2013   11      -7      -27     509

3        2013   5       -5      -12     184

4        2013   8       -1      -10     937

We achieve the same effect as ddply by making use of the GroupBy() operator:

In [44]: pd.set_option('precision',3)

In [45]: grouped = flights_sample_df.groupby(['year','month'])

 

In [48]: grouped['dep_delay'].agg([np.mean, np.std])

 

Out[48]:        mean    std

year    month          

2013    1       -0.20   2.28

        2       23.85   61.63

        3       10.00   34.72

        4       0.88    12.56

        5       8.56    32.42

        6       58.14   145.78

        7       25.29   58.88

        8       25.86   59.38

        9       -0.38   10.25

        10      9.31    15.27

        11      -1.09   7.73

        12      0.00    8.58
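
Passing numpy functions to agg works here, but the same result can also be written with the aggregation names as strings, the form generally preferred in later pandas versions; a sketch:

In [49]: grouped['dep_delay'].agg(['mean', 'std'])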

Reshaping using Melt

The melt function converts data from a wide format to a long format, with one row for each unique ID-variable combination.

The R melt() function

Here, we demonstrate the use of the melt() function in R. It produces long-format data in which the rows are unique variable-value combinations:

>sample4=head(flights.sample,4)[c('year','month','dep_delay','arr_delay')]

> sample4

         year month dep_delay arr_delay

  155501 2013     3          2         5

  2410   2013     1          0         4

  64158  2013    11         -7       -27

  221447 2013     5         -5       -12

 

>melt(sample4,id=c('year','month'))

         year month  variable value

       1 2013     3 dep_delay     2

       2 2013     1 dep_delay     0

       3 2013    11 dep_delay    -7

       4 2013     5 dep_delay    -5

       5 2013     3 arr_delay     5

       6 2013     1 arr_delay     4

       7 2013    11 arr_delay   -27

       8 2013     5 arr_delay   -12


For more information, you can refer to the following: http://www.statmethods.net/management/reshape.html.

The pandas melt() function

In pandas, the melt function is similar:

In [55]: sample_4_df=flights_sample_df[['year','month','dep_delay', \

'arr_delay']].head(4)

In [56]: sample_4_df

Out[56]:    year   month dep_delay arr_delay

        0   2013   3       2       5

        1   2013   1       0       4

        2   2013   11      -7      -27

        3   2013   5       -5      -12

 

In [59]: pd.melt(sample_4_df,id_vars=['year','month'])

Out[59]: year   month   variable        value

        0   2013   3       dep_delay        2

        1   2013   1       dep_delay        0

        2   2013   11      dep_delay       -7

        3   2013   5       dep_delay       -5

        4   2013   3       arr_delay        5

        5   2013   1       arr_delay        4

        6   2013   11      arr_delay       -27

        7   2013   5       arr_delay       -12

For more information on this, refer to: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-melt.
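
To go back from the long format to the wide one, pandas offers pivot_table, which is roughly the inverse of melt (the R counterpart would be dcast from the reshape2 package). Here is a sketch, assuming the melted frame from above is stored in a variable named melted; with unique year/month combinations the default mean aggregation simply returns the original values:

In [60]: melted = pd.melt(sample_4_df, id_vars=['year','month'])
         melted.pivot_table(index=['year','month'], columns='variable', values='value')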

Factors/categorical data

R refers to categorical variables as factors. The cut() function enables us to break a continuous numerical variable into ranges and treat the ranges as factors or categorical variables, or to classify values into larger bins.

An R example using cut()

Here is an example in R:

>clinical.trial<- data.frame(patient = 1:1000,

+                              age = rnorm(1000, mean = 50, sd = 5),

+                              year.enroll = sample(paste("19", 80:99, sep = ""),

+                              1000, replace = TRUE))

>summary(clinical.trial)

patient            age         year.enroll

 Min.   :   1.0   Min.   :31.14   1995   : 61 

 1st Qu.: 250.8   1st Qu.:46.77   1989   : 60 

Median : 500.5   Median :50.14   1985   : 57 

 Mean   : 500.5   Mean   :50.14   1988   : 57 

 3rd Qu.: 750.2   3rd Qu.:53.50   1990   : 56 

 Max.   :1000.0   Max.   :70.15   1991   : 55 

                                  (Other):654 

>ctcut<- cut(clinical.trial$age, breaks = 5)
> table(ctcut)

ctcut

(31.1,38.9] (38.9,46.7] (46.7,54.6] (54.6,62.4] (62.4,70.2]

         15         232         558         186           9

The reference for the preceding data can be found at: http://www.r-bloggers.com/r-function-of-the-day-cut/.

The pandas solution

Here is the pandas equivalent of the cut() function explained earlier (this applies only to version 0.15 and above):

In [79]: pd.set_option('precision',4)
         clinical_trial=pd.DataFrame({'patient':range(1,1001),
                                      'age': np.random.normal(50,5,size=1000),
                                      'year_enroll': [str(x) for x in np.random.choice(range(1980,2000),size=1000,replace=True)]})

 

In [80]: clinical_trial.describe()

Out[80]:        age       patient

count   1000.000  1000.000

mean    50.089    500.500

std     4.909     288.819

min     29.944    1.000

25%     46.572    250.750

50%     50.314    500.500

75%     53.320    750.250

max     63.458    1000.000

 

 

In [81]: clinical_trial.describe(include=['O'])

Out[81]:        year_enroll

count   1000

unique  20

top     1992

freq    62

 

 

In [82]: clinical_trial.year_enroll.value_counts()[:6]

Out[82]: 1992    62

         1985    61

         1986    59

         1994    59

         1983    58

         1991    58

dtype: int64

In [83]: ctcut=pd.cut(clinical_trial['age'], 5)

In [84]: ctcut.head()

Out[84]: 0    (43.349, 50.052]

         1    (50.052, 56.755]

         2    (50.052, 56.755]

         3    (43.349, 50.052]

         4    (50.052, 56.755]

         Name: age, dtype: category

         Categories (5, object): [(29.91, 36.646] < (36.646, 43.349] < (43.349, 50.052] < (50.052, 56.755] < (56.755, 63.458]]

 

In [85]: ctcut.value_counts().sort_index()

Out[85]: (29.91, 36.646]       3

         (36.646, 43.349]     82

         (43.349, 50.052]    396

         (50.052, 56.755]    434

         (56.755, 63.458]     85

dtype: int64
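
As in R, where cut() accepts a labels argument to name the factor levels, pd.cut() can also take explicit labels; here is a minimal sketch (the label names are illustrative):

In [86]: age_groups = pd.cut(clinical_trial['age'], 5,
                             labels=['very young','young','middle','older','oldest'])
         age_groups.value_counts()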

Summary

In this article, we have attempted to compare key features in R with their pandas equivalents in order to achieve the following objectives:

  • To assist R users who may wish to replicate the same functionality in pandas
  • To assist any users who, upon reading some R code, may wish to rewrite that code in pandas

The reference documentation for this article can be found at http://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html.

For more information on pandas, you can also refer to https://www.packtpub.com/big-data-and-business-intelligence/mastering-pandas-finance.
