Using SparkR for computing summary statistics


The describe (or summary) operation creates a new DataFrame containing the count, mean, standard deviation, min, and max values for a specified DataFrame or a list of its numerical columns:

> sumstatsdf <- describe(df, "duration", "campaign", "previous", "age")

> showDF(sumstatsdf)
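Since summary() serves the same purpose as describe() on a SparkDataFrame, the same statistics can also be obtained as follows; a minimal sketch assuming the same df (the variable name sumstatsdf2 is illustrative):

> sumstatsdf2 <- summary(df) # summary() returns the same kind of statistics DataFrame as describe()

> showDF(sumstatsdf2)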

Computing all of these values on a large DataFrame can be computationally expensive. Hence, we show how to compute each of these statistical measures individually:

> avgagedf <- agg(df, mean = mean(df$age))

> showDF(avgagedf) # Print this DF
+-----------------+
| mean            |
+-----------------+
|40.02406040594348|
+-----------------+

Next, we create a DataFrame that lists the minimum and maximum values and the range width:

> agerangedf <- agg(df, minimum = min(df$age), maximum = max(df$age), range_width = abs(max(df$age) - min(df$age)))

> showDF(agerangedf)

 

Next, we compute the sample variance and standard deviation as shown here:

> agevardf <- agg...
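A minimal sketch of how this step can be written, assuming SparkR's var() and sd() aggregate functions (the column names variance and std_dev are illustrative):

> agevardf <- agg(df, variance = var(df$age), std_dev = sd(df$age)) # assumed form: sample variance and standard deviation of age

> showDF(agevardf)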