
RevoScaleR Package

The RevoScaleR package comes with Microsoft Machine Learning Server (R Server) and SQL Server R Services. It is also available with R Client, but with some limitations, discussed in Chapter 2, Overview of Microsoft Machine Learning Server and SQL Server. Given the rapid development and constant upgrades, this chapter covers version 8.X and version 9.X; the latter also ships with SQL Server 2017. Changes and upgrades in version 9.X are not to be overlooked and will be covered as well.

The following topics are covered in this chapter:

  • Limitations of R challenged
  • Scalable and distributive computational environment
  • Functions for data preparation
  • Functions for descriptive statistics
  • Functions for statistical tests and sampling
  • Functions for predictive modeling

Primarily, this R package is designed to be handled in ecosystems where clients would be connecting to Microsoft...

Overcoming R language limitations

Prior to SQL Server 2016 (and 2017), BI developers and data scientists had OLAP cubes, the DMX language, and the Microsoft data mining algorithms available within SQL Server Analysis Services (SSAS). But with rapid changes and growing market demands, the need to integrate an open-source product (whether R, Python, Perl, or any other) was practically already there, and the next logical step was to do exactly that. Microsoft sought a solution and ended up acquiring Revolution Analytics, which put it back on track. Revolution R addressed major issues concerning the R language.

Microsoft addressed R's limitations, with many of the improvements aimed at faster data exploration and parallel programming techniques in R. In addition, MKL (Math Kernel Library) computations were enhanced, making matrix-wise calculations...

Scalable and distributive computational environments

The RevoScaleR package has the following functions available, which will be covered in detail throughout the chapter.

To get a list of all the ScaleR functions, the following T-SQL can be used:

EXEC sp_execute_external_script
      @language = N'R'
      ,@script = N'require(RevoScaleR)
                        OutputDataSet <- data.frame(ls("package:RevoScaleR"))'
WITH RESULT SETS
      (( Functions NVARCHAR(200)))  

You get a table in SSMS listing all the relevant rx- functions that can be used from the RevoScaleR package.

Based on this list, a simpler and clearer overview of the functions can be prepared:

Figure 1: List of RevoScaleR functions (source: Microsoft)

Functions for data preparation

Importing data is the first of many processes in data preparation. It is the process of bringing data into your system from an external file or by establishing a connection to a live data source. In the following part, we will look at importing data stored as SPSS or SAS files, and at using an ODBC connection string to connect directly to an external live database system.

Data import from SAS, SPSS, and ODBC

Importing data into R or SQL Server tables is not the main focus of the RevoScaleR library, but since it is on the list, let's briefly look into it. Depending on your data source, the RevoScaleR package provides many ways to connect...
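
The following is a minimal sketch of these connections (the file paths, server name, and connection string are placeholders, not taken from the book): SAS and SPSS files are wrapped in RevoScaleR data source objects and pulled in with rxImport, while a live database is reached through an ODBC data source.

library(RevoScaleR)

# Placeholder file paths -- adjust to your environment
sasFile  <- "C:\\Data\\claims.sas7bdat"
spssFile <- "C:\\Data\\survey.sav"

# Wrap the files in data source objects and import them into XDF files
rxImport(inData = RxSasData(sasFile), outFile = "claims.xdf", overwrite = TRUE)
rxImport(inData = RxSpssData(spssFile), outFile = "survey.xdf", overwrite = TRUE)

# A live database can be reached directly through an ODBC data source
conStr <- "Driver=SQL Server;Server=MYSERVER;Database=AdventureWorks;Trusted_Connection=yes"
odbcDS <- RxOdbcData(sqlQuery = "SELECT BusinessEntityID, SalesPersonID FROM Sales.Store",
                     connectionString = conStr)
storeData <- rxImport(inData = odbcDS)   # returns a data frame when no outFile is given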

Variable creation and data transformation

Variable creation and data transformation are two of the processes involved in data munging and data wrangling tasks. These tasks are important for proper data preparation and make the data easier to analyze in subsequent steps.

The functions that we will be exploring are as follows:

  • Variable creation and recoding
  • Data transformation
  • Handling missing values
  • Sorting, merging, and splitting datasets
  • Aggregation by category (sums, counts, and so on), similar to T-SQL aggregations and window functions

This part will cover some of the following functions, mainly focusing on data transformation, handling missing values, and splitting datasets:

RxDataSource, rxDataStep, rxDataStepXdf, RxFileSystem, rxFindFileInPath, rxFindPackage, rxFisherTest, RxForeachDoPar, rxGetInfo, rxGetInfoXdf, rxGetJobInfo, rxGetOption, rxGetVarInfo, rxGetVarNames...

Variable creation and recoding

Using rxGetVarInfo exposes information about the data.frame to the sp_execute_external_script output. It is obvious that some of these functions were never designed to present their output as a data.frame, but only to explore the dataset interactively. Some of these functions, rxGetVarInfo for example, give a nice output in the R environment, but are hard to shape into data frames for output to the SQL Server database:

EXEC sp_execute_external_script
      @language = N'R'
      ,@script = N'
                  library(RevoScaleR)
                  df_sql <- InputDataSet        
                  var_info <- rxGetVarInfo(df_sql)
                  OutputDataSet <- data.frame(unlist(var_info))'
      ,@input_data_1 = N'
                  SELECT 
                   BusinessEntityID
        ...
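
Since the output above focuses on inspecting variables, a short sketch of actually creating and recoding variables with rxDataStep and its transforms argument may help; the data frame and the derived columns below are illustrative assumptions rather than examples from the book:

library(RevoScaleR)

# Illustrative data frame standing in for the Sales.Store query used above
stores <- data.frame(BusinessEntityID = 1:5,
                     Name = c("Bike World", "Trail Blazers", "Rapid Cycles",
                              "Safe Cycles", "Uphill Co"),
                     SalesPersonID = c(279, NA, 282, 283, NA))

# transforms creates new variables and recodes existing ones, chunk by chunk
storesNew <- rxDataStep(inData = stores,
                        transforms = list(
                          HasSalesPerson = ifelse(is.na(SalesPersonID), 0L, 1L), # recode missing values
                          NameLength = nchar(as.character(Name))))               # derive a new variable
storesNew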

Dataset subsetting

Subsetting the data is also relatively straightforward using the rxDataStep() function:

EXEC sp_execute_external_script
      @language = N'R'
      ,@script = N'
                  library(RevoScaleR)
                  df_sql <- InputDataSet
                  df_sql_subset <- rxDataStep(inData = df_sql, varsToKeep = NULL, rowSelection = (BusinessEntityID<=1000))
                  OutputDataSet <- df_sql_subset'
      ,@input_data_1 = N'
                  SELECT 
                   BusinessEntityID
                  ,[Name]
                  ,SalesPersonID
                  FROM [Sales].[Store]'
WITH RESULT SETS
      ((
       BusinessEntityID INT
      ,[Name] NVARCHAR(MAX)
      ,SalesPersonID INT
      ));
  

Keep in mind that subsetting operations using R code might bring unnecessary memory and I/O costs, especially...

Dataset merging

The rxMerge() function merges two datasets into one. The datasets must be data frames (or in XDF format), and the function operates similarly to a JOIN clause in T-SQL (it should not be confused with T-SQL's MERGE statement). The two datasets are merged on one or more variables specified with the matchVars argument. In addition, when using the local compute context (as in the next sample), the sort order of the data needs to be handled as well, since R data frames, being collections of vectors, are not presorted and carry no sort order whatsoever. So, if no presorting is done, the autoSort argument must be set to true (autoSort = TRUE):

EXEC sp_execute_external_script
      @language = N'R'
      ,@script = N'
      library(RevoScaleR)
      df_sql <- InputDataSet
      someExtraData <- data.frame(BusinessEntityID = 1:1200, department...
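
As a self-contained sketch of the same idea, two small illustrative data frames (the Department values are invented) show the matchVars and autoSort arguments in action:

library(RevoScaleR)

# Two illustrative data frames sharing the key BusinessEntityID
stores    <- data.frame(BusinessEntityID = c(1L, 2L, 3L, 5L),
                        Name = c("Bike World", "Trail Blazers", "Rapid Cycles", "Uphill Co"))
extraData <- data.frame(BusinessEntityID = 1:4,
                        Department = c("Sales", "Sales", "Marketing", "Finance"))

# Inner merge on the matching variable; autoSort = TRUE sorts both inputs first,
# since local data frames carry no sort order of their own
merged <- rxMerge(inData1 = stores, inData2 = extraData,
                  matchVars = "BusinessEntityID",
                  type = "inner", autoSort = TRUE)
merged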

Functions for descriptive statistics

Descriptive statistics give insights into understanding data. They are summary statistics that describe a given dataset by summarizing its features and measures, such as central tendency and measures of spread (or variability). Central tendency includes the mean, median, and mode, whereas measures of variability include the range, quartiles, minimum and maximum values, variance and standard deviation, as well as skewness and kurtosis.

These statistics are covered by rx- functions in the RevoScaleR package, which means that you can use all the computational advantages of the package by calling rxSummary, rxCrossTabs, rxMarginals, rxQuantile, rxCube, and rxHistogram, without worrying about performance, out-of-memory exceptions, or which R package holds the right function.

We will be using the [Sales].[vPersonDemographics] view in the AdventureWorks...
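
As a sketch of how these calls look against that view (the connection string below is a placeholder; the column names come from the standard AdventureWorks view), the descriptive statistics can be requested directly through an RxSqlServerData data source:

library(RevoScaleR)

# Placeholder connection string -- adjust server and authentication to your environment
conStr <- "Driver=SQL Server;Server=MYSERVER;Database=AdventureWorks;Trusted_Connection=yes"
demographics <- RxSqlServerData(connectionString = conStr,
                                sqlQuery = "SELECT * FROM [Sales].[vPersonDemographics]")

# Mean, standard deviation, min, max, and valid/missing counts for two numeric columns
rxSummary(formula = ~ TotalChildren + NumberCarsOwned, data = demographics)

# Deciles of yearly purchase amounts
rxQuantile(varName = "TotalPurchaseYTD", data = demographics, probs = seq(0, 1, by = 0.1))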

Functions for statistical tests and sampling

Statistical tests are important for determining whether two (or more) variables are correlated and what the direction of that correlation is (positive, neutral, or negative). Statistically speaking, correlation is a measure of the strength and direction of the association between two variables. The RevoScaleR package supports the chi-squared test, Fisher's exact test, and the Kendall rank correlation. Depending on the types of the variables, you can distinguish between the Kendall, Spearman, and Pearson correlation coefficients.

For the chi-squared test, we will be using the rxChiSquaredTest() function, which uses a contingency table to see whether two variables are related. A small chi-squared test statistic means that the observed frequencies fit the expected frequencies (those assumed under independence) very well, indicating little evidence of an association between the variables. The formula for calculating chi-squared is as...
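
The statistic itself is the familiar sum of (observed - expected)^2 / expected over all cells of the contingency table. As a minimal sketch (the survey data frame below is invented for illustration), the contingency table is built with rxCrossTabs and then passed to rxChiSquaredTest:

library(RevoScaleR)

# Invented categorical data: does purchasing behaviour depend on gender?
survey <- data.frame(Gender    = factor(c("F", "F", "M", "M", "F", "M", "F", "M", "M", "F")),
                     Purchased = factor(c("Yes", "No", "Yes", "Yes", "No",
                                          "No", "Yes", "Yes", "No", "No")))

# Build the contingency table, then run the chi-squared test on it
xt <- rxCrossTabs(~ Gender : Purchased, data = survey, returnXtabs = TRUE)
rxChiSquaredTest(xt)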

Summary

This chapter has covered important functions (among many others) for data manipulation and data wrangling. These steps are essential for understanding the structure of a dataset, its content, and how the data is distributed. They are used mainly to understand frequencies and descriptive statistics, as well as statistical sampling and statistical correlations.

These steps must (or at least should) be done prior to data cleaning and data merging in order to gain a better understanding of the data. Cleaning the data is of the highest importance, as outliers can lead sensitive data (or any kind of data) to strange or false conclusions, and can also sway the results in another direction. So, treating these steps as highly important by using the powerful rx- functions (or classes) should be the task of every data engineer...
