
RevoScaleR Package

The RevoScaleR package comes with Microsoft Machine Learning Server (R Server) and SQL Server R Services. It is also available with R Client, but with some limitations, discussed in Chapter 2, Overview of Microsoft Machine Learning Server and SQL Server. Given the rapid development and constant upgrades, this chapter covers version 8.X and version 9.X; the latter also ships with SQL Server 2017. Changes and upgrades in version 9.X are not to be overlooked and will be covered as well.

The following topics are covered in this chapter:

  • Limitations of R challenged
  • Scalable and distributive computational environment
  • Functions for data preparation
  • Functions for descriptive statistics
  • Functions for statistical tests and sampling
  • Functions for predictive modeling

Primarily, this R package is designed to be handled in ecosystems where clients would be connecting to Microsoft...

Overcoming R language limitations

Prior to SQL Server 2016 (and 2017), BI developers and data scientists had OLAP cubes, the DMX language, and the Microsoft data mining algorithms available within SQL Server Analysis Services (SSAS). But with rapid changes and growing market demands, the need to integrate an open-source product (whether R, Python, Perl, or any other) was practically already there, and the next logical step was to do exactly that. Microsoft sought a solution and ended up acquiring Revolution Analytics, which put it back on track. Revolution R addressed major issues concerning the R language.

Microsoft addressed R's limitations, with many of the improvements aimed at faster data exploration and parallel programming techniques in R. In addition, MKL (Math Kernel Library) computations were enhanced, making matrix-wise calculations...

Scalable and distributive computational environments

The RevoScaleR package has the following functions available, which will be covered in detail throughout the chapter.

To get a list of all the ScaleR functions, the following T-SQL can be used:

EXEC sp_execute_external_script
      @language = N'R'
      ,@script = N'require(RevoScaleR)
                        OutputDataSet <- data.frame(ls("package:RevoScaleR"))'
WITH RESULT SETS
      (( Functions NVARCHAR(200)))  

You get a table in SSMS listing all the relevant rx- functions that can be used from the RevoScaleR package.

Based on this list, a simpler and clearer overview of the functions can be prepared:

Figure 1: List of RevoScaleR functions (source: Microsoft)

Functions for data preparation

Importing data is the first of many processes in data preparation. It is the process of bringing data into your system from an external file or by establishing a connection to a live data source. In the following part, we will look at importing data stored as SPSS or SAS files, and at using an ODBC connection string to connect directly to an external live database system.

Data import from SAS, SPSS, and ODBC

Importing data into R or SQL Server tables is not the main focus of the RevoScaleR library, but since it is on the list, let's briefly look into it. Depending on your data source, the RevoScaleR package provides many ways to connect...
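
The following is a minimal sketch of these connections (the file paths, server name, and connection string are placeholders, not taken from the book): SAS and SPSS files are wrapped in RevoScaleR data source objects and pulled in with rxImport, while a live database is reached through an ODBC data source.

library(RevoScaleR)

# Placeholder file paths -- adjust to your environment
sasFile  <- "C:\\Data\\claims.sas7bdat"
spssFile <- "C:\\Data\\survey.sav"

# Wrap the files in data source objects and import them into XDF files
rxImport(inData = RxSasData(sasFile), outFile = "claims.xdf", overwrite = TRUE)
rxImport(inData = RxSpssData(spssFile), outFile = "survey.xdf", overwrite = TRUE)

# A live database can be reached directly through an ODBC data source
conStr <- "Driver=SQL Server;Server=MYSERVER;Database=AdventureWorks;Trusted_Connection=yes"
odbcDS <- RxOdbcData(sqlQuery = "SELECT BusinessEntityID, SalesPersonID FROM Sales.Store",
                     connectionString = conStr)
storeData <- rxImport(inData = odbcDS)   # returns a data frame when no outFile is given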

Variable creation and data transformation

Variable creation and data transformation are two of the processes involved in data munging and data wrangling tasks. These tasks are important for proper data preparation and make the data easier to analyze in subsequent steps.

The functions that we will be exploring are as follows:

  • Variable creation and recoding
  • Data transformation
  • Handling missing values
  • Sorting, merging, and splitting datasets
  • Aggregation by category (sums, counts, and so on), similar to T-SQL aggregations and window functions

This part will cover some of the following functions, mainly focusing on data transformation, handling missing values, and splitting datasets:

RxDataSource, rxDataStep, rxDataStepXdf, RxFileSystem, rxFindFileInPath, rxFindPackage, rxFisherTest, RxForeachDoPar, rxGetInfo, rxGetInfoXdf, rxGetJobInfo, rxGetOption, rxGetVarInfo, rxGetVarNames...

Variable creation and recoding

Using rxGetVarInfo exposes information about the data.frame to the sp_execute_external_script output. It is obvious that some of these functions were never designed to present their output as a data.frame, but only to explore the dataset interactively. Some of these functions, rxGetVarInfo for example, give a nice output in the R environment, but are hard to shape into data frames for output to the SQL Server database:

EXEC sp_execute_external_script
      @language = N'R'
      ,@script = N'
                  library(RevoScaleR)
                  df_sql <- InputDataSet        
                  var_info <- rxGetVarInfo(df_sql)
                  OutputDataSet <- data.frame(unlist(var_info))'
      ,@input_data_1 = N'
                  SELECT 
                   BusinessEntityID
        ...
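
Since the output above focuses on inspecting variables, a short sketch of actually creating and recoding variables with rxDataStep and its transforms argument may help; the data frame and the derived columns below are illustrative assumptions rather than examples from the book:

library(RevoScaleR)

# Illustrative data frame standing in for the Sales.Store query used above
stores <- data.frame(BusinessEntityID = 1:5,
                     Name = c("Bike World", "Trail Blazers", "Rapid Cycles",
                              "Safe Cycles", "Uphill Co"),
                     SalesPersonID = c(279, NA, 282, 283, NA))

# transforms creates new variables and recodes existing ones, chunk by chunk
storesNew <- rxDataStep(inData = stores,
                        transforms = list(
                          HasSalesPerson = ifelse(is.na(SalesPersonID), 0L, 1L), # recode missing values
                          NameLength = nchar(as.character(Name))))               # derive a new variable
storesNew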

Dataset subsetting

Subsetting the data is also relatively straightforward using the rxDataStep() function:

EXEC sp_execute_external_script
      @language = N'R'
      ,@script = N'
                  library(RevoScaleR)
                  df_sql <- InputDataSet
                  df_sql_subset <- rxDataStep(inData = df_sql, varsToKeep = NULL, rowSelection = (BusinessEntityID<=1000))
                  OutputDataSet <- df_sql_subset'
      ,@input_data_1 = N'
                  SELECT 
                   BusinessEntityID
                  ,[Name]
                  ,SalesPersonID
                  FROM [Sales].[Store]'
WITH RESULT SETS
      ((
       BusinessEntityID INT
      ,[Name] NVARCHAR(MAX)
      ,SalesPersonID INT
      ));
  

Keep in mind that subsetting operations using R code might bring unnecessary memory and I/O costs, especially...

Dataset merging

The rxMerge() function merges two datasets into one. The datasets must be data frames (or in XDF format), and the function operates similarly to a JOIN clause in T-SQL (it should not be confused with T-SQL's MERGE statement). The two datasets are merged on one or more variables specified with the matchVars argument. In addition, when using the local compute context (as in the next sample), the sort order of the data needs to be handled as well, since R data frames, being collections of vectors, are not presorted and carry no sort order whatsoever. So, if no presorting is done, the autoSort argument must be set to true (autoSort = TRUE):

EXEC sp_execute_external_script
      @language = N'R'
      ,@script = N'
      library(RevoScaleR)
      df_sql <- InputDataSet
      someExtraData <- data.frame(BusinessEntityID = 1:1200, department...
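
As a self-contained sketch of the same idea, two small illustrative data frames (the Department values are invented) show the matchVars and autoSort arguments in action:

library(RevoScaleR)

# Two illustrative data frames sharing the key BusinessEntityID
stores    <- data.frame(BusinessEntityID = c(1L, 2L, 3L, 5L),
                        Name = c("Bike World", "Trail Blazers", "Rapid Cycles", "Uphill Co"))
extraData <- data.frame(BusinessEntityID = 1:4,
                        Department = c("Sales", "Sales", "Marketing", "Finance"))

# Inner merge on the matching variable; autoSort = TRUE sorts both inputs first,
# since local data frames carry no sort order of their own
merged <- rxMerge(inData1 = stores, inData2 = extraData,
                  matchVars = "BusinessEntityID",
                  type = "inner", autoSort = TRUE)
merged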

Functions for descriptive statistics

Descriptive statistics give insights into understanding data. They are summary statistics that describe a given dataset by summarizing its features and measures, such as central tendency and measures of spread (or variability). Central tendency includes the mean, median, and mode, whereas measures of variability include the range, quartiles, minimum and maximum values, variance and standard deviation, as well as skewness and kurtosis.

These statistics are covered by rx- functions in the RevoScaleR package, which means that you can use all the computational advantages of the package by calling rxSummary, rxCrossTabs, rxMarginals, rxQuantile, rxCube, and rxHistogram, without worrying about performance, out-of-memory exceptions, or which R package holds the right function.

We will be using the [Sales].[vPersonDemographics] view in the AdventureWorks...
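
As a sketch of how these calls look against that view (the connection string below is a placeholder; the column names come from the standard AdventureWorks view), the descriptive statistics can be requested directly through an RxSqlServerData data source:

library(RevoScaleR)

# Placeholder connection string -- adjust server and authentication to your environment
conStr <- "Driver=SQL Server;Server=MYSERVER;Database=AdventureWorks;Trusted_Connection=yes"
demographics <- RxSqlServerData(connectionString = conStr,
                                sqlQuery = "SELECT * FROM [Sales].[vPersonDemographics]")

# Mean, standard deviation, min, max, and valid/missing counts for two numeric columns
rxSummary(formula = ~ TotalChildren + NumberCarsOwned, data = demographics)

# Deciles of yearly purchase amounts
rxQuantile(varName = "TotalPurchaseYTD", data = demographics, probs = seq(0, 1, by = 0.1))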

Functions for statistical tests and sampling

Statistical tests are important for determining whether two (or more) variables are correlated and what the direction of that correlation is (positive, neutral, or negative). Statistically speaking, correlation is a measure of the strength and direction of the association between two variables. The RevoScaleR package supports the chi-squared test, Fisher's exact test, and the Kendall rank correlation. Depending on the types of the variables, you can distinguish between the Kendall, Spearman, and Pearson correlation coefficients.

For the chi-squared test, we will be using the rxChiSquaredTest() function, which uses a contingency table to see whether two variables are related. A small chi-squared test statistic means that the observed frequencies fit the expected frequencies (those assumed under independence) very well, indicating little evidence of an association between the variables. The formula for calculating chi-squared is as...
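
The statistic itself is the familiar sum of (observed - expected)^2 / expected over all cells of the contingency table. As a minimal sketch (the survey data frame below is invented for illustration), the contingency table is built with rxCrossTabs and then passed to rxChiSquaredTest:

library(RevoScaleR)

# Invented categorical data: does purchasing behaviour depend on gender?
survey <- data.frame(Gender    = factor(c("F", "F", "M", "M", "F", "M", "F", "M", "M", "F")),
                     Purchased = factor(c("Yes", "No", "Yes", "Yes", "No",
                                          "No", "Yes", "Yes", "No", "No")))

# Build the contingency table, then run the chi-squared test on it
xt <- rxCrossTabs(~ Gender : Purchased, data = survey, returnXtabs = TRUE)
rxChiSquaredTest(xt)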

Summary

This chapter has covered important functions (among many others) for data manipulation and data wrangling. These steps are essential for understanding the structure of a dataset, its content, and how the data is distributed. They are used mainly to understand frequencies and descriptive statistics, as well as statistical sampling and statistical correlations.

These steps must (or at least should) be done prior to data cleaning and data merging in order to gain a better understanding of the data. Cleaning the data is of the highest importance, as outliers can lead sensitive data (or any kind of data) to strange or false conclusions, and can also sway the results in another direction. So, treating these steps as highly important by using the powerful rx- functions (or classes) should be the task of every data engineer...
