SQL Server 2016 came with great new features, and among them was R integration into SQL Server, partly with advanced analytics and partly with new programmability capabilities. Microsoft R Services for SQL Server is part of the family of new extensibilities for highly scalable and parallel advanced analytics. R Services allows you to perform advanced analytics (statistical, multivariate statistics, predictive analytics, machine learning, and deep learning) on large quantities of data stored in the database. Microsoft published R Services as part of Microsoft R Server (MRS), which was specially designed for reading data directly from the SQL Server database within the same SQL Server computational context.
We will cover the following aspects in this chapter:
- Using R prior to SQL Server 2016
- Microsoft's commitment to the open source R language
- Boosting analytics with SQL Server R integration
- Outline of the book
The R language has been in the community since the '90s (even though its development started a decade before). With its open source GNU license, R gained popularity for its no-fuss installation and the ability to invoke any available package for additional statistical learning functions. This was a clear advantage for R, as there were not that many statistical programs available on the market in the '80s and '90s; in addition, most of them were not free. The extensibility of the core R engine with ever-emerging new packages gave a broader community of users more and more ways to use the R language for multiple purposes, on top of its strong statistical analysis and predictive modeling capabilities.
SQL Server 2005 introduced SQL Server Analysis Services (SSAS) data mining features that could be applied against a customer's existing rich data stored in SQL Server and SSAS OLAP cubes. This feature allowed users to use Data Mining Extensions (DMX) to create predictive queries. In the following years, several questions, requests, and ideas emerged on SQL forums, blogs, and community websites regarding additional statistical and predictive methods and approaches.
Back in 2011, I started working on the idea of extending the statistical analysis capabilities of SQL Server 2008 R2 with the help of the open source R language. One reason for that decision was to have the flexibility to run statistical analysis (from data provisioning to multivariate analysis) without first feeding the data into an OLAP cube; the other reason was more business oriented: the need to deliver statistical insights faster to all the people involved in data preparation, data munging, and data cleaning.
I started working on a framework that combined a T-SQL stored procedure with the R package RODBC (https://cran.r-project.org/web/packages/RODBC). The idea was simple: take the transactional or OLAP data, select the columns to analyze, and specify the analysis itself, from simple statistics to predictive analytics that stretched beyond SSAS, T-SQL, or CLR capabilities.
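The core of such a framework can be sketched in a few lines of R. This is a hypothetical illustration only, assuming the RODBC package and a SQL Server ODBC driver are installed; the server, database, view, and column names are taken from the example call below and are not the framework's actual code:

```r
# Illustrative sketch of the framework's R side: connect to SQL Server,
# pull the selected columns, and run the requested analysis in R.
library(RODBC)

# Trusted (Windows) connection to the local SQL Server instance; the
# server and database names mirror the example stored procedure call.
ch <- odbcDriverConnect(
  "Driver={SQL Server};Server=WORKSTATION-31;Database=AdventureWorks2012;Trusted_Connection=Yes")

# Select the transactional/OLAP columns chosen for analysis
df <- sqlQuery(ch, "SELECT Name FROM Sales.vStoreWithAddresses")

# Perform the analysis itself -- here, a simple summary as a stand-in
# for the statistical or predictive procedure the user requested
print(summary(df))

odbcClose(ch)
```

The stored procedure's role was then to generate and launch R scripts of this shape and to return their output to the caller.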
In practice, the framework was far from simple: calling the procedure involved a mixture of R code, T-SQL SELECT statements, and configuration settings for your R engine.
The stored procedure with all its parameters looked like this:
EXECUTE AdventureWorks2012.dbo.sp_getStatistics
     @TargetTable = '[vStoreWithAddresses]'
    ,@Variables = 'Name'
    ,@Statistics = '8'
    ,@ServerName = 'WORKSTATION-31'
    ,@DatabaseName = 'AdventureWorks2012'
    ,@WorkingDirectory = 'C:\DataTK'
    ,@RPath = 'C:\Program Files\R\R-3.0.3\bin';
The nuts-and-bolts explanation is outside the scope of this book and is well documented at http://www.sqlservercentral.com/articles/R+Language/106760/.
Looking back, the framework was received positively by the community, and many people on forums commented that they needed something similar for their daily business.
Besides pioneering the idea and bringing the R engine one step closer to SQL Server, however, the framework had many flaws. The major one was security: because it needed access to a local working directory to generate R files to be run by the vanilla R engine, it required xp_cmdshell to be enabled. The following reconfiguration was mandatory, and many sysadmins would not approve of it:
EXECUTE SP_CONFIGURE 'xp_cmdshell', 1;
GO
RECONFIGURE;
GO
EXECUTE SP_CONFIGURE 'Ole Automation Procedures', 1;
GO
RECONFIGURE;
GO
In addition, the framework needed access to an R engine installation, together with the R packages required to execute the desired code. Installing open source programs and granting read/write access was again a drawback in terms of security and corporate software policy. One of the bigger issues, once everything was installed and put into production, was performance and memory. R is memory-bound, meaning all computations are done in memory: if your dataset is bigger than the available memory, the only result you will get is error messages. Speed was another aspect of performance. With no parallel or distributed computation, the framework was bound by the skill of each package's author; for example, a package whose computational core was written in C or C++ rather than in pure R performed correspondingly better.
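The in-memory constraint is easy to demonstrate. The snippet below is a simple illustration (not part of the framework): even a single numeric column of a million rows already occupies about 8 MB, so wide tables with tens of millions of rows quickly exhaust the memory available to a vanilla R session:

```r
# R keeps every object fully in RAM, so object size is a useful
# sanity check before pulling a large table from SQL Server.
sales <- data.frame(amount = rnorm(1e6))  # one numeric column, 1e6 rows

# Roughly 8 bytes per double, plus data.frame overhead
print(object.size(sales), units = "MB")
```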
The great part of this framework was its ability to deliver results from statistical analysis or predictive modeling much faster, because it could take OLTP or any other data that needed statistical analysis directly. Furthermore, statisticians and data scientists could prepare R code, store it in a table, and have it run later by data wranglers, data analysts, or data stewards. One version of the truth was therefore maintained: there was no need for data movement or data copying, and all users were reading the same data source. In terms of predictive modeling, the framework also enabled users to take advantage of various additional predictive algorithms (for example, decision forest, glm, CNN, SVM, and word cloud) that were not part of SSAS Data Mining at that time.
Pros and cons aside, the framework was a successful initial attempt to gain more data insights that could easily be distributed among different business units by pushing visualizations into SQL Server Reporting Services. In the years prior to the release of SQL Server 2016, I met people from the SQL Server community who had developed similar frameworks in order to push predictions into the SQL Server database to support business applications and solutions. With SQL Server 2016, many such solutions were internalized and brought closer to the SQL Server engine to achieve better performance and to address many of these issues and shortcomings.
With its growing popularity and community, R has become, and continues to be, a big player in the field of advanced analytics and data visualization. R and machine learning servers (or services) are not just buzzwords that will be forgotten in the next cycle of SQL Server; they are infiltrating more and more layers of open source and corporate software. In the past five years, many big analytical players have introduced R integration, interpreters, and wrappers for the R language, because of the language's practicality, usability, interdisciplinary nature, and open source orientation. As Microsoft makes a bold and strategic move toward being open source friendly, the use cases for integrating R in SQL Server are growing, making this move even more natural and well timed. It has been very well received in the SQL community and in business alike.
In comparison to other big analytical vendors, Microsoft took integration very seriously. It addressed many of the issues and limitations of the language itself and created a complete integration of R with SQL Server in order to give the best user experience. Many competitors (such as SAS, IBM, SAP, and Oracle) have done similar integrations, but failed to take into account many aspects that contribute to a holistic user experience. Microsoft has announced that joining the R Consortium will give it the ability to help develop the R language and to support its future development. In addition, Microsoft has created its own package repository, called MRAN (from CRAN, where M stands for Microsoft), and offers support and SLA agreements for R as well, even though the language and engine are based on open source R. All these steps show how dedicated Microsoft is to bringing an open source statistical and programming language into the SQL Server environment.
We can only expect more R integration into other services. For example, Power BI has supported native R visuals (https://powerbi.microsoft.com/en-us/blog/r-powered-custom-visuals) since October 2016, and the R language since December 2015. I am therefore a strong believer that R will soon be a native part of the whole SQL Server ecosystem, including SSAS, SSIS, and SSRS. With Azure Analysis Services, R is again one step closer to analysis services.
Data science is at the forefront of SQL Server and R integration, but R integration should not be restricted to that field alone; it should be explored and used in all tasks. Every task performed by a DBA, sysadmin, analyst, data wrangler, or any other role working with SQL Server can be supported by statistics, data correlation, data analysis, or data prediction. A DBA, for instance, can gain from R integration by switching from merely monitoring tasks to understanding and predicting what might or will happen next, and the same idea applies to sysadmins, data wranglers, and so on. R integration also brings different roles closer together, helping people understand statistics, metrics, and measures, and learn how to improve them using statistical analysis and predictions.
Besides turning siloed individual work into more coherent and cohesive teamwork, R integration also means less data movement, because different users can now, with the help of R code, execute, drill down into, and get a feel for the data, instead of waiting for it to be prepared, exported, and imported again. With smoother workflows comes faster time to deployment, whether for a simple report, a predictive model, or an analysis. This allows the boundaries of data ownership to shift toward insights ownership, which enables faster reactions to business needs.
In the past year, we have also seen much more interest in data science in the Microsoft stack. With R integration, Azure Machine Learning, and Power BI, all users who want to learn new skills have great starting points in the available products.
Starting with SQL Server 2016, R integration became a very important part of the SQL Server platform. Since the public release of SQL Server 2016, up to February 2018 (the time of writing), the community has embraced both R and Python very well, making data exploration and data analysis part of general database tasks. Microsoft addressed many of the earlier issues and broadened SQL Server as a product. With SQL Server 2017, Python was added as a second analytical language, reaching an even broader community as well as businesses, while at the same time taking care of data scalability, performance, and security.
In the next chapter, we will cover the different R distributions and IDE tools for using R standalone or within SQL Server, and the differences among them to consider when deciding which one to choose.