Introduction to R for Business Intelligence: Profit optimization using data mining, data analysis, and Business Intelligence

What do you get with Print?

Instant access to your digital copy whilst your Print order is Shipped

Paperback book shipped to your preferred address

Redeem a companion digital copy on all Print orders

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

Introduction to R for Business Intelligence

Chapter 1. Extract, Transform, and Load

Business may focus on profits and sales, but business intelligence (BI) focuses on data. Activities reliant on data require the business analyst to acquire it from diverse sources. The term Extract, Transform, and Load, commonly referred to as ETL, is a deliberate process to get, manipulate, and store data to meet business or analytic needs. ETL is the starting point for many business analytic projects. Poorly executed ETL may affect a business in the form of added cost and lost time to make decisions. This chapter covers the following four key topics:

Understanding big data in BI analytics
Extracting data from sources
Transforming data to fit analytic needs
Loading data into business systems for analysis

This chapter presents each ETL step within the context of the R computational environment. Each step is broken down into finer levels of detail and includes a variety of situations that business analysts encounter when executing BI in a big data business world.

Understanding big data in BI analytics

Before we begin describing the ETL process, consider its importance in business intelligence. CIO Magazine provides a popular and useful definition of BI (Mulcahy, 2007):

"Business intelligence, or BI, is an umbrella term that refers to a variety of software applications used to analyze an organization's raw data. BI as a discipline is made up of several related activities, including data mining, online analytical processing, querying and reporting."

Mulcahy captures the essence of this book, which presents solutions in R to walk you through the steps from data analytic techniques to communicating your results. The purpose of BI applications has changed over the last decade as big data challenges affect the business world in ways first experienced in the sciences decades ago.

You can find the term big data in many business settings. It appears in advertisements for boot camps, draws attendees to conferences, and perplexes business leaders. Arguably, the term is ill-defined. A 1998 presentation given by John Mashey, then the Chief Scientist of Silicon Graphics, is often cited as the document that introduced the term (Press, 2013). The impact of big data on business is undeniable, despite its elusive meaning. There is a general agreement on the following three characteristics of big data, called the 3Vs:

Volume: The size of datasets has grown from megabytes to petabytes
Velocity: The speed of data arrival has changed to near real time
Variety: The sources of data have grown from structured databases to unstructured ones, such as social media, websites, audio, and video

Together these three characteristics pose a growing challenge to the business community. Data is stored in facilities across a vast network of local servers or relational databases. Virtual software access it with cloud-based applications. BI applications have typically included static dashboards based on fixed measures using structured data. Big data changes the business by affording a competitive advantage to those who can extract value from the large and rapidly changing sources of diverse data.

Today, people ask business analysts, what is going to happen? To answer this type of question, a business needs tools and processes to tap into the growing stream of data. Often this data will not fit into the existing databases without transformation. The continual need to acquire data requires a structured ETL approach to wrangle the unstructured nature of modern data. As you read this chapter, think about how companies may benefit from using the techniques presented, even when they are less complex than big data.

Note

Use case: Bike Sharing, LLC

You will begin your exploration of BI and analytics through the lens of a fictional business called Bike Sharing, LLC. The company operates and maintains a fleet of publically rental bikes in the Washington D.C. metropolitan area. Their customers are typically from the urban area, including people from business, government, and universities. Customers enjoy the convenience of finding bikes easily within a network of bike-sharing stations throughout the city. Renters may rent a bicycle at one location and leave it at another station.Bike Sharing, LLC started operations in 2011, and has enjoyed continued growth. They quickly established a BI group to keep track of the data collected about transactions, customers, and factors related to rentals, such as weather, holidays, and times of day. In 2014, they began to understand how they might use open source datasets to guide decisions regarding sales, operations, and advertising. In 2015, they expanded their BI talent pool with business analysts experienced with R and statistical methods that could use Bike Sharing data in new ways.

You joined Bike Sharing just a few months ago. You have a basic understanding of R from the many courses and tutorials that you used to expand your skills. You are working with a good group that has a diverse skillset, including programming, databases, and business knowledge. The first data you have been given is bike rental data covering the two-year period from Jan 1, 2011 to Dec 31, 2012 (Kaggle, 2014). You can download this same Ch1_bike_sharing_data.csv file from the book's website at http://jgendron.github.io/com.packtpub.intro.r.bi/.

Data sources often include a data dictionary to help new users understand the contents and coding of the data. Data Dictionary for Bike Sharing Data (Kaggle, 2014):

datetime: Hourly date + timestamp
season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday: Whether the day is considered a holiday
workingday: Whether the day is neither a weekend nor holiday
weather:
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp: Temperature in Celsius
atemp: Feels like temperature in Celsius
humidity: Relative humidity
windspeed: Wind speed
casual: Number of non-registered user rentals initiated
registered: Number of registered user rentals initiated
count: Number of total rentals

One of your goals is to strengthen your ETL skills. In this use case, you will learn common extraction, transformation, and loading skills to store a dataset in a file for analysis. Welcome to the Bike Sharing team.

Extracting data from sources

Now we get to work. The aim of this book is to design, develop, and deliver a business intelligence product-a data product. In this section, you will explore two extraction methods to import data from different types of sources:

Importing CSV and other file formats
Importing data from relational databases

Depending on your background, you may be more or less familiar with both types of extraction methods. Here, you will learn or refresh your knowledge of both in order to have a more complete working understanding of ETL.

Importing CSV and other file formats

You can load the Bike Sharing data file into the R environment by using the read.csv() function. It is an easy and commonly used way to load CSV files:

bike <- read.csv("{filename}.csv")

Calling this function will read the file as long as it is located in your R working directory. The working directory is the space R uses, much like a home directory. There are two commands that you can use to check and change your working directory:

getwd(): This will return the current working directory as a path in the R console
setwd(<path>): This is used in the console to change the working directory to <path> you pass in the function

If you try to read a file that is not in your working directory, you will see an error message. Some analysts manage data in a separate data directory, one level under their working directory. In this case, you can add a path in front of the filename. The following example shows how the data file is located one layer down in a data directory. Adding the ./data/ string to the front of the filename will allow R to access the data:

bike <- read.csv("./data/Ch1_bike_sharing_data.csv") 
str(bike)

The str() function is not required to import the data, but it does provide a confirmation that the data was read into the R environment. It also provides you with a quick look at the structure of the dataset, its dimensions, and the names of the variables:

'data.frame':  17379 obs. of  12 variables:
 $ datetime  : Factor w/ 17379 levels "1/1/2011 0:00",..: 1 2 13  ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
 $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
 $ windspeed : num  0 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int  16 40 32 13 1 1 2 3 8 14 ...

The read.csv() function uses default parameters that are consistent with CSV files, such as using a comma (,) as a separator. The R environment has a more flexible read option for instances when you may have tab-delimited data or TXT file. This option is the read.table() function:

bike <- read.table("./data/Ch1_bike_sharing_data.csv", sep = ",", 
                   header = TRUE)

This function performs identically to the read.csv() function. In fact, read.csv() is built on read.table() and uses sep = "," and header = TRUE as default parameters.

Tip

Downloading the example code

Detailed steps to download the code bundle are mentioned in the Preface of this book. Please have a look.

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Introduction-To-R-For-Business-Intelligence . We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/ . Check them out!

Importing data from relational databases

You can also use R to access the data stored in many relational databases through the Open Database Connectivity (ODBC) application programming interface. In R, you do this using the RODBC package. This package provides you with an advantage for legacy data. By establishing an R connection to relational databases at Bike Sharing, LLC, you can reuse existing SQL-type queries that may already exist in the BI group.

Imagine that the Bike Sharing dataset was not a CSV file, but it was instead stored in a relational database. You can access the data by loading the RODBC package and establishing a connection to the database using the following parameters: data source name, user ID, and password. In this notional example, the user Paul would like to access the data source ourDB using his password R4BI:

library(RODBC) 
connection <- odbcConnect(dsn = "ourDB", uid = "Paul", pwd = "R4BI")

Note

As you follow along with this code, you will only be able to run the odbcConnect() function if you have configured a database source. Advanced users may try this using an Excel file. A tutorial for Windows-based environments is provided by Cyberfella LTD (2013) at http://www.cyberfella.co.uk/2012/06/11/windows7-odbc/ .

Having established a connection to your database, you can use R to write a SQL query, pass this query to the sqlQuery() function, and read the data into a bike data frame. You learned that there is a table in the database called marketing:

query <- "SELECT * FROM marketing" 
bike <- sqlQuery(connection, query) 
close(connection)

The close() function closes the connection between R and the database. Doing this will free any memory consumed by the open connection. It is a good practice to follow.

Tip

R tip: One thing you can do when establishing connections is to add a log to record your ETL processes. In the event something goes wrong, the log will capture the issue and help you find the problem quickly. The log4r package is specifically written for this application. For more information, visit the repository at https://github.com/johnmyleswhite/log4r .

There are also non-relational databases in the work place. Non-relational databases store documents, key value data, and unstructured data. Unlike the relational database approach, they contain different data structures grouped into nodes, clusters, or other schemas. The R community has created various packages to connect with non-relational databases and distributive filing systems. You can connect to MongoDB using the rmongodb package and to Spark distributed computing using the SparkR package.

Tip

BI tip: By looking at the database approach used within an organization and conceptualizing the data, a business analyst can better understand business processes to improve the quality of analysis and meet organizational needs.

Transforming data to fit analytic needs

In the previous section, you learned how to extract data and import it into R from various sources. Now you can transform it to create subsets of the data. This is useful to provide other team members with a portion of the data they can use in their work without requiring the complete dataset. In this section, you will learn the following four key activities associated with transformation:

Filtering data rows
Selecting data columns
Adding a calculated column from existing data
Aggregating data into groups

You will learn how to use functions from the dplyr package to perform data manipulation. If you are familiar with SQL, then dplyr is similar in how it filters, selects, sorts, and groups data. If you are not familiar with SQL, do not worry. This section will introduce you to the dplyr package. Learn more about the dplyr package by typing browseVignettes(package = "dplyr") into your R console.

Note

Request from marketing: Marketing would like an extract from the data with revenues during the spring and summer seasons from days when only casual users rent bikes. They want a small CSV file with just the number of casual renters and revenue grouped by season.

Filtering data rows

You will filter rows with dplyr using the filter() function to extract a subset of rows that meet the criteria defined with logical operators such as those shown in the following table (RStudio, 2015). Read more about this by typing ?Comparison or ?base::Logic in the R console:

You can use these operators to pass a criterion, or many criteria, into your filter. Marketing would like to know how many times during spring or summer that only casual users rented bikes. You can begin creating a subset of the data by using the filter() function along with the == operator and the or Boolean ( | ). Place the results in a temporary extracted_rows data frame:

library(dplyr) 
extracted_rows <- filter(bike, registered == 0,  
                         season == 1 | season == 2) 
dim(extracted_rows)

We get the following output:

[1] 10 12

The dim() function call shows only 10 observations meet the filter criteria. This demonstrates the power of filtering larger datasets.

There are various ways of transforming the data. You can create an identical dataset using the %in% operator. This operator looks at each row (observation) and determines whether it is a member of the group based on criteria you specify. The first parameter is the name of the data frame, the second and successive parameters are filtering expressions:

using_membership <- filter(bike, registered == 0, season %in% c(1, 2)) 
identical(extracted_rows, using_membership)

We get the output as follows:

[1] TRUE

The identical() function compares any two R objects and returns TRUE if they are identical and FALSE otherwise. You created a subset of data by filtering rows and saving it in a separate data frame. Now you can select columns from that.

Selecting data columns

The select() function extracts columns you desire to retain in your final dataset. The marketing team indicated they were only interested in the season and casual renters. They did not express interest in environmental conditions or holidays. Providing team members data products that meet their specification is an important way to sustain relationships.

You can extract the required columns from extracted_rows and save these in another temporary extracted_columns data frame. Pass the select() function, data frame, and names of the columns to extract, season and casual:

extracted_columns <- select(extracted_rows, season, casual)

The following table provides a view of the first two observations from the subset data frame you generated. You notice that there is something missing. Marketing wants to know the number of casual renters and revenue by season. There is no revenue variable in the data you are using. What can you do about this? You can add a column to the data frame, as described in the following section:

Adding a calculated column from existing data

For your particular situation, you will be adding a calculated column. You asked marketing about the structure of rental costs and learn that casual renters pay five dollars for a day pass. You figured out that all you have to do is multiply the number of casual renters by five to get the revenues for each day in your data frame.

The mutate() function will calculate and add one or more columns, depending on the parameters. Your parameters include the data frame and an expression indicating the name of the new column and the calculation to create the revenue:

add_revenue <- mutate(extracted_columns, revenue = casual * 5)

The output will be as follows:

Adding a calculated column from existing data

Perfect! You are nearly done. All you have left to do is to group and summarize the data into a final data frame.

Aggregating data into groups

The dplyr package provides you the group_by() and summarise() functions to help you aggregate data. You will often see these two functions used together. The group_by() function takes the data frame and variable on which you would like to group the data as parameters, in your case, season:

grouped <- group_by(add_revenue, season)

The summarise() function takes the data frame and all the variables you want to summarize as parameters. This also requires you to specify how you would like to summarize them. You may chose an average, minimum, maximum, or sum. Marketing wants to know total rentals and revenue by season, so you will use the sum() function:

report <- summarise(grouped, sum(casual), sum(revenue))

We get the output as follows:

This looks like it is what the marketing group wants. Now you have to deliver it.

Tip

R tip: There are many other transformation functions provided by the dplyr package. You can type ??dplyr in your R console to get more information.

Loading data into business systems for analysis

You have imported and transformed data. It resides within your R environment as a data frame. Now you will need to provide it to marketing. The following are two common ways to export data from R into a file for use elsewhere in an organization:

Writing data to a CSV file
Writing data to a tab-delimited text file

These methods are similar, but they produce different results. Knowing about them and their differences will help you decide the format you would like to use.

Writing data to a CSV file

CSV files are common among data applications. Other data applications, such as Excel, can read these types of file. CSV files are also useful because database systems can typically import them into their environment, just as you imported a CSV into the R environment. The write.csv() function is used to write a data frame to a CSV file. In this example, the input parameters include report and the name of the output file, revenue_report.csv:

write.csv(report, "revenue_report.csv", row.names = FALSE)

You also used a row.names = FALSE parameter. Very often, your dataset will not contain row names. This parameter prevents R from adding a column of numerical identifiers to the CSV file. There are many other parameters you can use with write.csv(). Learn more about them by typing ?write.csv in the R console.

Writing data to a tab-delimited text file

There may be times when you would like to have your data read by a data application that does not import CSV files. Recall that in the Extracting data from sources section, that read.csv() had a more flexible counterpart, read.table(). The write.table() function provides you with greater flexibility on how the final file is composed:

write.table(report, "revenue_report.txt", row.names = FALSE, sep = "\t")

The write.table() function uses a syntax that is very similar to write.csv(). You see the addition of sep = "\t". This tells R to separate data with the tab character when creating the text file. There are many other parameters you can use with write.table(). Learn more about them by typing ?write.table in the R console.

Key benefits

Use this easy-to-follow guide to leverage the power of R analytics and make your business data more insightful.

This highly practical guide teaches you how to develop dashboards that help you make informed decisions using R.

Learn the A to Z of working with data for Business Intelligence with the help of this comprehensive guide.

Description

Explore the world of Business Intelligence through the eyes of an analyst working in a successful and growing company. Learn R through use cases supporting different functions within that company. This book provides data-driven and analytically focused approaches to help you answer questions in operations, marketing, and finance. In Part 1, you will learn about extracting data from different sources, cleaning that data, and exploring its structure. In Part 2, you will explore predictive models and cluster analysis for Business Intelligence and analyze financial times series. Finally, in Part 3, you will learn to communicate results with sharp visualizations and interactive, web-based dashboards. After completing the use cases, you will be able to work with business data in the R programming environment and realize how data science helps make informed decisions and develops business strategy. Along the way, you will find helpful tips about R and Business Intelligence.

What you will learn

* Extract, clean, and transform data

* Validate the quality of the data and variables in datasets

* Learn exploratory data analysis

* Build regression models

* Implement popular data-mining algorithms

* Visualize results using popular graphs

* Publish the results as a dashboard through Interactive Web Application frameworks

What do you get with Print?