Chapter 7. Using Market Basket Analysis as a Recommender Engine

"It's not wise to violate the rules until you know how to observe them."

- T.S. Eliot

In this chapter, we will cover the following topics:

  • Market basket analysis using the arules package
  • Data transformation and cleaning techniques for semi-structured market basket transaction data
  • Transforming transaction objects into dataframes
  • Using cluster analysis for prediction with the flexclust package
  • Applying text mining techniques with the RTextTools and tm packages

What is market basket analysis?


If you have survived the last chapter, you will now be introduced to the world of market basket analysis (MBA). Market basket analysis (also sometimes called affinity analysis) is a predictive analytics technique used heavily in the retail industry to identify baskets of items that are purchased together. The typical use case is the supermarket shopping cart, in which a shopper purchases an assortment of items such as milk, bread, cheese, and so on, and the algorithm predicts how purchasing certain items together affects the purchase of other items. It is one of the methods retailers use to decide when to start sending you coupons and emails for things that you didn't know you needed!

One often quoted example of MBA is the relationship between diapers and beer:

"One super market chain discovered in its analysis that customers that bought diapers often bought beer as well, have put the diapers close to beer coolers...

Examining the groceries transaction file


Critical to the understanding of MBA are the concepts of support, confidence, and lift. These are the measures that evaluate the goodness of fit for a set of association rules. You will also learn some specific terms that are used in MBA, such as consequence, antecedent, and itemset.

To introduce these concepts, we will first illustrate these terms through a very simplistic example. We will use only the first 10 transactions contained in the Groceries transaction file, which is contained in the arules package:

library(arules) 

After the arules library is loaded, you can see a short description of the Groceries dataset by entering ?Groceries at the command line. The following description appears in the help window:

"The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories".

For more information...
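If you would like to reproduce the 10-transaction listing that the next section refers to, here is a minimal sketch using standard arules calls (not necessarily the book's verbatim code):

data(Groceries)              # load the built-in Groceries transactions object
inspect(Groceries[1:10])     # list the first 10 market baskets (itemsets)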

The sample market basket


Each transaction numbered 1-10 listed previously represents a basket of items purchased by a shopper. These are typically all items that are associated with a particular transaction or invoice. Each basket is enclosed within braces {}, and is referred to as an itemset. An itemset is a group of items that occur together.

Market basket algorithms construct rules in the form of:

Itemset{x1,x2,x3 ...} --> Itemset{y1,y2,y3...}. 

This notation states that buyers who have purchased the items on the left-hand side of the formula (lhs) have a propensity to purchase the items on the right-hand side (rhs). The association is stated using the --> symbol, which can be interpreted as implies.

Note

The lhs of the notation is also known as the antecedent, and the rhs is known as the consequence. If nothing appears on either the left-hand side or the right-hand side, there is no specific association rule for those items; however, it still means that those items have appeared in a basket.

Association rule algorithms


Without an association rule algorithm, you are left with the computationally very expensive task of generating all possible itemsets, and then trying to mine the data yourself in order to identify the best ones. Association rule algorithms help with filtering this.

The most popular algorithm for MBA is the apriori algorithm, which is contained within the arules package (the other popular algorithm is the eclat algorithm).

Running apriori is fairly simple. We will demonstrate this using the 10-transaction sample that we just printed.

The apriori algorithm is based upon the principle that if a particular itemset is frequent, then all of its subsets must also be frequent. The contrapositive is what makes the algorithm efficient: once a smaller itemset is found to be infrequent, every larger itemset that contains it can be discarded without ever being counted:

  • First, some housekeeping. Fix the number of printable digits to 2:
         options(digits = 2)
  • Next...
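For orientation, here is a hedged sketch of the apriori call on the small sample (the trans10 name and the threshold values are illustrative assumptions, not the book's elided settings):

# run apriori on the first 10 Groceries transactions; thresholds are illustrative
trans10 <- Groceries[1:10]
rules <- apriori(trans10, parameter = list(supp = 0.2, conf = 0.5, minlen = 2))
inspect(head(sort(rules, by = "lift")))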

Antecedents and consequences


The rules shown previously are expressed as an implication between the antecedent (left-hand side) and the consequence (right-hand side).

The first rule describes customers who buy bottled water also tending to buy tropical fruit. The third rule states that customers who buy cereals have a tendency to also buy whole milk.

Evaluating the accuracy of a rule


Three main metrics have been developed that measure the importance, or accuracy of an association rule: support, confidence, and lift.

Support

Support measures how frequently the items occur together. Imagine having a shopping cart in which there can be a very large number of combinations of items. Items that occur rarely could be excluded from the analysis. When an itemset occurs frequently, you will have more confidence in the association among its items, since the association is based on more transactions. Often your analysis will be centered on items with high support.

Calculating support

Calculating support is simple. You calculate a proportion by counting the number of transactions in which the items in the rule appear together, and then dividing by the total number of transactions:

Examples

  • We can see that for the first rule (index #63), {bottled water} and {tropical fruit} appear together in two different transactions (2 and 3), therefore...
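As a quick worked check of that definition (the numbers simply restate the example above):

# support of {bottled water} -> {tropical fruit} in the 10-transaction sample
n_together <- 2            # transactions containing both items (2 and 3)
n_total    <- 10           # total transactions in the sample
n_together / n_total       # support = 0.2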

Preparing the raw data file for analysis


Now that we have had a short introduction to the association rules algorithm, we will illustrate applying association rules to a more meaningful example.

We will be using the online retail dataset, which can be obtained from the UCI machine learning repository at:

https://archive.ics.uci.edu/ml/datasets/Online+Retail.

As described by the source, the data is:

"A transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers".

For more information about how the dataset was created, please refer to the original journal article (Daqing Chen, 2012).

Reading the transaction file

We will input the online retail data using the read.csv() function.

We can use the file.show() function to examine the input file directly. This is sometimes needed if you find that there are...
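A hedged sketch of the read step (the file name is an assumption; the UCI file is distributed as an Excel workbook, so it is assumed here to have been exported to CSV beforehand):

# read the exported online retail transaction file into a dataframe
OnlineRetail <- read.csv("OnlineRetail.csv", stringsAsFactors = FALSE)
str(OnlineRetail)    # check the columns and types that were read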

Analyzing the input file


After reading in the file, the nrow() function shows that the transaction file contains 541909 rows:

nrow(OnlineRetail)

The following output appears:

> [1] 541909 

We can use our handy View() function to peruse the contents. Alternatively, you can use the kable() function from the knitr library to display a simple table of the dataframe in the console, as shown below.

Look at the first few records. The kable() function will attempt to fit a simple table in the space provided, and will also truncate any long strings:

kable(head(OnlineRetail)) 

We can still see the last column is truncated (United Kingdom), but all of the columns fit without wrapping to the next line:

Note

Using an R Notebook with the kable() function: note that when using the rmarkdown package, or an R Notebook in RStudio, the output from the kable() function can be formatted to appear as an HTML table in the markdown file. Otherwise, it will appear as plain ASCII text. For example, you may...

Scrubbing and cleaning the data


Here comes the cleaning part!

Print some of the groceries contained within the description field of OnlineRetail:

kable(OnlineRetail$Description[1:5],col.names=c("Grocery Item Descriptions")) 
|Grocery Item Descriptions                 |  
|:-----------------------------------------| 
|WHITE HANGING HEART T-LIGHT HOLDER        | 
|WHITE METAL LANTERN                       | 
|CREAM CUPID HEARTS COAT HANGER            | 
|KNITTED UNION FLAG HOT WATER BOTTLE       | 
|RED WOOLLY HOTTIE WHITE HEART.            | 

Although each line contains a separate grocery item, the items are not in a uniform format: the number of words describing each item can vary, and some words are adjectives while others are nouns. Additionally, the retailer may deem certain words to be irrelevant to a particular marketing campaign (such as colors or sizes, which may be standard across all products). This type of data can be referred to as semi-structured data, since it incorporates certain...

Removing colors automatically


If you did not want to bother specifying colors, and you wanted to remove colors automatically, you could accomplish that as well.

The colors() function

The colors() function returns a character vector of the color names that R knows about. We can then perform a little code manipulation, in conjunction with the gsub() function that we just used, to replace all of the specified colors in OnlineRetail$Description with blanks.

We will also use the kable() function, which is contained within the knitr package, in order to produce simple HTML tables of the results:

# compute the length of the field before changes
before <- sum(nchar(OnlineRetail$Description))

# get the unique colors returned from the colors() function, and remove any
# digits found at the end of each color name
col2 <- unique(gsub("[0-9]+", "", colors(TRUE)))

# Now we will filter out any colors with a length > 7. This number is somewhat
# arbitrary but it is just done for illustration...
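A hedged sketch of how the remaining steps might look (the book's exact code is elided here; the regular expression, the toupper() conversion to match the uppercase descriptions, and the whitespace cleanup are assumptions):

# keep only the short color names, as described above
col2 <- col2[nchar(col2) <= 7]

# build one alternation pattern and blank those colors out of the descriptions
pattern <- paste0("\\b(", paste(toupper(col2), collapse = "|"), ")\\b")
OnlineRetail$Description <- gsub(pattern, "", OnlineRetail$Description)
OnlineRetail$Description <- trimws(gsub(" +", " ", OnlineRetail$Description))

# compare the field length after the changes with the 'before' value
after <- sum(nchar(OnlineRetail$Description))
before - after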

Filtering out single item transactions


Since we want baskets of items to run association rules on, we will filter out the transactions that have only one item per invoice. Those might be useful for a separate analysis of customers who purchased a single item, but they do not help with finding associations between multiple items, which is the goal of this exercise.

  • Let's use sqldf to find all of the single item transactions, and then we will create a separate dataframe consisting of the number of items per customer invoice:
        library(sqldf) 
  • First construct a query: How many distinct invoices were there? We see that there were 25900 separate invoices:
        sqldf("select count(distinct InvoiceNo) from   
        OnlineRetail") 
        > Loading required package: tcltk 
        >   count(distinct InvoiceNo)
        > 1                     25900 
  • How many invoices contain only a single item? First, extract the single-item invoices:
        single...
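A hedged sketch of the elided queries (the names singles and x2 are assumptions; x2 is chosen to match the merge() call in the next section, and it keeps only multi-item invoices, which is consistent with the lower invoice count reported after the merge):

# invoices that contain only a single item
singles <- sqldf("select InvoiceNo, count(*) as itemcount
                  from OnlineRetail
                  group by InvoiceNo
                  having count(*) = 1")
nrow(singles)

# item counts for the multi-item invoices we want to keep
x2 <- sqldf("select InvoiceNo, count(*) as itemcount
             from OnlineRetail
             group by InvoiceNo
             having count(*) > 1")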

Merging the results back into the original data


We will want to retain the total number of items for each invoice on the original data frame. That will involve joining the number of items contained in each invoice back to the original transactions, using the merge() function and specifying InvoiceNo as the key.

If you count the number of distinct invoices before and after the merge, you can see that the invoice count is lower than prior to the merge:

#first take a 'before' snapshot 
 
nrow(OnlineRetail) 
> [1] 541909 
 
#count the number of distinct invoices 
 
sqldf("select count(distinct InvoiceNo) from OnlineRetail")  

The output shows a total of 25900 distinct invoices:

>   count(distinct InvoiceNo) 
> 1                     25900  

Now merge the counts back with the original data:

OnlineRetail <- merge(OnlineRetail, x2, by = "InvoiceNo") 

Check the new number of rows and the new count of distinct invoices (20059 versus 25900), and compare these counts to the originals. The reduction...
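A brief sketch of that check (the queries mirror the 'before' snapshot above):

# 'after' snapshot: row count and distinct invoices following the merge
nrow(OnlineRetail)
sqldf("select count(distinct InvoiceNo) from OnlineRetail")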

Compressing descriptions using camelcase


For long descriptions, sometimes it is beneficial to compress them into camelcase to improve readability. This is especially valuable when viewing descriptions that are labels on x or y axes.

Camelcase is a method that some programmers use for writing compound words, in which each word begins with a capital letter and the spaces between words are removed. It is also a way of conserving space.

To accomplish this, we can write a small function called .simpleCap. To illustrate how it works, we will pass it a two-element character vector, c("A certain good book","A very easy book"), and observe the results.

Custom function to map to camelcase

This is a simple example that maps the two-element character vector c("A certain good book", "A very easy book") to camelcase. The vector is mapped to two new elements:

[1] "ACertainGoodBook", and  [2] "AVeryEasyBook" 
 
# change descriptions to camelcase maybe append to itemnumber for uniqueness...
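The body of .simpleCap is elided in this excerpt; a minimal sketch that reproduces the mapping shown above (the implementation details are assumptions) is:

# capitalize the first letter of each word, then drop the spaces (camelcase)
.simpleCap <- function(x) {
  words <- strsplit(tolower(x), " ")
  sapply(words, function(w) {
    paste0(toupper(substring(w, 1, 1)), substring(w, 2), collapse = "")
  })
}

.simpleCap(c("A certain good book", "A very easy book"))
# [1] "ACertainGoodBook" "AVeryEasyBook"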

Creating the test and training datasets


Now that we are finished with our transformations, we will create the training and test data frames. We will perform a 50/50 split between training and test:

# Take a sample of full vector
nrow(OnlineRetail) 
> [1] 536068 
pctx <- round(0.5 * nrow(OnlineRetail))
set.seed(1)

# randomize rows

df <- OnlineRetail[sample(nrow(OnlineRetail)), ]
rows <- nrow(df)
OnlineRetail <- df[1:pctx, ]  #training set
OnlineRetail.test <- df[(pctx + 1):rows, ]  #test set
rm(df)

# Display the number of rows in the training and test datasets.

nrow(OnlineRetail) 
> [1] 268034 
nrow(OnlineRetail.test) 
> [1] 268034 

Saving the results

It is a good idea to periodically save your data frames, so that you can pick up your analysis from various checkpoints.

In this example, I will first sort them both by InvoiceNo, and then save the test and train data sets to disk, where I can always load them back into memory as needed:

 

setwd("C:/PracticalPredictiveAnalytics...

Creating the market basket transaction file


We are almost there! There is an extra step that we need to do in order to prepare our data for market basket analysis.

The association rules package requires that the data be in transaction format. Transactions can be specified in one of two formats:

  1. One line per transaction, with an identifier; this shows the entire basket on one line, just as we saw with the Groceries data.
  2. One item per line, with an identifier on each line.

Additionally, you can create the actual transaction file in two different ways, by either:

  1. Physically writing a transactions file.
  2. Coercing a dataframe to transaction format.

For smaller amounts of data, coercing the dataframe to a transaction object is simpler, but for large transaction files, writing the transaction file first is preferable, since such files can be appended to incrementally as they are fed from large operational transaction systems. We will illustrate both ways.

Method one: Coercing a dataframe to a transaction file

Now we are ready to coerce...
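A hedged sketch of the coercion (the column choices are assumptions; the split()/as() idiom is the same one used later for the cluster-based transactions):

library(arules)
# one list element per invoice, containing that invoice's item descriptions
trans <- as(split(OnlineRetail$Description, OnlineRetail$InvoiceNo), "transactions")
summary(trans)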

Method two: Creating a physical transactions file


Now that you know how to run association rules using the dataframe coercion method, we will illustrate the write-to-file method:

  • In the write-to-file method, each item is written to a separate line, along with the identifying key, which in our case is the InvoiceNo

  • The advantage of the write-to-file method is that very large data files can be accumulated separately, and then combined if needed

  • You can use the file.show() function to display the contents of the file that will be input to the association rules algorithm:

setwd("C:/PracticalPredictiveAnalytics/Data")
load("OnlineRetail.full.Rda")
OnlineRetail <- OnlineRetail[1:100,]
nrow(OnlineRetail)
> [1] 268034 
head(OnlineRetail) 
> InvoiceNo StockCode  Description                Quantity
 > 5   6365     71053  METAL LANTERN                     6
 > 6   536365   21730  GLASS STAR FROSTED T-LIGHT HOLDER 6
 > 2   536365   22752  SET 7 BABUSHKA NESTING BOXES      2...
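A hedged sketch of the two halves of this method (the file name, separator, and write.table()/read.transactions() options are assumptions; descriptions are assumed not to contain the separator):

# write one item per line together with its invoice id
write.table(OnlineRetail[, c("InvoiceNo", "Description")],
            file = "retail_single.txt", sep = ";",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# read it back as a transactions object in "single" format
trans2 <- read.transactions("retail_single.txt", format = "single",
                            sep = ";", cols = c(1, 2))
summary(trans2)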

Converting to a document term matrix


Once we have a corpus, we can proceed to convert it to a document term matrix (DTM). When building the DTM, care must be given to limiting the amount of data and the number of resulting terms that are processed. If not parameterized correctly, it can take a very long time to run. Parameterization is accomplished via the control options. We will remove any stopwords, punctuation, and numbers. Additionally, we will only include words that are at least four characters long:
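The corpus object corp used in the next code block is created in a step not shown in this excerpt; a minimal sketch (assuming the camelcased Desc2 column is the text source) is:

library(tm)
# build a corpus with one document per cleaned item description
corp <- VCorpus(VectorSource(OnlineRetail$Desc2))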

library(tm)
dtm <- DocumentTermMatrix(corp,
         control = list(removePunctuation = TRUE,
                        wordLengths = c(4, 999),
                        stopwords = TRUE,
                        removeNumbers = TRUE,
                        stemming = FALSE,
                        bounds = list(global = c(5, Inf))))

We can begin to look at the data by using the inspect() function.

This is different from the inspect() function in the arules package; if you have the arules package loaded, you will want to qualify the call as tm::inspect():

inspect(dtm[1:10, 1:10]) 
> <<DocumentTermMatrix (documents: 10, terms: 10)>>
>...

K-means clustering of terms


Now we can cluster the term document matrix using k-means. For illustration purposes, we will specify that five clusters be generated:
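The dtms object passed to kmeans() below is not created in this excerpt; a common construction (an assumption here) removes very sparse terms from dtm and converts the result to an ordinary matrix:

# drop terms that are absent from almost all documents, then densify
dtms <- as.matrix(removeSparseTerms(dtm, sparse = 0.995))
set.seed(1)    # k-means depends on random starting centroids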

kmeans5 <- kmeans(dtms, 5)

Once k-means is done, we will append the cluster number to the original data, and then create five subsets based upon the cluster:

kw_with_cluster <- as.data.frame(cbind(OnlineRetail, Cluster = kmeans5$cluster))

 # subset the five clusters
 cluster1 <- subset(kw_with_cluster, subset = Cluster == 1)
 cluster2 <- subset(kw_with_cluster, subset = Cluster == 2)
 cluster3 <- subset(kw_with_cluster, subset = Cluster == 3)
 cluster4 <- subset(kw_with_cluster, subset = Cluster == 4)
 cluster5 <- subset(kw_with_cluster, subset = Cluster == 5)

Examining cluster 1

Print out a sample of the data:

> head(cluster1[10:13])
                            Desc2 lastword firstword Cluster
50   VintageBillboardLove/hateMug      MUG   VINTAGE       1
86              BagVintagePaisley  PAISLEY       BAG       1
113         ShopperVintagePaisley  PAISLEY   SHOPPER       1
145         ShopperVintagePaisley...

Predicting cluster assignments


The goal in this exercise is to score the test dataset, by assigning clusters based upon the predict method for the training dataset.

Using flexclust to predict cluster assignment

The standard kmeans function does not have a prediction method. However, we can use the flexclust package, which does. Since the prediction method can take a long time to run, we will illustrate it only on a sample of rows and columns. In order to compare the test and training results, they also need to have the same number of columns. For illustration purposes, we will set the number of columns at 10.

To begin, take a sample from the OnlineRetail training data:

set.seed(1)
 sample.size <- 10000
 max.cols <- 10

library("flexclust") OnlineRetail <- OnlineRetail[1:sample.size, ]

Next, create the document term matrix from the description column in the sampled dataset. We will use the create_matrix function from the RTextTools package, which can create a TDM first without having a separate...
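The remaining steps are elided in this excerpt; a hedged sketch of the overall flow (the create_matrix() options, the kcca() family, and the simple take-the-first-10-columns trick are assumptions chosen to match the description above):

library(RTextTools)

# document-term matrix for the sampled training descriptions
dtm.train <- create_matrix(OnlineRetail$Desc2, removeStopwords = TRUE,
                           removeNumbers = TRUE)
m.train <- as.matrix(dtm.train)[, 1:max.cols]

# flexclust's kcca() fits a k-means style partition that supports predict()
kcca5 <- kcca(m.train, k = 5, family = kccaFamily("kmeans"))

# score the test sample: build a matrix with the same number of columns,
# then assign each row to the nearest training centroid
dtm.test <- create_matrix(OnlineRetail.test$Desc2[1:sample.size],
                          removeStopwords = TRUE, removeNumbers = TRUE)
m.test <- as.matrix(dtm.test)[, 1:max.cols]
pred.clusters <- predict(kcca5, newdata = m.test)
table(pred.clusters)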

Running the apriori algorithm on the clusters


Circling back to the apriori algorithm, we can use the predicted clusters that were generated instead of lastword, in order to develop some rules:

  • We will use the dataframe coercion method to generate the transactions object, as shown previously

  • Create a rules_clust object, which builds association rules based upon the itemset of clusters {1,2,3,4,5}

  • Inspect some of the generated rules by lift:

        library(arules)
        colnames(kw_with_cluster2_score)
        kable(head(kw_with_cluster2_score[, c(1, 13)], 5))

        tmp <- data.frame(kw_with_cluster2_score[, 1],
                          kw_with_cluster2_score[, 13])
        names(tmp)[1] <- "TransactionID"
        names(tmp)[2] <- "Items"
        tmp <- unique(tmp)

        trans4 <- as(split(tmp[, 2], tmp[, 1]), "transactions")
        rules_clust <- apriori(trans4,
                               parameter = list(minlen = 2, support = 0.02,
                                                confidence = 0.01))
        summary(rules_clust)
        tmp <...
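A brief sketch of the inspection step mentioned in the last bullet above:

# show the strongest cluster-based rules, ordered by lift
inspect(head(sort(rules_clust, by = "lift"), 5))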

Summarizing the metrics


Running a summary on the rules_clust object indicates an average support of 0.05, and average confidence of 0.43.

This demonstrates that using clustering can be a viable way to develop association rules, and reduce resources and the number of dimensions at the same time:

    support          confidence           lift     
 Min.   :0.02044   Min.   :0.09985   Min.   :0.989 
 1st Qu.:0.02664   1st Qu.:0.19816   1st Qu.:1.006 
 Median :0.03066   Median :0.27143   Median :1.526 
 Mean   :0.05040   Mean   :0.43040   Mean   :1.608 
 3rd Qu.:0.04234   3rd Qu.:0.81954   3rd Qu.:1.891 
 Max.   :0.17080   Max.   :1.00000   Max.   :3.022 

References


  • Daqing Chen, S. L. (2012). Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3.
  • Michael Hahsler, K. H. (2006). Implications of probabilistic data modeling for mining association rules. In M. Spiliopoulou et al. (Eds.), From Data and Information Analysis to Knowledge Engineering, Studies in Classification, Data Analysis, and Knowledge Organization (pp. 598-605). Springer-Verlag.

Summary


In this chapter, we learned about a specific type of recommender engine, under the umbrella term market basket analysis.

We saw that market basket analysis enabled you to mine large quantities of transactions containing semi-structured data to derive association rules among the itemsets contained in each basket.

Some additional data cleaning techniques were used on the market basket data, in order to standardize and consolidate some of the descriptions of the purchased items. We also learned how to isolate the most powerful rules, using plotting techniques, along with metrics such as lift, support, and confidence.

Finally, we showed you how to generate clusters from your market basket training data, and how to predict cluster assignments for a test dataset.
