You're reading from R for Data Science Cookbook, 1st Edition (Packt Publishing, July 2016, ISBN-13: 9781784390815).
Author: Yu-Wei Chiu (David Chiu)

Yu-Wei Chiu (David Chiu) is the founder of LargitData (www.LargitData.com), a startup that focuses on big data and machine learning products. He previously worked at Trend Micro as a software engineer, where he was responsible for building big data platforms for business intelligence and customer relationship management systems. In addition to being a startup entrepreneur and data scientist, he specializes in using Spark and Hadoop to process big data and in applying data mining techniques for data analysis. Yu-Wei is also a professional lecturer who has delivered lectures on big data and machine learning in R and Python and given tech talks at a variety of conferences. In 2015, Yu-Wei wrote Machine Learning with R Cookbook (Packt Publishing), and in 2013 he reviewed Bioinformatics with R Cookbook (Packt Publishing). For more information, please visit his personal website at www.ywchiu.com.

Acknowledgement

I have immense gratitude for my family and friends for supporting and encouraging me to complete this book. I would like to sincerely thank my mother, Ming-Yang Huang (Miranda Huang); my mentor, Man-Kwan Shan; the proofreader of this book, Brendan Fisher; the members of LargitData; the Data Science Program (DSP); and other friends who have offered their support.

Chapter 9. Rule and Pattern Mining with R

This chapter covers the following topics:

  • Transforming data into transactions

  • Displaying transactions and associations

  • Mining associations with the Apriori rule

  • Pruning redundant rules

  • Visualizing association rules

  • Mining frequent itemsets with Eclat

  • Creating transactions with temporal information

  • Mining frequent sequential patterns with cSPADE

Introduction


Many readers will be familiar with the story of Wal-Mart placing beer next to diapers in its stores because it found that purchases of the two products were highly correlated. This is a classic example of what data mining is about: it can help us discover how items are associated within a transaction dataset. With this skill, a business can explore the relationships between items and sell correlated items together to increase sales.

As an alternative to identifying correlated items with association mining, another popular application of data mining is to discover frequent sequential patterns from transaction datasets that carry temporal information. This has a number of applications, including predicting customers' shopping sequences and analyzing web clickstreams and biological sequences.

The recipes in this chapter cover creating and inspecting transaction datasets, performing association analysis with the Apriori algorithm, visualizing associations in various graph formats, and finding...

Transforming data into transactions


Before using any rule mining algorithm, we need to transform data from the data frame format into transactions. In this example, we demonstrate how to transform a purchase order dataset into transactions with the arules package.
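As a minimal, self-contained illustration of the conversion (toy data with invented user and product names, not the chapter's dataset), a data frame of user and product pairs can be split by user and coerced to transactions:

```r
# Toy sketch: convert user-product pairs to transactions.
# The names (u1, P01, ...) are invented for illustration only.
library(arules)

df <- data.frame(User    = c("u1", "u1", "u2", "u3"),
                 Product = c("P01", "P02", "P02", "P03"),
                 stringsAsFactors = FALSE)

# One transaction per user, containing the products that user bought
trans_toy <- as(split(df$Product, df$User), "transactions")
trans_toy   # transactions in sparse format with 3 transactions and 3 items
```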

Getting ready

Download the product_by_user.RData dataset from the https://github.com/ywchiu/rcookbook/raw/master/chapter9/product_by_user.RData GitHub link.

How to do it…

Perform the following steps to create transactions:

  1. First, install and load the arules package:

    > install.packages("arules")
    > library(arules)
    
  2. Use the load function to load purchase orders by user into an R session:

    > load("product_by_user.RData")
    
  3. Last, convert the data.table (or data.frame) into transactions with the as function:

    > trans = as(product_by_user$Product, "transactions")
    > trans
    transactions in sparse format with
     32539 transactions (rows) and
     20054 items (columns)
    

How it works…

Before mining a frequent item set or association rule, it is...

Displaying transactions and associations


The arules package uses its transactions class to store transaction data. As such, we must use the generic function provided by arules to display transactions and association rules. In this recipe, we illustrate how to plot transactions and association rules with various functions in the arules package.

Getting ready

Ensure you have completed the previous recipe by generating transactions and storing these in a variable named trans.

How to do it…

Perform the following steps to display transactions and associations:

  1. First, obtain a LIST representation of the transaction data:

    > head(LIST(trans),3)
    $'00001'
    [1] "P0014520085"
    
    $'00002'
    [1] "P0018800250"
    
    $'00003'
    [1] "P0003926850034" "P0013344760004" "P0013834251"    "P0014251480003"
    
  2. Next, use the summary function to show a summary of the statistics and details of the transactions:

    > summary(trans)
    transactions as itemMatrix in sparse format with
     32539 rows (elements/itemsets/transactions) and
     20054...
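Beyond LIST() and summary(), item frequencies are often worth inspecting. Here is a sketch using the Groceries sample dataset that ships with arules (the trans variable from the previous recipe would work the same way):

```r
library(arules)
data(Groceries)   # sample transaction data bundled with arules

# Support (relative frequency) of the five most common items
head(sort(itemFrequency(Groceries), decreasing = TRUE), 5)

# A bar chart of the top ten items (opens a plot device)
# itemFrequencyPlot(Groceries, topN = 10)
```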

Mining associations with the Apriori rule


Association mining is a technique that can discover interesting relationships hidden in a transaction dataset. This approach first finds all frequent itemsets and generates strong association rules from frequent itemsets. In this recipe, we will introduce how to perform association analysis using the Apriori rule.
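To make the idea concrete before applying it to our transactions, here is a self-contained toy run (the basket contents are invented for illustration):

```r
library(arules)

# Four invented baskets echoing the beer-and-diapers story
baskets <- list(c("beer", "diapers"),
                c("beer", "diapers", "chips"),
                c("chips", "soda"),
                c("beer", "diapers", "soda"))
toy_trans <- as(baskets, "transactions")

# Keep rules seen in at least half the baskets with confidence >= 0.8
toy_rules <- apriori(toy_trans,
                     parameter = list(supp = 0.5, conf = 0.8, target = "rules"))
inspect(toy_rules)   # e.g. {diapers} => {beer}, support 0.75, confidence 1
```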

Getting ready

Ensure you have completed the previous recipe by generating transactions and storing these in a variable, trans.

How to do it…

Please perform the following steps to analyze association rules:

  1. Use apriori to discover rules with support over 0.001 and confidence over 0.1:

    > rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.1, target= "rules"))
    > summary(rules)
    set of 6 rules
     
     rule length distribution (lhs + rhs):sizes
     2 
     6 
     
        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
           2       2       2       2       2       2 
     
     summary of quality measures:
         support           confidence          lift...

Pruning redundant rules


Among generated rules, we sometimes find repeated or redundant rules (for instance, one rule is the super rule of another rule). In this recipe, we will show how to prune (or remove) repeated or redundant rules.

Getting ready

Ensure you have completed the previous recipe by generating rules and storing them in a variable named rules.

How to do it…

Perform the following steps to prune redundant rules:

  1. First, you need to identify the redundant rules:

    > rules.sorted = sort(rules, by="lift")
    > subset.matrix = is.subset(rules.sorted, rules.sorted)
    > subset.matrix[lower.tri(subset.matrix, diag=T)] = NA
    > redundant = colSums(subset.matrix, na.rm=T) >= 1
    
  2. You can then remove the redundant rules:

    > rules.pruned = rules.sorted[!redundant]
    > inspect(rules.pruned)
      lhs                 rhs              support     confidence lift    
    1 {P0014252070}    => {P0014252066}    0.001321491 0.2704403  27.32874
    5 {P0014252055}    => {P0014252066...
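Assuming a reasonably recent arules version, a related and simpler pruning is available through the built-in is.redundant() helper instead of the manual is.subset() matrix; a self-contained sketch on the bundled Groceries data:

```r
library(arules)
data(Groceries)   # sample transaction data bundled with arules

g_rules  <- apriori(Groceries,
                    parameter = list(supp = 0.01, conf = 0.4))
g_sorted <- sort(g_rules, by = "lift")

# is.redundant() flags a rule when a more general rule (same RHS,
# subset of its LHS) already has at least the same confidence
g_pruned <- g_sorted[!is.redundant(g_sorted)]
c(before = length(g_sorted), after = length(g_pruned))
```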

Visualizing association rules


To explore the relationship between items, one can visualize the association rules. In the following recipe, we introduce how to use the arulesViz package to visualize association rules.

Getting ready

Ensure you have completed the previous recipe by generating pruned rules and storing them in a variable named rules.pruned.

How to do it…

Please perform the following steps to visualize association rules:

  1. First, install and load the arulesViz package:

    > install.packages("arulesViz")
    > library(arulesViz)
    
  2. You can then make a scatterplot from the pruned rules:

    > plot(rules.pruned)
    

    Figure 3: The scatterplot of pruned rules

  3. We can also present the rules in a grouped matrix:

    > plot(rules.pruned,method="grouped")
    

    Figure 4: The grouped matrix for three rules

  4. Alternatively, we can use a graph to present the rules:

    > plot(rules.pruned,method="graph")
    

    Figure 5: The graph for three rules

How it works…

As an alternative to presenting association rules as...

Mining frequent itemsets with Eclat


As the Apriori algorithm performs a breadth-first search to scan the complete database, support counting is rather time-consuming. Alternatively, if the database fits into memory, one can use the Eclat algorithm, which performs a depth-first search to count supports. The Eclat algorithm, therefore, runs much more quickly than the Apriori algorithm. In this recipe, we introduce how to use the Eclat algorithm to generate a frequent itemset.
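As a self-contained taste of the function before the recipe (using the Groceries sample data bundled with arules rather than our trans variable):

```r
library(arules)
data(Groceries)

# Depth-first search for itemsets appearing in at least 5% of transactions
freq_sets <- eclat(Groceries,
                   parameter = list(support = 0.05, maxlen = 5))

# The three most frequent itemsets
inspect(head(sort(freq_sets, by = "support"), 3))
```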

Getting ready

Ensure you have completed the first recipe of this chapter by generating transactions and storing them in a variable named trans.

How to do it…

Please perform the following steps to generate a frequent itemset using the Eclat algorithm:

  1. Similar to the Apriori method, we can use the eclat function to generate a frequent itemset:

    > frequentsets = eclat(trans, parameter = list(support = 0.01, maxlen = 10))
    
  2. We can then obtain the summary information from the generated frequent itemset:

    > summary(frequentsets)
    set of...

Creating transactions with temporal information


In addition to mining interesting associations within the transaction database, we can mine interesting sequential patterns using transactions with temporal information. In the following recipe, we demonstrate how to create transactions with temporal information from a web traffic dataset.

Getting ready

Download the traffic.RData dataset from the https://github.com/ywchiu/rcookbook/raw/master/chapter9/traffic.RData GitHub link.

We can then generate transactions from the loaded dataset for frequent sequential pattern mining.

How to do it…

Perform the following steps to create transactions with temporal information:

  1. First, install and load the arulesSequences package:

    > install.packages("arulesSequences")
    > library(arulesSequences)
    
  2. Load web traffic data into an R session:

    > load('traffic.RData')
    
  3. Create the transaction data with temporal information:

    > traffic_data <- data.frame(item = traffic$Page)
    > traffic.tran <- as(traffic_data...
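Since the listing above is cut off, here is a self-contained sketch of the general pattern: attach a sequenceID (who) and an eventID (when, in order) to each transaction via transactionInfo(). The toy page names are invented.

```r
library(arulesSequences)

# Toy click log: two visitors, each viewing two pages in order
log_df <- data.frame(item = factor(c("/", "/about", "/", "/products")))

# Each row becomes one single-item transaction
log_tran <- as(log_df, "transactions")

# Temporal information required by cSPADE:
# sequenceID groups events by visitor, eventID orders them in time
transactionInfo(log_tran)$sequenceID <- c(1, 1, 2, 2)
transactionInfo(log_tran)$eventID    <- c(1, 2, 1, 2)
log_tran
```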

Mining frequent sequential patterns with cSPADE


One of the best-known frequent sequential pattern mining algorithms is SPADE (Sequential PAttern Discovery using Equivalence classes), which exploits the vertical database format to intersect ID-lists with an efficient lattice search. Its constrained variant, cSPADE, additionally allows us to place constraints on the mined sequences. In this recipe, we will demonstrate how to use cSPADE to mine frequent sequential patterns.
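Before mining our own traffic data, the function can be tried on the small zaki example sequence database that ships with arulesSequences:

```r
library(arulesSequences)
data(zaki)   # tiny sequence database bundled with arulesSequences

# Frequent sequential patterns present in at least 40% of the sequences
seq_patterns <- cspade(zaki, parameter = list(support = 0.4))
inspect(seq_patterns)
summary(seq_patterns)
```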

Getting ready

Ensure you have completed the previous recipe by generating transactions with temporal information and storing them in a variable named traffic.tran.

How to do it…

Please perform the following steps to mine frequent sequential patterns:

  1. First, use the cspade function to generate frequent sequential patterns:

    > frequent_pattern <- cspade(traffic.tran, parameter = list(support = 0.50))
    > inspect(frequent_pattern)
        items                           support 
      1 <{item=/}>                  1.00...