Having already spent a lot of time on FP-growth itself, instead of going into the implementation details of PFP in Spark, let's see how to use the actual algorithm on the toy example that we have used throughout:
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// The five toy transactions used throughout the chapter.
val transactions: RDD[Array[String]] = sc.parallelize(Array(
  Array("a", "c", "d", "f", "g", "i", "m", "p"),
  Array("a", "b", "c", "f", "l", "m", "o"),
  Array("b", "f", "h", "j", "o"),
  Array("b", "c", "k", "s", "p"),
  Array("a", "c", "e", "f", "l", "m", "n", "p")
))

// Configure FP-growth: a pattern must occur in at least 60% of all
// transactions, and the work is spread across five partitions.
val fpGrowth = new FPGrowth()
  .setMinSupport(0.6)
  .setNumPartitions(5)
val model = fpGrowth.run(transactions)

// Print each frequent item set together with its frequency.
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
The code is straightforward. We load the data into transactions and initialize Spark's FPGrowth implementation with a minimum support value of 0.6 and 5 partitions. With five transactions, a minimum support of 0.6 means that a pattern must occur in at least three of them. Running the algorithm on the transactions constructed earlier returns a model, and calling freqItemsets on it gives us access to the patterns, or frequent item sets, for the specified minimum support. Printed in a formatted way, this yields the following output of 18 patterns in total (the order may vary between runs):
[c], 4
[f], 4
[a], 3
[b], 3
[m], 3
[p], 3
[a,c], 3
[a,f], 3
[a,m], 3
[c,f], 3
[c,m], 3
[c,p], 3
[f,m], 3
[a,c,f], 3
[a,c,m], 3
[a,f,m], 3
[c,f,m], 3
[a,c,f,m], 3
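While not part of the original walkthrough, it is worth noting that the returned FPGrowthModel also exposes generateAssociationRules, which derives association rules from the frequent item sets. Here is a minimal sketch, assuming a minimum confidence of 0.8:
// A minimal sketch: derive association rules from the model's frequent
// item sets, keeping only rules with confidence of at least 0.8.
val minConfidence = 0.8
model.generateAssociationRules(minConfidence).collect().foreach { rule =>
  println(
    rule.antecedent.mkString("[", ",", "]") + " => " +
    rule.consequent.mkString("[", ",", "]") + ", " + rule.confidence)
}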
[box type="info" align="" class="" width=""]Recall that we have defined transactions as sets, which is why we often call them item sets. This means that within such an item set, a particular item can only occur once, and FPGrowth depends on this property. If we were to replace, for instance, the third transaction in the preceding example with Array("b", "b", "h", "j", "o"), calling run on these transactions would throw an exception. We will see later on how to deal with such situations.[/box]
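Although the book covers this in more detail later, one simple safeguard, sketched here purely as an illustration, is to deduplicate the items in each transaction before handing the data to FPGrowth:
// A minimal sketch: remove duplicate items within each transaction
// so that FPGrowth's uniqueness requirement is satisfied.
val deduplicated: RDD[Array[String]] = transactions.map(_.distinct)
val dedupModel = fpGrowth.run(deduplicated)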
The above is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla, and Michal Malohlava. To learn how to fully implement and deploy pattern mining applications in Spark, among other machine learning tasks, check out the book.
