Having already spent a lot of time on FP-growth itself, instead of going into the implementation details of PFP in Spark, let's see how to use the actual algorithm on the toy example that we have used throughout:
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// The five toy transactions used throughout the chapter.
val transactions: RDD[Array[String]] = sc.parallelize(Array(
  Array("a", "c", "d", "f", "g", "i", "m", "p"),
  Array("a", "b", "c", "f", "l", "m", "o"),
  Array("b", "f", "h", "j", "o"),
  Array("b", "c", "k", "s", "p"),
  Array("a", "c", "e", "f", "l", "m", "n", "p")
))

// Configure FP-growth: a pattern must occur in at least 60% of all
// transactions, and the work is spread across five partitions.
val fpGrowth = new FPGrowth()
  .setMinSupport(0.6)
  .setNumPartitions(5)
val model = fpGrowth.run(transactions)

// Print each frequent item set together with its frequency.
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
The code is straightforward. We load the data into transactions and initialize Spark's FPGrowth implementation with a minimum support value of 0.6 and 5 partitions. With five transactions, a minimum support of 0.6 means that a pattern must occur in at least three of them. Running the algorithm on the transactions constructed earlier returns a model, and calling freqItemsets on it gives us access to the patterns, or frequent item sets, for the specified minimum support. Printed in a formatted way, this yields the following output of 18 patterns in total (the order may vary between runs):
[c], 4
[f], 4
[a], 3
[b], 3
[m], 3
[p], 3
[a,c], 3
[a,f], 3
[a,m], 3
[c,f], 3
[c,m], 3
[c,p], 3
[f,m], 3
[a,c,f], 3
[a,c,m], 3
[a,f,m], 3
[c,f,m], 3
[a,c,f,m], 3
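While not part of the original walkthrough, it is worth noting that the returned FPGrowthModel also exposes generateAssociationRules, which derives association rules from the frequent item sets. Here is a minimal sketch, assuming a minimum confidence of 0.8:
// A minimal sketch: derive association rules from the model's frequent
// item sets, keeping only rules with confidence of at least 0.8.
val minConfidence = 0.8
model.generateAssociationRules(minConfidence).collect().foreach { rule =>
  println(
    rule.antecedent.mkString("[", ",", "]") + " => " +
    rule.consequent.mkString("[", ",", "]") + ", " + rule.confidence)
}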
[box type="info" align="" class="" width=""]Recall that we have defined transactions as sets, which is why we often call them item sets. This means that within such an item set, a particular item can only occur once, and FPGrowth depends on this property. If we were to replace, for instance, the third transaction in the preceding example with Array("b", "b", "h", "j", "o"), calling run on these transactions would throw an exception. We will see later on how to deal with such situations.[/box]
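Although the book covers this in more detail later, one simple safeguard, sketched here purely as an illustration, is to deduplicate the items in each transaction before handing the data to FPGrowth:
// A minimal sketch: remove duplicate items within each transaction
// so that FPGrowth's uniqueness requirement is satisfied.
val deduplicated: RDD[Array[String]] = transactions.map(_.distinct)
val dedupModel = fpGrowth.run(deduplicated)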
The above is an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla, and Michal Malohlava. To learn how to fully implement and deploy pattern mining applications in Spark, among other machine learning tasks, check out the book.
