Note that, structurally, this is the same definition as for plain unordered patterns, but the < symbol means something else, that is, a subsequence. As before, we drop the database subscript in the notation of
support if the information is clear from the context. Equipped with a notion of
support, the definition of sequential patterns follows the previous definition completely analogously. Given a minimum support threshold
t, a sequence
s in S is said to be a
sequential pattern if supp
(s) is greater than or equal to
t. The formalization of the third question is called the
sequential pattern mining problem, that is, find the full set of sequences that are sequential patterns in S for a given threshold t.
Even in our little example with just four sequences, it can already be challenging to manually inspect all the sequential patterns. To give just one example of a sequential pattern of
support 1.0, a subsequence of length 2 of all the four sequences is
<ac>. Finding all the sequential patterns is an interesting problem, and we will learn about the so-called
prefix span algorithm that Spark employs to address the problem in the following section.
Next time, in part 2 of the tutorial, we will see how to use Spark to solve the above three pattern mining problems using the algorithms introduced.
If you enjoyed this tutorial, an excerpt from the book Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava, check out the book for more.