PrefixSpan

The PrefixSpan algorithm is available through the MLlib API.

Model API Type	Function API
MLlib API	def run[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: JavaRDD[Sequence]): PrefixSpanModel[Item]
MLlib API	def run[Item](data: RDD[Array[Array[Item]]])(implicit arg0: ClassTag[Item]): PrefixSpanModel[Item]

MLlib API

Function description
Given sequence data in RDD format, a minimum support level, and a maximum pattern length, the run API outputs all frequent sequential patterns that meet these conditions.

Input and output

Package name: org.apache.spark.mllib.fpm
Class name: PrefixSpan
Method name: run
Input: JavaRDD[Sequence] / RDD[Array[Array[Item]]] (full sequence data)

Parameters optimized based on native algorithms

MaxLocalProjDBSize // Maximum number of items allowed in a prefix-projected database before local processing
MaxPatternLength // Maximum length of a frequent sequential pattern
MinSupport // Minimum support of a frequent sequential pattern
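
To make the MinSupport threshold concrete, the sketch below converts a fractional minimum support into an absolute sequence count. It assumes the rule used by open source Spark MLlib, where a pattern must appear in at least ceil(minSupport × N) of the N input sequences. The helper name minCount is hypothetical and is not part of the PrefixSpan API.

```scala
// Hypothetical helper for illustration; not part of the PrefixSpan API.
// Assumption: the threshold is ceil(minSupport * number of input sequences).
object MinSupportSketch {
  def minCount(minSupport: Double, numSequences: Long): Long = {
    require(minSupport >= 0.0 && minSupport <= 1.0, "minSupport must be in [0, 1]")
    if (minSupport == 0.0) 0L else math.ceil(minSupport * numSequences).toLong
  }

  def main(args: Array[String]): Unit = {
    // With 4 input sequences and minSupport = 0.5, a pattern must occur in 2 sequences.
    println(minCount(0.5, 4L)) // prints 2
  }
}
```

Under this assumption, lowering MinSupport admits more patterns and generally increases the size of the projected databases that have to be processed.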

Newly added parameters

Parameter	Value Type	Description	Spark conf Parameter
localTimeout	Integer. The value must be greater than or equal to 0. The default value is 300.	Timeout interval for local processing, in seconds	spark.boostkit.ml.ps.localTimeout
filterCandidates	Boolean. The default value is false.	Whether to filter the prefix candidate set	spark.boostkit.ml.ps.filterCandidates
projDBStep	Double. The default value is 10.	(Advanced parameter) Adjustment step for the projected data volume. Retain the default value.	spark.boostkit.ml.ps.projDBStep

An example is provided as follows:

// params holds the caller's settings; sequences is the input RDD[Array[Array[Item]]].
val prefixSpan = new PrefixSpan()
  .setMinSupport(params.minSupport)
  .setMaxPatternLength(params.maxPatternLength)
  .setMaxLocalProjDBSize(params.maxLocalProjDBSize)
val model = prefixSpan.run(sequences)
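
The newly added parameters are listed with Spark conf keys rather than setters on PrefixSpan. The following is a minimal sketch, assuming these keys are read from the SparkConf of the running application. The application name and the values shown are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: the keys below are read from the application's SparkConf.
// The values are illustrative and match the documented defaults.
val conf = new SparkConf()
  .setAppName("PrefixSpanExample")                        // placeholder application name
  .set("spark.boostkit.ml.ps.localTimeout", "300")        // seconds; must be >= 0
  .set("spark.boostkit.ml.ps.filterCandidates", "false")  // default value
  .set("spark.boostkit.ml.ps.projDBStep", "10")           // advanced; retain the default
val sc = new SparkContext(conf)
```

The same keys can also be supplied at submission time, for example with spark-submit --conf spark.boostkit.ml.ps.localTimeout=300.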

Output: frequent sequence model (PrefixSpanModel[Item])

Example

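import org.apache.spark.mllib.fpm.PrefixSpan

// Four input sequences; each sequence is an array of itemsets.
// sc is an existing SparkContext (for example, the one provided by spark-shell).
val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()
// Keep patterns that appear in at least half of the sequences, up to 5 items long.
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
val model = prefixSpan.run(sequences)
// Print each frequent sequence followed by its frequency.
model.freqSequences.collect().foreach { freqSequence =>
  println(
    s"${freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]")}," +
      s" ${freqSequence.freq}")
}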

Result

[[2]], 3
[[3]], 2
[[1]], 3
[[2, 1]], 3
[[1], [3]], 2
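
The frequencies in the result can be cross-checked with a brute-force support count in plain Scala. This sketch is for illustration only and does not reflect how PrefixSpan works internally. The object SupportCheck and its methods are hypothetical names, and itemsets are modeled as sets so that item order within an itemset does not matter.

```scala
// Brute-force cross-check in plain Scala (no Spark required).
// Hypothetical names; not part of the PrefixSpan API.
object SupportCheck {
  type Sequence = Seq[Set[Int]]

  // True if the itemsets of `pattern` can be matched, in order, to distinct
  // itemsets of `sequence`, each pattern itemset being a subset of its match.
  def contains(sequence: Sequence, pattern: Sequence): Boolean = {
    // Greedy scan: consume the next pending pattern itemset whenever the
    // current itemset of the sequence covers it.
    val remaining = sequence.foldLeft(pattern) { (pending, itemset) =>
      if (pending.nonEmpty && pending.head.subsetOf(itemset)) pending.tail else pending
    }
    remaining.isEmpty
  }

  // Number of input sequences that contain the pattern.
  def support(data: Seq[Sequence], pattern: Sequence): Int =
    data.count(contains(_, pattern))

  // Same input data as the example above.
  val data: Seq[Sequence] = Seq(
    Seq(Set(1, 2), Set(3)),
    Seq(Set(1), Set(3, 2), Set(1, 2)),
    Seq(Set(1, 2), Set(5)),
    Seq(Set(6))
  )

  def main(args: Array[String]): Unit = {
    println(support(data, Seq(Set(2, 1))))      // pattern [[2, 1]]; prints 3
    println(support(data, Seq(Set(1), Set(3)))) // pattern [[1], [3]]; prints 2
  }
}
```

A pattern such as [[2], [3]] is contained only in the first sequence, which is below the threshold of 2, so it does not appear in the result.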
