PrefixSpan

The PrefixSpan algorithm uses MLlib APIs.

Model API Type	Function API
MLlib API	def run[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: JavaRDD[Sequence]): PrefixSpanModel[Item]
MLlib API	def run[Item](data: RDD[Array[Array[Item]]])(implicit arg0: ClassTag[Item]): PrefixSpanModel[Item]

MLlib API

Function
Load the sequence data as an RDD, set the minimum support level and the maximum length of a frequent sequential pattern, and call the run API. The call returns all frequent sequences that meet these conditions.

Input and output

Package name: org.apache.spark.mllib.fpm
Class name: PrefixSpan
Method name: run
Input: full sequence data (JavaRDD[Sequence]/RDD[Array[Array[Item]]])

Algorithm parameters

Algorithm Parameter	Description
MaxLocalProjDBSize	Maximum number of items allowed in a prefix-projected database before local processing
MaxPatternLength	Maximum length of a frequent sequential pattern
MinSupport	Minimum support level of a frequent sequential pattern
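MinSupport is a relative threshold. In open-source Spark MLlib it is a fraction between 0 and 1 that is turned into an absolute sequence count by rounding up. The plain-Scala sketch below mirrors that conversion. The object and helper names (MinCountSketch, minCount) are hypothetical, and it is an assumption that this library applies the same rounding rule.

```scala
object MinCountSketch {
  // Hypothetical helper: converts a relative minimum support into the
  // smallest number of sequences a pattern must occur in.
  // Assumption: same round-up rule as open-source Spark MLlib.
  def minCount(minSupport: Double, numSequences: Long): Long = {
    require(minSupport >= 0.0 && minSupport <= 1.0, "minSupport must be in [0, 1]")
    math.ceil(minSupport * numSequences).toLong
  }

  def main(args: Array[String]): Unit = {
    // With 4 input sequences and a minimum support of 0.5, a pattern
    // must occur in at least 2 sequences to be reported.
    println(minCount(0.5, 4L)) // prints 2
  }
}
```

Under this rule, lowering MinSupport lowers the count threshold and typically increases both the number of reported patterns and the run time.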

Added algorithm parameters

Parameter	Spark Configuration Parameter	Description	Type and Default Value
localTimeout	spark.boostkit.ml.ps.localTimeout	Timeout interval for local processing, in seconds.	Integer. The value must be greater than or equal to 0. The default value is 300.
filterCandidates	spark.boostkit.ml.ps.filterCandidates	Whether to filter the prefix candidate set.	Boolean. The default value is false.
projDBStep	spark.boostkit.ml.ps.projDBStep	Adjustment step of the projection data volume. This is an advanced parameter; retain the default value.	Double. The default value is 10.

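The following example sets the algorithm parameters and runs the algorithm. Here, params is assumed to hold the configured parameter values and sequences is the input RDD:

// Configure the algorithm through its setters.
val prefixSpan = new PrefixSpan()
  .setMinSupport(params.minSupport)
  .setMaxPatternLength(params.maxPatternLength)
  .setMaxLocalProjDBSize(params.maxLocalProjDBSize)
// Mine the frequent sequential patterns.
val model = prefixSpan.run(sequences)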
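The added parameters are listed with Spark configuration names rather than setters, so one way to supply them is through SparkConf before the SparkContext is created. The sketch below uses only the key names from the table above. The object name, the application name, and the chosen values are placeholders, and it is an assumption that the library reads these keys from the active Spark configuration when run is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PrefixSpanConfSketch {
  // Builds a SparkConf carrying the added PrefixSpan parameters.
  // The values shown are the documented defaults.
  def buildConf(): SparkConf = new SparkConf()
    .setAppName("PrefixSpanExample") // placeholder application name
    .set("spark.boostkit.ml.ps.localTimeout", "300")
    .set("spark.boostkit.ml.ps.filterCandidates", "false")
    .set("spark.boostkit.ml.ps.projDBStep", "10")

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(buildConf())
    // Build the input sequences and run PrefixSpan here.
    sc.stop()
  }
}
```

The same keys can also be passed on the command line with the --conf option of spark-submit, for example --conf spark.boostkit.ml.ps.localTimeout=300.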

Output: frequent sequence model (PrefixSpanModel[Item])

Sample usage

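import org.apache.spark.mllib.fpm.PrefixSpan

// sc is the SparkContext (predefined in spark-shell).
// Each sequence is an array of itemsets; each itemset is an array of items.
val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()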
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
val model = prefixSpan.run(sequences)
model.freqSequences.collect().foreach { freqSequence =>
  println(
    s"${freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]")}," +
      s" ${freqSequence.freq}")
}

Sample result

[[2]], 3
[[3]], 2
[[1]], 3
[[2, 1]], 3
[[1], [3]], 2
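The counts in the sample result can be checked by hand. The Spark-free sketch below counts support by brute force over the four sample sequences. It is not the PrefixSpan algorithm, only a check of what the reported frequencies mean. The object and function names are hypothetical, and itemsets are modeled as Set[Int] for simplicity. A pattern occurs in a sequence when each of its itemsets is a subset of a distinct itemset of the sequence, in the same order.

```scala
object SupportCheck {
  type Itemset = Set[Int]
  type Sequence = Seq[Itemset]

  // The four sequences from the sample usage.
  val sequences: Seq[Sequence] = Seq(
    Seq(Set(1, 2), Set(3)),
    Seq(Set(1), Set(3, 2), Set(1, 2)),
    Seq(Set(1, 2), Set(5)),
    Seq(Set(6))
  )

  // Greedy ordered match: walk the sequence once and consume the next
  // pattern itemset whenever the current sequence itemset contains it.
  def occursIn(sequence: Sequence, pattern: Sequence): Boolean = {
    val unmatched = sequence.foldLeft(pattern) { (rest, itemset) =>
      if (rest.nonEmpty && rest.head.subsetOf(itemset)) rest.tail else rest
    }
    unmatched.isEmpty
  }

  // Number of sequences in which the pattern occurs.
  def support(pattern: Sequence): Int =
    sequences.count(occursIn(_, pattern))

  def main(args: Array[String]): Unit = {
    println(support(Seq(Set(1), Set(3)))) // prints 2
    println(support(Seq(Set(1, 2))))      // prints 3
    println(support(Seq(Set(2), Set(3)))) // prints 1
  }
}
```

With a minimum support of 0.5 over four sequences, a pattern needs a count of at least 2, which is why the pattern [[2], [3]] (count 1) is absent from the sample result.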
