PrefixSpan

The PrefixSpan algorithm uses MLlib APIs.

| Model API Type | Function API |
|---|---|
| MLlib API | def run[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: JavaRDD[Sequence]): PrefixSpanModel[Item] |
| MLlib API | def run[Item](data: RDD[Array[Array[Item]]])(implicit arg0: ClassTag[Item]): PrefixSpanModel[Item] |

MLlib API

  • Function

    Pass in sequence data as an RDD, set the minimum support level and the maximum length of a frequent sequential pattern, and call run to obtain all frequent sequences that meet these conditions.

  • Input and output
    1. Package name: org.apache.spark.mllib.fpm
    2. Class name: PrefixSpan
    3. Method name: run
    4. Input: full sequence data (JavaRDD[Sequence]/RDD[Array[Array[Item]]])
    5. Algorithm parameters

      | Algorithm Parameter | Description |
      |---|---|
      | MaxLocalProjDBSize | Maximum number of items allowed in a prefix-projected database before local processing |
      | MaxPatternLength | Maximum length of a frequent sequential pattern |
      | MinSupport | Minimum support level of a frequent sequential pattern |

    6. Added algorithm parameters

      | Parameter | Spark conf Parameter Name | Description | Type |
      |---|---|---|---|
      | localTimeout | spark.boostkit.ml.ps.localTimeout | Timeout interval for local processing, in seconds. | Integer. The value must be greater than or equal to 0. The default value is 300. |
      | filterCandidates | spark.boostkit.ml.ps.filterCandidates | Whether to filter the prefix candidate set. | Boolean. The default value is false. |
      | projDBStep | spark.boostkit.ml.ps.projDBStep | Adjustment step for the projected data volume. This is an advanced parameter; retain the default value. | Double. The default value is 10. |

      An example is provided as follows:

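      // `params` holds the caller's settings and `sequences` is the input RDD;
      // both are defined elsewhere in the calling application.
      val prefixSpan = new PrefixSpan()
        .setMinSupport(params.minSupport)
        .setMaxPatternLength(params.maxPatternLength)
        .setMaxLocalProjDBSize(params.maxLocalProjDBSize)
      val model = prefixSpan.run(sequences)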
      
    7. Output: frequent sequence model (PrefixSpanModel[Item])
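
The added parameters are listed with Spark configuration keys, so one plausible way to set them is on the SparkConf before the SparkContext is created. The following is a minimal sketch under that assumption; the application name is hypothetical, and the values shown are simply the documented defaults.

```scala
import org.apache.spark.SparkConf

// Sketch only: assumes the added parameters are read from the application's SparkConf.
val conf = new SparkConf()
  .setAppName("PrefixSpanExample") // hypothetical application name
  .set("spark.boostkit.ml.ps.localTimeout", "300")       // seconds, must be >= 0
  .set("spark.boostkit.ml.ps.filterCandidates", "false") // Boolean
  .set("spark.boostkit.ml.ps.projDBStep", "10")          // advanced; default retained
```

Under the same assumption, the keys could also be passed at submit time with the standard `--conf key=value` option of spark-submit.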
  • Sample usage
    import org.apache.spark.mllib.fpm.PrefixSpan

    // `sc` is an existing SparkContext, for example the one provided by spark-shell.
    // Each element is one sequence: an array of itemsets, each an array of items.
    val sequences = sc.parallelize(Seq(
      Array(Array(1, 2), Array(3)),
      Array(Array(1), Array(3, 2), Array(1, 2)),
      Array(Array(1, 2), Array(5)),
      Array(Array(6))
    ), 2).cache()
    val prefixSpan = new PrefixSpan()
      .setMinSupport(0.5)
      .setMaxPatternLength(5)
    val model = prefixSpan.run(sequences)
    // Print each frequent sequence followed by its frequency.
    model.freqSequences.collect().foreach { freqSequence =>
      println(
        s"${freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]")}," +
          s" ${freqSequence.freq}")
    }
    
  • Sample result
    [[2]], 3
    [[3]], 2
    [[1]], 3
    [[2, 1]], 3
    [[1], [3]], 2
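
The frequencies in the sample result can be cross-checked by hand: a pattern is counted once for every input sequence that contains it as an ordered subsequence of itemsets. The sketch below does this in plain Scala, without Spark, for the four sample sequences. The object and helper names (`SupportCheck`, `containsPattern`, `support`) are hypothetical and are not part of the PrefixSpan API. The rounding of the threshold (ceiling of MinSupport times the number of sequences) is an assumption based on the open-source Spark implementation.

```scala
// Plain-Scala sketch (no Spark needed); all names here are hypothetical helpers.
object SupportCheck {
  type Itemset = Set[Int]

  // True if `pattern` occurs in `sequence`: each pattern itemset must be a subset
  // of a later sequence itemset than the previous match (greedy earliest match).
  def containsPattern(sequence: Seq[Itemset], pattern: Seq[Itemset]): Boolean = {
    var remaining = sequence
    pattern.forall { p =>
      val idx = remaining.indexWhere(s => p.subsetOf(s))
      if (idx < 0) false
      else { remaining = remaining.drop(idx + 1); true }
    }
  }

  // Number of sequences in `data` that contain `pattern`.
  def support(data: Seq[Seq[Itemset]], pattern: Seq[Itemset]): Int =
    data.count(containsPattern(_, pattern))

  def main(args: Array[String]): Unit = {
    // The same four sequences as in the sample usage.
    val data = Seq(
      Seq(Set(1, 2), Set(3)),
      Seq(Set(1), Set(3, 2), Set(1, 2)),
      Seq(Set(1, 2), Set(5)),
      Seq(Set(6)))

    // Assumed threshold: ceil(MinSupport * number of sequences) = ceil(0.5 * 4).
    val minCount = math.ceil(0.5 * data.size).toLong
    println(s"minCount = $minCount")                 // minCount = 2

    println(support(data, Seq(Set(1), Set(3))))      // 2, matches [[1], [3]], 2
    println(support(data, Seq(Set(1, 2))))           // 3, matches [[2, 1]], 3
    println(support(data, Seq(Set(5))))              // 1, below minCount, so not reported
  }
}
```

With MinSupport set to 0.5 and four input sequences, only patterns found in at least two sequences are kept, which is why items 5 and 6 do not appear in the sample result.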