PrefixSpan

The PrefixSpan algorithm is exposed through the MLlib API.

| Model API Type | Function API |
|---|---|
| MLlib API | def run[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: JavaRDD[Sequence]): PrefixSpanModel[Item] |
| | def run[Item](data: RDD[Array[Array[Item]]])(implicit arg0: ClassTag[Item]): PrefixSpanModel[Item] |

MLlib API

  • Function description

    Given sequence data in RDD format, a minimum support level, and a maximum length for frequent sequential patterns, the run API outputs all frequent sequences that meet these conditions.
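
To make the minimum support level concrete: stock Spark MLlib turns the fractional minSupport into an absolute count by rounding up minSupport multiplied by the number of input sequences. The sketch below assumes this implementation applies the same rule. `minCount` and `MinSupportSketch` are hypothetical names for illustration, not part of the PrefixSpan API.

```scala
object MinSupportSketch {
  // Hypothetical helper: the smallest number of sequences a pattern must
  // occur in to count as frequent, using the rounding of stock Spark MLlib.
  def minCount(minSupport: Double, numSequences: Long): Long =
    if (minSupport == 0.0) 0L else math.ceil(minSupport * numSequences).toLong

  def main(args: Array[String]): Unit = {
    // With 4 input sequences and minSupport = 0.5, a pattern must occur
    // in at least 2 sequences.
    println(minCount(0.5, 4L))
  }
}
```

Under this assumption, a pattern that occurs in only one of four sequences is dropped when minSupport is 0.5.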

  • Input and output
    1. Package name: org.apache.spark.mllib.fpm
    2. Class name: PrefixSpan
    3. Method name: run
    4. Input: JavaRDD[Sequence] / RDD[Array[Array[Item]]] (full sequence data)
    5. Parameters optimized based on native algorithms
      MaxLocalProjDBSize // Maximum number of items allowed in a prefix-projected database before local processing
      MaxPatternLength // Maximum length of a frequent sequential pattern
      MinSupport // Minimum support of a frequent sequential pattern
    6. Newly added parameters

      | Parameter | Value Type | Description | Spark conf Parameter |
      |---|---|---|---|
      | localTimeout | Integer type. The value must be greater than or equal to 0. The default value is 300. | Timeout interval for local processing, in seconds | spark.boostkit.ml.ps.localTimeout |
      | filterCandidates | Boolean type. The default value is false. | Whether to filter the prefix candidate set | spark.boostkit.ml.ps.filterCandidates |
      | projDBStep | Double type. The default value is 10. | (Advanced parameter) Adjustment step of the projection data volume. Retain the default value. | spark.boostkit.ml.ps.projDBStep |

      An example is provided as follows:

      // `params` and `sequences` are assumed to be defined by the calling application.
      val prefixSpan = new PrefixSpan()
        .setMinSupport(params.minSupport)
        .setMaxPatternLength(params.maxPatternLength)
        .setMaxLocalProjDBSize(params.maxLocalProjDBSize)
      val model = prefixSpan.run(sequences)
      
    7. Output: frequent sequence model (PrefixSpanModel[Item])
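
The newly added parameters in item 6 each map to a Spark configuration key rather than to a setter. The following is a minimal sketch of setting them, assuming the keys are read from the application's SparkConf when run is called. The object name, application name, and local master are placeholders for illustration, and the values shown are simply the documented defaults.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.PrefixSpan

object PrefixSpanConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("PrefixSpanConfSketch") // placeholder name
      .setMaster("local[2]")              // placeholder master for a local run
      // Keys taken from the parameter table; values are the documented defaults.
      .set("spark.boostkit.ml.ps.localTimeout", "300")
      .set("spark.boostkit.ml.ps.filterCandidates", "false")
      .set("spark.boostkit.ml.ps.projDBStep", "10")
    val sc = new SparkContext(conf)

    val sequences = sc.parallelize(Seq(
      Array(Array(1, 2), Array(3)),
      Array(Array(1), Array(3, 2), Array(1, 2))
    ))
    val model = new PrefixSpan()
      .setMinSupport(0.5)
      .setMaxPatternLength(5)
      .run(sequences)
    println(model.freqSequences.count())
    sc.stop()
  }
}
```

The same keys can also be passed on the command line with the standard `--conf key=value` option of spark-submit.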
  • Example
    import org.apache.spark.mllib.fpm.PrefixSpan

    // Assumes an existing SparkContext `sc`, for example in spark-shell.
    // Each sequence is an array of itemsets; each itemset is an array of items.
    val sequences = sc.parallelize(Seq(
      Array(Array(1, 2), Array(3)),
      Array(Array(1), Array(3, 2), Array(1, 2)),
      Array(Array(1, 2), Array(5)),
      Array(Array(6))
    ), 2).cache()
    val prefixSpan = new PrefixSpan()
      .setMinSupport(0.5)
      .setMaxPatternLength(5)
    val model = prefixSpan.run(sequences)
    // Print each frequent sequence followed by its frequency.
    model.freqSequences.collect().foreach { freqSequence =>
      println(
        s"${freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]")}," +
          s" ${freqSequence.freq}")
    }
    
  • Result
    [[2]], 3
    [[3]], 2
    [[1]], 3
    [[2, 1]], 3
    [[1], [3]], 2
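
The frequencies above can be cross-checked without Spark. The sketch below is a brute-force support count in plain Scala, not the PrefixSpan algorithm itself. `SupportCheck`, `contains`, and `support` are hypothetical names. Itemsets are modeled as sets, so [2, 1] and [1, 2] denote the same itemset.

```scala
object SupportCheck {
  type Itemset = Set[Int]

  // True if `pattern` occurs in `sequence`: every pattern itemset must be a
  // subset of some sequence itemset, matched in strictly increasing positions.
  // Taking the earliest possible match at each step is sufficient.
  def contains(sequence: Seq[Itemset], pattern: Seq[Itemset]): Boolean =
    if (pattern.isEmpty) true
    else {
      val i = sequence.indexWhere(s => pattern.head.subsetOf(s))
      i >= 0 && contains(sequence.drop(i + 1), pattern.tail)
    }

  // Number of sequences in the database that contain the pattern.
  def support(db: Seq[Seq[Itemset]], pattern: Seq[Itemset]): Int =
    db.count(seq => contains(seq, pattern))

  // The four input sequences from the example.
  val db: Seq[Seq[Itemset]] = Seq(
    Seq(Set(1, 2), Set(3)),
    Seq(Set(1), Set(3, 2), Set(1, 2)),
    Seq(Set(1, 2), Set(5)),
    Seq(Set(6))
  )

  def main(args: Array[String]): Unit = {
    println(support(db, Seq(Set(1))))         // 3
    println(support(db, Seq(Set(2, 1))))      // 3
    println(support(db, Seq(Set(1), Set(3)))) // 2
    println(support(db, Seq(Set(5))))         // 1, so not frequent at minSupport 0.5
  }
}
```

With four input sequences and a minimum support of 0.5, only patterns found in at least two sequences are reported, which is why [[5]] and [[6]] do not appear in the result.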