PrefixSpan
The PrefixSpan algorithm provides MLlib APIs.
| Model API Type | Function API |
|---|---|
| MLlib API | def run[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: JavaRDD[Sequence]): PrefixSpanModel[Item] |
| MLlib API | def run[Item](data: RDD[Array[Array[Item]]])(implicit arg0: ClassTag[Item]): PrefixSpanModel[Item] |
MLlib API
- Input and output
- Package name: package org.apache.spark.mllib.fpm
- Class name: PrefixSpan
- Method name: run
- Input: JavaRDD[Sequence] / RDD[Array[Array[Item]]] (full sequence data)
- Parameters optimized based on native algorithms

      MaxLocalProjDBSize // Maximum number of items allowed in a prefix-projected database before local processing
      MaxPatternLength   // Maximum length of a frequent sequential pattern
      MinSupport         // Minimum support of a frequent sequential pattern

- Newly added parameters

| Parameter | Value Type | Description | Spark conf Parameter |
|---|---|---|---|
| localTimeout | Integer type. The value must be greater than or equal to 0. The default value is 300. | Timeout interval for local processing, in seconds | spark.boostkit.ml.ps.localTimeout |
| filterCandidates | Boolean type. The default value is false. | Whether to filter the prefix candidate set | spark.boostkit.ml.ps.filterCandidates |
| projDBStep | Double type. The default value is 10. | (Advanced parameter) Adjustment step of the projection data volume. Retain the default value. | spark.boostkit.ml.ps.projDBStep |

An example is provided as follows:

    // `params` and `sequences` are defined by the calling application.
    val prefixSpan = new PrefixSpan()
      .setMinSupport(params.minSupport)
      .setMaxPatternLength(params.maxPatternLength)
      .setMaxLocalProjDBSize(params.maxLocalProjDBSize)
    val model = prefixSpan.run(sequences)

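
The snippet above uses only the native setters. The newly added parameters are listed with Spark configuration keys, so one way to supply them is through the application's SparkConf. The following is a minimal sketch under that assumption: the keys and default values are taken from the parameter table, while the application name is hypothetical and the exact point at which the library reads these keys is not stated in this document.

```scala
import org.apache.spark.SparkConf

// Keys come from the "Newly added parameters" table; the values shown are the
// documented defaults, so this configuration should behave like an unset one.
val conf = new SparkConf()
  .setAppName("PrefixSpanExample") // hypothetical application name
  .set("spark.boostkit.ml.ps.localTimeout", "300")      // seconds, must be >= 0
  .set("spark.boostkit.ml.ps.filterCandidates", "false") // candidate filtering off
  .set("spark.boostkit.ml.ps.projDBStep", "10")          // advanced; keep the default

// Pass `conf` to the SparkContext (or SparkSession builder) that creates the
// RDD handed to PrefixSpan.run.
```

The same keys can usually be given on the command line instead, using the standard `--conf key=value` option of spark-submit.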
- Output: frequent sequence model (PrefixSpanModel[Item])
- Example

      import org.apache.spark.mllib.fpm.PrefixSpan

      val sequences = sc.parallelize(Seq(
        Array(Array(1, 2), Array(3)),
        Array(Array(1), Array(3, 2), Array(1, 2)),
        Array(Array(1, 2), Array(5)),
        Array(Array(6))
      ), 2).cache()
      val prefixSpan = new PrefixSpan()
        .setMinSupport(0.5)
        .setMaxPatternLength(5)
      val model = prefixSpan.run(sequences)
      model.freqSequences.collect().foreach { freqSequence =>
        println(
          s"${freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]")}," +
            s" ${freqSequence.freq}")
      }

- Result

      [[2]], 3
      [[3]], 2
      [[1]], 3
      [[2, 1]], 3
      [[1], [3]], 2
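
To see where the frequencies in the result come from, the support of each pattern can be recounted by brute force in plain Scala, without Spark. This is only an illustrative check, not how PrefixSpan works internally (the algorithm uses prefix projection rather than testing candidates one by one). The object `SupportCheck` and its helpers are hypothetical names, and the rounding rule in `minCount` is an assumption based on the behavior of open source Spark.

```scala
object SupportCheck {
  type Itemset = Set[Int]
  type Sequence = Seq[Itemset]

  // The four input sequences from the example, with each itemset as a Set.
  val data: Seq[Sequence] = Seq(
    Seq(Set(1, 2), Set(3)),
    Seq(Set(1), Set(3, 2), Set(1, 2)),
    Seq(Set(1, 2), Set(5)),
    Seq(Set(6))
  )

  // Assumed rule: a fractional minSupport is turned into an absolute count
  // by rounding up, e.g. 0.5 of 4 sequences gives a threshold of 2.
  def minCount(minSupport: Double, total: Long): Long =
    math.ceil(minSupport * total).toLong

  // True if `pattern` occurs in `seq`: every pattern itemset must be a subset
  // of some itemset of `seq`, in order, each in a strictly later position.
  // Matching each pattern itemset at its earliest position is sufficient.
  def contains(seq: Sequence, pattern: Sequence): Boolean =
    if (pattern.isEmpty) true
    else {
      val i = seq.indexWhere(itemset => pattern.head.subsetOf(itemset))
      i >= 0 && contains(seq.drop(i + 1), pattern.tail)
    }

  // Number of input sequences that contain the pattern.
  def support(db: Seq[Sequence], pattern: Sequence): Int =
    db.count(contains(_, pattern))

  def main(args: Array[String]): Unit = {
    val threshold = minCount(0.5, data.size)
    val candidates: Seq[Sequence] = Seq(
      Seq(Set(1)),         // support 3
      Seq(Set(2)),         // support 3
      Seq(Set(3)),         // support 2
      Seq(Set(1, 2)),      // support 3
      Seq(Set(1), Set(3)), // support 2
      Seq(Set(2), Set(3))  // support 1, below the threshold of 2
    )
    candidates.foreach { p =>
      val s = support(data, p)
      println(s"$p support=$s frequent=${s >= threshold}")
    }
  }
}
```

With a minimum support of 0.5 over four sequences, the threshold is 2, which is why a pattern such as [[2], [3]] (present in only one sequence) does not appear in the result.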