PrefixSpan
The PrefixSpan algorithm uses MLlib APIs.
| Model API Type | Function API |
|---|---|
| MLlib API | `def run[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: JavaRDD[Sequence]): PrefixSpanModel[Item]`<br>`def run[Item](data: RDD[Array[Array[Item]]])(implicit arg0: ClassTag[Item]): PrefixSpanModel[Item]` |
MLlib API
- Input and output
- Package name: package org.apache.spark.mllib.fpm
- Class name: PrefixSpan
- Method name: run
- Input: full sequence data (JavaRDD[Sequence]/RDD[Array[Array[Item]]])
- Algorithm parameters
| Algorithm Parameter | Description |
|---|---|
| MaxLocalProjDBSize | Maximum number of items allowed in a prefix-projected database before local processing |
| MaxPatternLength | Maximum length of a frequent sequential pattern |
| MinSupport | Minimum support level of a frequent sequential pattern |
- Added algorithm parameters
| Parameter | spark conf Parameter Name | Description | Type |
|---|---|---|---|
| localTimeout | spark.boostkit.ml.ps.localTimeout | Timeout interval for local processing, in seconds. | Integer type. The value must be greater than or equal to 0. The default value is 300. |
| filterCandidates | spark.boostkit.ml.ps.filterCandidates | Whether to filter the prefix candidate set. | Boolean type. The default value is false. |
| projDBStep | spark.boostkit.ml.ps.projDBStep | Adjustment step of the projection data volume. This is an advanced parameter; retain the default value. | Double type. The default value is 10. |
An example is provided as follows:
```scala
val prefixSpan = new PrefixSpan()
  .setMinSupport(params.minSupport)
  .setMaxPatternLength(params.maxPatternLength)
  .setMaxLocalProjDBSize(params.maxLocalProjDBSize)
val model = prefixSpan.run(sequences)
```
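The added parameters are listed with spark conf names, so they are presumably supplied through the Spark configuration rather than through setters on `PrefixSpan`. The following is a minimal sketch under that assumption. The keys and default values come from the table of added parameters; the application name is a placeholder, and passing the values as strings is simply how `SparkConf.set` works.

```scala
import org.apache.spark.SparkConf

// Keys and defaults are taken from the added-parameter table;
// "PrefixSpanExample" is a placeholder application name.
val conf = new SparkConf()
  .setAppName("PrefixSpanExample")
  .set("spark.boostkit.ml.ps.localTimeout", "300")       // seconds, must be >= 0
  .set("spark.boostkit.ml.ps.filterCandidates", "false") // Boolean
  .set("spark.boostkit.ml.ps.projDBStep", "10")          // advanced; default retained
```

The same keys can also be passed on the command line with the `--conf key=value` option of `spark-submit`.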
- Output: frequent sequence model (PrefixSpanModel[Item])
- Sample usage
```scala
import org.apache.spark.mllib.fpm.PrefixSpan

val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
val model = prefixSpan.run(sequences)
model.freqSequences.collect().foreach { freqSequence =>
  println(
    s"${freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]")}," +
      s" ${freqSequence.freq}")
}
```
- Sample result
```
[[2]], 3
[[3]], 2
[[1]], 3
[[2, 1]], 3
[[1], [3]], 2
```
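To see where these frequencies come from, the support of a pattern can be recomputed by hand. The sketch below is plain Scala with no Spark dependency and is not the PrefixSpan implementation: it is a brute-force check that counts how many of the four sample sequences contain a given pattern. The names `SupportCheck`, `contains`, and `support` are hypothetical helpers, and the minimum count is assumed to be the support fraction multiplied by the number of sequences, rounded up.

```scala
object SupportCheck {
  type Sequence = Seq[Set[Int]]

  // The four sequences from the sample usage, with each itemset as a Set.
  val db: Seq[Sequence] = Seq(
    Seq(Set(1, 2), Set(3)),
    Seq(Set(1), Set(3, 2), Set(1, 2)),
    Seq(Set(1, 2), Set(5)),
    Seq(Set(6))
  )

  // A sequence contains a pattern if the pattern's itemsets can be matched,
  // in order, as subsets of distinct itemsets of the sequence.
  // Greedy leftmost matching is sufficient for this check.
  def contains(sequence: Sequence, pattern: Sequence): Boolean =
    sequence.foldLeft(pattern) { (remaining, itemset) =>
      if (remaining.nonEmpty && remaining.head.subsetOf(itemset)) remaining.tail
      else remaining
    }.isEmpty

  // Number of sequences in the database that contain the pattern.
  def support(database: Seq[Sequence], pattern: Sequence): Int =
    database.count(contains(_, pattern))

  def main(args: Array[String]): Unit = {
    // Assumed threshold: ceil(minSupport * number of sequences) = ceil(0.5 * 4) = 2
    val minCount = math.ceil(0.5 * db.size).toInt
    println(s"minCount = $minCount")
    println(support(db, Seq(Set(1), Set(3)))) // 2, reported as frequent
    println(support(db, Seq(Set(1, 2))))      // 3, reported as frequent
    println(support(db, Seq(Set(2), Set(3)))) // 1, below the threshold
  }
}
```

Under these assumptions, `[[1], [3]]` and the itemset `{1, 2}` reach the threshold of 2 and appear in the sample result, while `[[2], [3]]` occurs in only one sequence and is not reported.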