Recommendation and Pattern Mining
Scenarios
As the Internet industry develops and information resources grow, it becomes increasingly difficult to find the desired information in large volumes of complex data, so methods for extracting valuable information are urgently needed. The alternating least squares (ALS) method applies collaborative filtering to users' historical purchase records to profile each customer's preferences for products.
The recommendation process works on three values: the user name, the product, and the preference degree. With explicit feedback, the user rates the product directly. Implicit feedback is data collected from user behavior, for example, the video watching duration. Optimized matrix partitioning keeps memory access and computation contiguous, which improves the cache hit ratio and reduces latency. As a result, according to actual tests, algorithms including ALS and SimRank deliver 50% better performance without compromising precision.
Principles
- SimRank
SimRank is a similarity measure that applies to any domain that has object-to-object relationships. It measures the similarity between objects based on their relationships with other objects. The similarity is obtained by iteratively solving SimRank equations.
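The iterative solution mentioned above can be illustrated with a minimal, single-machine sketch of the commonly stated SimRank recurrence: an object has similarity 1 with itself, and two different objects are as similar as the average similarity of their in-neighbors, scaled by a decay factor. The object name `SimRankSketch`, the `decay` parameter, and the adjacency-map input are assumptions made for illustration. This is not the optimized distributed implementation described in this document.

```scala
object SimRankSketch {
  // inNeighbors(v) lists the nodes that have an edge pointing to v.
  // Every node, including nodes without in-neighbors, must appear as a key.
  def simRank(
      inNeighbors: Map[Int, Seq[Int]],
      decay: Double,
      iterations: Int): Map[(Int, Int), Double] = {
    val nodes = inNeighbors.keys.toSeq.sorted
    // Iteration 0: a node is similar only to itself.
    var sim: Map[(Int, Int), Double] =
      (for (a <- nodes; b <- nodes) yield (a, b) -> (if (a == b) 1.0 else 0.0)).toMap
    for (_ <- 1 to iterations) {
      // Each pass recomputes all pairs from the scores of the previous pass.
      sim = (for (a <- nodes; b <- nodes) yield {
        val ia = inNeighbors(a)
        val ib = inNeighbors(b)
        val score =
          if (a == b) 1.0
          else if (ia.isEmpty || ib.isEmpty) 0.0
          else {
            // Sum of the similarities of all in-neighbor pairs.
            val total = (for (i <- ia; j <- ib) yield sim((i, j))).sum
            decay * total / (ia.size * ib.size)
          }
        (a, b) -> score
      }).toMap
    }
    sim
  }
}
```

For example, if nodes 1 and 2 are both pointed to only by node 0, then with `decay = 0.8` their similarity is 0.8 after the first iteration, because their only in-neighbor pair is (0, 0).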
- PrefixSpan
PrefixSpan is a typical algorithm for frequent pattern mining. It is used to mine frequent sequences that meet the minimum support level. PrefixSpan is efficient because no candidate sequences need to be generated, projected databases keep shrinking quickly, and the memory usage is stable during frequent sequential pattern mining.
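To make the notion of minimum support concrete, the following sketch counts how many sequences in a small database contain a given pattern. It is a brute-force check written for illustration, not the prefix-projection procedure that PrefixSpan itself uses, and the names `SupportSketch`, `occursIn`, and `support` are hypothetical.

```scala
object SupportSketch {
  // True when `pattern` occurs in `sequence`: each itemset of the pattern must be
  // a subset of some itemset of the sequence, in the same order.
  def occursIn[A](pattern: Seq[Set[A]], sequence: Seq[Set[A]]): Boolean = {
    var pos = 0 // index from which the next pattern itemset may be matched
    pattern.forall { itemset =>
      // Greedy matching: take the earliest itemset that contains the pattern itemset.
      val idx = sequence.indexWhere(s => itemset.subsetOf(s), pos)
      if (idx >= 0) { pos = idx + 1; true } else false
    }
  }

  // Fraction of sequences in the database that contain the pattern.
  def support[A](db: Seq[Seq[Set[A]]], pattern: Seq[Set[A]]): Double =
    db.count(seq => occursIn(pattern, seq)).toDouble / db.size
}
```

With the four sequences used in the programming example in this section, the pattern `[[1], [3]]` occurs in two of them, so its support is 0.5 and it is kept when the minimum support is 0.5.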
Programming Example
This example describes the programming with the PrefixSpan algorithm.
PrefixSpan provides MLlib APIs.
| Model API Type | Function API |
|---|---|
| MLlib API | `def run[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: JavaRDD[Sequence]): PrefixSpanModel[Item]` |
| MLlib API | `def run[Item](data: RDD[Array[Array[Item]]])(implicit arg0: ClassTag[Item]): PrefixSpanModel[Item]` |
Figure 1 shows the time sequence of the PrefixSpan algorithm.
- Function description
Import sequence data in RDD format, set the minimum support level and maximum length of a frequent sequential pattern, and call the run API to output all frequent sequences that meet the conditions.
- Input and output
- Package name: org.apache.spark.mllib.fpm
- Class name: PrefixSpan
- Method name: run
- Input: JavaRDD[Sequence] / RDD[Array[Array[Item]]] (full sequence data)
- Parameters optimized based on native algorithms
  - MaxLocalProjDBSize: maximum number of items allowed in a prefix-projected database before local processing
  - MaxPatternLength: maximum length of a frequent sequential pattern
  - MinSupport: minimum support of a frequent sequential pattern
- Newly added parameters
| Parameter | spark conf Parameter | Description | Value Type |
|---|---|---|---|
| localTimeout | spark.boostkit.ml.ps.localTimeout | Timeout interval for local processing, in seconds | Integer type. The value must be greater than or equal to 0. The default value is 300. |
| filterCandidates | spark.boostkit.ml.ps.filterCandidates | Whether to filter the prefix candidate set | Boolean type. The default value is false. |
| projDBStep | spark.boostkit.ml.ps.projDBStep | (Advanced parameter) Adjustment steps of the projection data volume. Retain the default value. | Double type. The default value is 10. |
An example is provided as follows:
```scala
val prefixSpan = new PrefixSpan()
  .setMinSupport(params.minSupport)
  .setMaxPatternLength(params.maxPatternLength)
  .setMaxLocalProjDBSize(params.maxLocalProjDBSize)
val model = prefixSpan.run(sequences)
```
- Output: frequent sequence model (PrefixSpanModel[Item])
- Example
```scala
import org.apache.spark.mllib.fpm.PrefixSpan

val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
val model = prefixSpan.run(sequences)
model.freqSequences.collect().foreach { freqSequence =>
  println(
    s"${freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]")}," +
      s" ${freqSequence.freq}")
}
```
- Result
```
[[2]], 3
[[3]], 2
[[1]], 3
[[2, 1]], 3
[[1], [3]], 2
```
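The newly added parameters listed earlier are given as spark conf keys, so one plausible way to set them is on the `SparkConf` used to create the Spark context. The sketch below is a configuration fragment under that assumption. The application name is hypothetical, and the values shown are the documented defaults.

```scala
import org.apache.spark.SparkConf

// Assumption: the keys are read from the Spark configuration of the application.
val conf = new SparkConf()
  .setAppName("PrefixSpanExample") // hypothetical application name
  .set("spark.boostkit.ml.ps.localTimeout", "300")      // seconds, must be >= 0
  .set("spark.boostkit.ml.ps.filterCandidates", "false") // whether to filter the prefix candidate set
  .set("spark.boostkit.ml.ps.projDBStep", "10")          // advanced parameter, keep the default
```

The same keys can usually be passed on the command line with `spark-submit --conf key=value` instead of being hard-coded.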
