Recommendation and Pattern Mining

Scenarios

As the Internet industry develops and information resources grow, it becomes increasingly difficult to find the desired information in large volumes of complex data, so techniques for extracting valuable information are in demand. The alternating least squares (ALS) method applies collaborative filtering to users' historical purchase records to profile each customer's preferences for products.

In the recommendation process, each record consists of a user, a product, and a preference degree. With explicit feedback, the user rates the product directly. Implicit feedback is instead collected from user behavior, for example, the video watching duration. The optimized matrix partitioning technology keeps memory access and computation contiguous, which improves the cache hit ratio and reduces latency. According to actual tests, this improves the performance of algorithms including ALS and SimRank by 50% without compromising their precision.

Principles

ALS
ALS is a collaborative recommendation algorithm.
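
As background, ALS is commonly described as factorizing the user-product rating matrix into user factors and product factors, then alternately fixing one set and solving a least-squares problem for the other. The following is a minimal sketch of that idea for rank-1 factors in plain Scala, without Spark. The names Rank1Als, Rating, fit, and predict are hypothetical and are not part of MLlib or of the optimized implementation described here.

```scala
object Rank1Als {
  // One observed rating; unobserved (user, item) pairs are simply absent.
  final case class Rating(user: Int, item: Int, value: Double)

  // Alternates between solving for user factors x and item factors y.
  // With rank 1, each least-squares update reduces to a scalar formula.
  def fit(ratings: Seq[Rating], numUsers: Int, numItems: Int,
          lambda: Double, iterations: Int): (Array[Double], Array[Double]) = {
    val x = Array.fill(numUsers)(1.0) // user factors
    val y = Array.fill(numItems)(1.0) // item factors
    for (_ <- 0 until iterations) {
      // Fix y and solve the regularized least-squares problem for each user.
      for (u <- 0 until numUsers) {
        val obs = ratings.filter(_.user == u)
        val den = lambda + obs.map(r => y(r.item) * y(r.item)).sum
        if (den > 0) x(u) = obs.map(r => r.value * y(r.item)).sum / den
      }
      // Fix x and solve the same kind of problem for each item.
      for (i <- 0 until numItems) {
        val obs = ratings.filter(_.item == i)
        val den = lambda + obs.map(r => x(r.user) * x(r.user)).sum
        if (den > 0) y(i) = obs.map(r => r.value * x(r.user)).sum / den
      }
    }
    (x, y)
  }

  // Predicted preference is the product of the two factors.
  def predict(x: Array[Double], y: Array[Double], user: Int, item: Int): Double =
    x(user) * y(item)
}
```

Calling Rank1Als.fit on ratings generated from an exact rank-1 matrix with one entry held out, and then Rank1Als.predict on the held-out pair, is one way to see the factors fill in a missing preference. Real ALS uses higher-rank factors and solves a small linear system per user and per product.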

SimRank
SimRank is a similarity measure that applies to any domain that has object-to-object relationships. It measures the similarity between objects based on their relationships with other objects. The similarity is obtained by iteratively solving SimRank equations.
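
The iteration can be sketched as follows, assuming the standard SimRank formulation: an object is fully similar to itself, and the similarity of two different objects is a decay factor times the average similarity of their in-neighbors. This is a plain Scala illustration for small graphs only. The names SimRankSketch and simRank, the decay factor c, and the dense all-pairs update are assumptions of this sketch, not the optimized implementation.

```scala
object SimRankSketch {
  // inNeighbors maps each node to the nodes that point to it.
  // Returns the similarity score for every ordered pair of nodes.
  def simRank(nodes: Seq[Int], inNeighbors: Map[Int, Seq[Int]],
              c: Double, iterations: Int): Map[(Int, Int), Double] = {
    // Start from s(a, a) = 1 and s(a, b) = 0 for a != b.
    var sim: Map[(Int, Int), Double] =
      (for (a <- nodes; b <- nodes) yield (a, b) -> (if (a == b) 1.0 else 0.0)).toMap

    for (_ <- 0 until iterations) {
      sim = (for (a <- nodes; b <- nodes) yield {
        val ia = inNeighbors.getOrElse(a, Seq.empty)
        val ib = inNeighbors.getOrElse(b, Seq.empty)
        val score =
          if (a == b) 1.0
          else if (ia.isEmpty || ib.isEmpty) 0.0 // no in-neighbors: similarity is 0
          else {
            // Average similarity over all pairs of in-neighbors, scaled by c.
            val total = (for (i <- ia; j <- ib) yield sim((i, j))).sum
            c * total / (ia.size * ib.size)
          }
        (a, b) -> score
      }).toMap
    }
    sim
  }
}
```

For a graph in which node 0 points to both node 1 and node 2, nodes 1 and 2 share their only in-neighbor, so their score equals the decay factor c.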

PrefixSpan
PrefixSpan is a typical algorithm for frequent pattern mining. It is used to mine frequent sequences that meet the minimum support level. PrefixSpan is efficient because no candidate sequences need to be generated, projected databases keep shrinking quickly, and the memory usage is stable during frequent sequential pattern mining.
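
To illustrate why no candidate generation is needed and why projected databases shrink, here is a simplified sketch of prefix projection in plain Scala. It assumes every element of a sequence is a single item (no multi-item itemsets), which is a deliberate simplification of what the MLlib API accepts. PrefixSpanSketch and mine are hypothetical names.

```scala
object PrefixSpanSketch {
  // Mines frequent sequential patterns from sequences of single items.
  // minCount is an absolute number of supporting sequences.
  def mine(db: Seq[Seq[Int]], minCount: Int, maxLen: Int): Map[Seq[Int], Int] = {
    def grow(prefix: Seq[Int], projected: Seq[Seq[Int]]): Map[Seq[Int], Int] = {
      if (prefix.length >= maxLen) Map.empty
      else {
        // Count each item at most once per projected sequence.
        val counts = projected.flatMap(_.distinct).groupBy(identity)
          .map { case (item, occurrences) => item -> occurrences.size }
        counts.filter(_._2 >= minCount).flatMap { case (item, count) =>
          val newPrefix = prefix :+ item
          // Project: keep only the suffix after the first occurrence of the item.
          // Sequences without the item drop out, so the database keeps shrinking.
          val newProjected = projected.flatMap { s =>
            val idx = s.indexOf(item)
            if (idx >= 0) Some(s.drop(idx + 1)) else None
          }
          grow(newPrefix, newProjected) + (newPrefix -> count)
        }.toMap
      }
    }
    grow(Seq.empty, db)
  }
}
```

Patterns are only ever extended with items that are already frequent in the projected database of their prefix, which is why no separate candidate set has to be generated and tested.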

Programming Example

This example describes how to program with the PrefixSpan algorithm.

PrefixSpan provides MLlib APIs.

Model API Type	Function API
MLlib API	def run[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: JavaRDD[Sequence]): PrefixSpanModel[Item]
MLlib API	def run[Item](data: RDD[Array[Array[Item]]])(implicit arg0: ClassTag[Item]): PrefixSpanModel[Item]

Figure 1 shows the time sequence of the PrefixSpan algorithm.

Figure 1 Time sequence of the PrefixSpan algorithm

Function description
Import sequence data in RDD format, set the minimum support level and maximum length of a frequent sequential pattern, and call the run API to output all frequent sequences that meet the conditions.

Input and output

Package name: org.apache.spark.mllib.fpm
Class name: PrefixSpan
Method name: run
Input: JavaRDD[Sequence] / RDD[Array[Array[Item]]] (full sequence data)

Parameters optimized based on native algorithms

MaxLocalProjDBSize // Maximum number of items allowed in a prefix-projected database before local processing
MaxPatternLength // Maximum length of a frequent sequential pattern
MinSupport // Minimum support of a frequent sequential pattern

Newly added parameters

Parameter	spark conf Parameter	Description	Value Type
localTimeout	spark.boostkit.ml.ps.localTimeout	Timeout interval for local processing, in seconds	Integer type. The value must be greater than or equal to 0. The default value is 300.
filterCandidates	spark.boostkit.ml.ps.filterCandidates	Whether to filter the prefix candidate set	Boolean type. The default value is false.
projDBStep	spark.boostkit.ml.ps.projDBStep	(Advanced parameter) Adjustment steps of the projection data volume. Retain the default value.	Double type. The default value is 10.
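
The newly added parameters have no setter on PrefixSpan in the table above, so the sketch below assumes they are read from the Spark configuration under the listed spark conf keys. It sets each key to its documented default through SparkConf.set. The application name is a placeholder.

```scala
import org.apache.spark.SparkConf

// Keys and default values are taken from the table above.
// Assumption: the algorithm reads them from the SparkConf of the running application.
val conf = new SparkConf()
  .setAppName("PrefixSpanExample") // placeholder application name
  .set("spark.boostkit.ml.ps.localTimeout", "300")      // seconds, must be >= 0
  .set("spark.boostkit.ml.ps.filterCandidates", "false") // do not filter the prefix candidate set
  .set("spark.boostkit.ml.ps.projDBStep", "10")          // advanced parameter, default retained
```

The same keys could equally be passed on the command line with the --conf option of spark-submit.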

An example is provided as follows:

val prefixSpan = new PrefixSpan()
  .setMinSupport(params.minSupport)
  .setMaxPatternLength(params.maxPatternLength)
  .setMaxLocalProjDBSize(params.maxLocalProjDBSize)
val model = prefixSpan.run(sequences)

Output: frequent sequence model (PrefixSpanModel[Item])

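Example

import org.apache.spark.mllib.fpm.PrefixSpan

// Assumes an existing SparkContext named sc, for example, the one that spark-shell provides.
// Each sequence is an array of itemsets, and each itemset is an array of items.
val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()

// A pattern is frequent if at least 50% of the sequences contain it.
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
val model = prefixSpan.run(sequences)

// Print each frequent sequence followed by the number of sequences that contain it.
model.freqSequences.collect().foreach { freqSequence =>
  println(
    s"${freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]")}," +
      s" ${freqSequence.freq}")
}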

Result

[[2]], 3
[[3]], 2
[[1]], 3
[[2, 1]], 3
[[1], [3]], 2
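
The counts above can be cross-checked without Spark. The following sketch counts, by brute force, how many of the four input sequences contain a given pattern, treating each itemset as an unordered set (so [[2, 1]] is the itemset {1, 2}). SupportCheck, contains, and support are hypothetical helper names. With a minimum support of 0.5 over 4 sequences, a pattern needs at least 2 supporting sequences.

```scala
object SupportCheck {
  type Sequence = Seq[Set[Int]]

  // True if every itemset of `pattern` is a subset of some itemset of `seq`,
  // with the matched itemsets appearing in the same order. Greedy leftmost
  // matching is sufficient for this containment test.
  def contains(seq: Sequence, pattern: Sequence): Boolean = {
    var pos = 0
    pattern.forall { p =>
      val idx = seq.indexWhere(s => p.subsetOf(s), pos)
      if (idx >= 0) { pos = idx + 1; true } else false
    }
  }

  // Number of sequences in the database that contain the pattern.
  def support(db: Seq[Sequence], pattern: Sequence): Int =
    db.count(contains(_, pattern))

  // The same four sequences as in the example above.
  val db: Seq[Sequence] = Seq(
    Seq(Set(1, 2), Set(3)),
    Seq(Set(1), Set(3, 2), Set(1, 2)),
    Seq(Set(1, 2), Set(5)),
    Seq(Set(6))
  )
}
```

For instance, SupportCheck.support(SupportCheck.db, Seq(Set(1), Set(3))) counts the first two sequences, matching the frequency 2 reported for [[1], [3]], while a pattern such as [[2], [3]] is supported by only the first sequence and is therefore not reported.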
