Recommendation and Pattern Mining

Scenarios

As the Internet industry develops and the volume of information grows, it becomes increasingly difficult to find the desired information in large, complex data sets, so methods for extracting valuable information are urgently needed. The alternating least squares (ALS) method uses collaborative filtering to relate a user's historical purchase records to products, which produces a more accurate profile of the customer's preferences.

In the recommendation process, each record is described by a user name, a product, and a preference degree. With explicit feedback, the user rates the product directly. Implicit feedback is inferred from user behavior, for example, the video watching duration. The optimized matrix partitioning technology keeps memory access and computation contiguous, which improves the cache hit ratio and reduces latency. As a result, in actual tests, algorithms including ALS and SimRank deliver 50% better performance without compromising precision.
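
To make the (user, product, preference degree) records concrete, the following is a minimal sketch of the alternating idea behind ALS. It assumes a single latent factor per user and per product (rank 1), so each least squares step has a closed form. The names `AlsSketch`, `Rating`, `fit`, and `predict` are illustrative and are not part of any library API; production implementations use higher ranks and distributed solvers.

```scala
object AlsSketch {
  // One observed record: who rated what, and how strongly (illustrative type).
  final case class Rating(userId: Int, productId: Int, preference: Double)

  // Solve the regularized least squares problem for one side while the other
  // side is held fixed. With rank 1 the solution for each id is
  //   sum(preference * fixedFactor) / (sum(fixedFactor^2) + lambda).
  private def solveSide(
      ratings: Seq[Rating],
      ownId: Rating => Int,
      otherId: Rating => Int,
      fixed: Map[Int, Double],
      lambda: Double): Map[Int, Double] =
    ratings.groupBy(ownId).map { case (id, rs) =>
      val numerator = rs.map(r => r.preference * fixed(otherId(r))).sum
      val denominator = rs.map(r => math.pow(fixed(otherId(r)), 2)).sum + lambda
      id -> (numerator / denominator)
    }

  // Alternate between the user side and the product side for a fixed number of sweeps.
  def fit(
      ratings: Seq[Rating],
      iterations: Int,
      lambda: Double): (Map[Int, Double], Map[Int, Double]) = {
    require(iterations >= 1, "at least one sweep is needed")
    var productFactors = ratings.map(_.productId).distinct.map(id => id -> 1.0).toMap
    var userFactors = Map.empty[Int, Double]
    for (_ <- 1 to iterations) {
      userFactors = solveSide(ratings, _.userId, _.productId, productFactors, lambda)
      productFactors = solveSide(ratings, _.productId, _.userId, userFactors, lambda)
    }
    (userFactors, productFactors)
  }

  // Predicted preference is the product of the two learned factors.
  def predict(
      userFactors: Map[Int, Double],
      productFactors: Map[Int, Double],
      userId: Int,
      productId: Int): Double =
    userFactors(userId) * productFactors(productId)

  def main(args: Array[String]): Unit = {
    val ratings = Seq(
      Rating(0, 0, 1.0), Rating(0, 1, 2.0), Rating(0, 2, 3.0),
      Rating(1, 0, 2.0), Rating(1, 1, 4.0)) // user 1 has not rated product 2
    val (users, products) = fit(ratings, iterations = 100, lambda = 0.0)
    println(predict(users, products, userId = 1, productId = 2))
  }
}
```

In this toy data set, user 1 rates every product twice as highly as user 0, so the sketch fills in the missing preference from that pattern. A nonzero `lambda` trades some of that fit for stability on sparse data.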

Principles

  • ALS

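    ALS is a collaborative filtering recommendation algorithm. It factorizes the user-product preference matrix into low-rank user and product factor matrices by alternately fixing one factor matrix and solving a least squares problem for the other.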

  • SimRank

    SimRank is a similarity measure that applies to any domain that has object-to-object relationships. It measures the similarity between objects based on their relationships with other objects. The similarity is obtained by iteratively solving SimRank equations.

  • PrefixSpan

    PrefixSpan is a typical algorithm for frequent pattern mining. It is used to mine frequent sequences that meet the minimum support level. PrefixSpan is efficient because no candidate sequences need to be generated, projected databases keep shrinking quickly, and the memory usage is stable during frequent sequential pattern mining.
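
The iterative solution of the SimRank equations mentioned above can be sketched as follows. This is a naive illustration that assumes the standard SimRank formulation: an object is maximally similar to itself, and two different objects are similar in proportion to the average similarity of their in-neighbors, scaled by a decay factor. The object name `SimRankSketch`, the `decay` parameter, and the graph representation are assumptions made for the example, not the optimized implementation described in this document.

```scala
object SimRankSketch {
  // inNeighbors(v) is the set of objects that have a relationship pointing to v.
  def simRank(
      inNeighbors: Map[String, Set[String]],
      decay: Double,
      iterations: Int): Map[(String, String), Double] = {
    val nodes = (inNeighbors.keySet ++ inNeighbors.values.flatten).toSeq.sorted
    def in(v: String): Seq[String] = inNeighbors.getOrElse(v, Set.empty[String]).toSeq

    // Start from s(a, a) = 1 and s(a, b) = 0 for a != b.
    var scores: Map[(String, String), Double] =
      (for (a <- nodes; b <- nodes) yield (a, b) -> (if (a == b) 1.0 else 0.0)).toMap

    for (_ <- 1 to iterations) {
      val previous = scores
      scores = (for (a <- nodes; b <- nodes) yield {
        val value =
          if (a == b) 1.0
          else if (in(a).isEmpty || in(b).isEmpty) 0.0
          else {
            // Sum the previous similarity over every pair of in-neighbors.
            val total = (for (i <- in(a); j <- in(b)) yield previous((i, j))).sum
            decay * total / (in(a).size * in(b).size)
          }
        (a, b) -> value
      }).toMap
    }
    scores
  }

  def main(args: Array[String]): Unit = {
    // A points to both B and C, so B and C share their only in-neighbor.
    val graph = Map("B" -> Set("A"), "C" -> Set("A"))
    simRank(graph, decay = 0.8, iterations = 5).toSeq.sorted.foreach(println)
  }
}
```

The sketch recomputes every pair in every iteration, which is quadratic in the number of objects. It is meant only to show what the equations compute.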

Programming Example

This example shows how to program with the PrefixSpan algorithm.

PrefixSpan provides MLlib APIs.

Model API Type | Function API
MLlib API      | def run[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: JavaRDD[Sequence]): PrefixSpanModel[Item]
               | def run[Item](data: RDD[Array[Array[Item]]])(implicit arg0: ClassTag[Item]): PrefixSpanModel[Item]

Figure 1 shows the time sequence of the PrefixSpan algorithm.

Figure 1 Time sequence of the PrefixSpan algorithm
  • Function description

    Import sequence data in RDD format, set the minimum support level and maximum length of a frequent sequential pattern, and call the run API to output all frequent sequences that meet the conditions.

  • Input and output
    1. Package name: package org.apache.spark.mllib.fpm
    2. Class name: PrefixSpan
    3. Method name: run
    4. Input: JavaRDD[Sequence] / RDD[Array[Array[Item]]] (full sequence data)
    5. Parameters optimized based on native algorithms
      MaxLocalProjDBSize // Maximum number of items allowed in a prefix-projected database before local processing
      MaxPatternLength // Maximum length of a frequent sequential pattern
      MinSupport // Minimum support of a frequent sequential pattern
    6. Newly added parameters

      Parameter        | spark conf Parameter                  | Description                                                                                   | Value Type
      localTimeout     | spark.boostkit.ml.ps.localTimeout     | Timeout interval for local processing, in seconds                                             | Integer type. The value must be greater than or equal to 0. The default value is 300.
      filterCandidates | spark.boostkit.ml.ps.filterCandidates | Whether to filter the prefix candidate set                                                    | Boolean type. The default value is false.
      projDBStep       | spark.boostkit.ml.ps.projDBStep       | (Advanced parameter) Adjustment steps of the projection data volume. Retain the default value. | Double type. The default value is 10.

      An example is provided as follows:

      // params (the user-supplied settings) and sequences (the input RDD) are defined by the caller.
      val prefixSpan = new PrefixSpan()
        .setMinSupport(params.minSupport)
        .setMaxPatternLength(params.maxPatternLength)
        .setMaxLocalProjDBSize(params.maxLocalProjDBSize)
      val model = prefixSpan.run(sequences)
      
    7. Output: frequent sequence model (PrefixSpanModel[Item])
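
The newly added parameters are identified by spark conf keys, so one way to supply them is through `--conf key=value` arguments to `spark-submit`. The following sketch collects the three keys with the defaults and the value constraint listed in the table above and renders them as such arguments. The `PsSettings` case class and its methods are hypothetical helpers written for this example; only the key names, defaults, and the `localTimeout` constraint come from this document.

```scala
object PrefixSpanConfSketch {
  // Hypothetical holder for the newly added parameters, with the documented defaults.
  final case class PsSettings(
      localTimeout: Int = 300,
      filterCandidates: Boolean = false,
      projDBStep: Double = 10.0) {

    // The document requires localTimeout to be greater than or equal to 0.
    require(localTimeout >= 0, "localTimeout must be greater than or equal to 0")

    // Key/value pairs using the spark conf parameter names from the table.
    def toConfPairs: Seq[(String, String)] = Seq(
      "spark.boostkit.ml.ps.localTimeout" -> localTimeout.toString,
      "spark.boostkit.ml.ps.filterCandidates" -> filterCandidates.toString,
      "spark.boostkit.ml.ps.projDBStep" -> projDBStep.toString)

    // Render the pairs as spark-submit arguments: --conf key=value.
    def toSubmitArgs: Seq[String] =
      toConfPairs.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
  }

  def main(args: Array[String]): Unit =
    println(PsSettings(localTimeout = 600, filterCandidates = true).toSubmitArgs.mkString(" "))
}
```

Because `projDBStep` is described as an advanced parameter whose default should be retained, the sketch leaves it at its default unless the caller overrides it explicitly.
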
  • Example
    import org.apache.spark.mllib.fpm.PrefixSpan

    // sc is the SparkContext, for example the one provided by spark-shell.
    // Each sequence is an array of itemsets, and each itemset is an array of items.
    val sequences = sc.parallelize(Seq(
      Array(Array(1, 2), Array(3)),
      Array(Array(1), Array(3, 2), Array(1, 2)),
      Array(Array(1, 2), Array(5)),
      Array(Array(6))
    ), 2).cache()
    // Keep patterns that occur in at least 50% of the sequences, with a maximum pattern length of 5.
    val prefixSpan = new PrefixSpan()
      .setMinSupport(0.5)
      .setMaxPatternLength(5)
    val model = prefixSpan.run(sequences)
    // Print each frequent sequence followed by its frequency.
    model.freqSequences.collect().foreach { freqSequence =>
      println(
        s"${freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]")}," +
          s" ${freqSequence.freq}")
    }
    
  • Result
    [[2]], 3
    [[3]], 2
    [[1]], 3
    [[2, 1]], 3
    [[1], [3]], 2
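
The frequencies in the result can be checked by hand with a brute-force support count. The sketch below does not use Spark. It counts how many of the four input sequences contain a given pattern, where a pattern is contained if each of its itemsets is a subset of a distinct itemset of the sequence, in order. With a minimum support of 0.5 and four sequences, a pattern must occur in at least two sequences to be frequent. The names `SupportCheckSketch`, `containsPattern`, and `support` are illustrative.

```scala
object SupportCheckSketch {
  // A sequence is an ordered list of itemsets.
  type Sequence = Seq[Set[Int]]

  // Greedy leftmost matching: each pattern itemset must be a subset of some
  // itemset of the sequence located after the previously matched position.
  def containsPattern(sequence: Sequence, pattern: Sequence): Boolean = {
    var position = 0
    pattern.forall { itemset =>
      val index = sequence.indexWhere(candidate => itemset.subsetOf(candidate), position)
      if (index >= 0) { position = index + 1; true } else false
    }
  }

  // Support is the number of sequences in the database that contain the pattern.
  def support(database: Seq[Sequence], pattern: Sequence): Int =
    database.count(sequence => containsPattern(sequence, pattern))

  // The same four sequences as in the example input.
  val database: Seq[Sequence] = Seq(
    Seq(Set(1, 2), Set(3)),
    Seq(Set(1), Set(3, 2), Set(1, 2)),
    Seq(Set(1, 2), Set(5)),
    Seq(Set(6)))

  def main(args: Array[String]): Unit = {
    println(support(database, Seq(Set(1), Set(3)))) // item 1 followed later by item 3
    println(support(database, Seq(Set(2), Set(3)))) // item 2 followed later by item 3
  }
}
```

The pattern of item 1 followed by item 3 occurs in the first two sequences, which matches the frequency of 2 reported above. Item 2 followed by item 3 occurs only in the first sequence, because in the second sequence the two items appear in the same itemset, so that pattern falls below the minimum support and is absent from the result.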