Recommendation and Pattern Mining

PrefixSpan

This section describes how the PrefixSpan algorithm parameters affect model performance.

Parameter

Description

Suggestion

numPartitions

Number of Spark partitions. A large value creates many tasks, which increases the scheduling time. A value that is too small may leave some nodes without tasks and increases the data volume processed by each partition, which in turn raises the memory usage of each agent node.

Perform a grid search over values from 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).

localTimeout

Timeout interval for local resolution, in seconds. This is a newly added parameter and is set by the spark.boostkit.ml.ps.localTimeout parameter.

If local resolution takes longer than other phases, set this parameter to a smaller value. The value 300 is recommended.

filterCandidates

Whether to filter the prefix candidate set. This is a newly added parameter and is set by the spark.boostkit.ml.ps.filterCandidates parameter. If this parameter is enabled, the communication volume decreases and the calculation workload increases.

Boolean. The default value is false.

projDBStep

Attenuation rate of the projection data volume. This is a newly added parameter and is set by the spark.boostkit.ml.ps.projDBStep parameter. A larger value reduces the calculation workload of local processing. You are advised to retain the default value.

Double. The default value is 10.
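The suggestions in the table above can be sketched in a few lines of Python. This is a minimal illustration, not part of the library: the helper names `partition_grid` and `prefixspan_conf`, the five-step grid, and the example cluster size (4 cores per executor, 12 executors) are all assumptions chosen for the example. Only the configuration keys and the suggested or default values come from the table.

```python
def partition_grid(executor_cores, num_executors, low=0.5, high=1.5, steps=5):
    """Return candidate numPartitions values between low and high times the total cores.

    Hypothetical helper: the number of grid steps is an assumption, not a documented value.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    total_cores = executor_cores * num_executors
    candidates = []
    for i in range(steps):
        # Evenly spaced multipliers from low to high, inclusive.
        factor = low + (high - low) * i / (steps - 1)
        candidates.append(max(1, round(total_cores * factor)))
    # Small clusters can produce duplicates after rounding; keep unique values in order.
    return sorted(set(candidates))


def prefixspan_conf(local_timeout=300, filter_candidates=False, proj_db_step=10.0):
    """Map the three PrefixSpan settings to their Spark configuration keys (as strings)."""
    return {
        "spark.boostkit.ml.ps.localTimeout": str(local_timeout),  # seconds
        "spark.boostkit.ml.ps.filterCandidates": str(filter_candidates).lower(),
        "spark.boostkit.ml.ps.projDBStep": str(proj_db_step),
    }


# Assumed example cluster: 4 cores per executor, 12 executors, 48 cores in total.
grid = partition_grid(executor_cores=4, num_executors=12)
conf = prefixspan_conf()
print(grid)  # [24, 36, 48, 60, 72]
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Each candidate in `grid` would then be tried as numPartitions in a separate run, keeping the value with the shortest end-to-end time.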

ALS

This section describes how the ALS algorithm parameters affect model performance.

Parameter

Description

Suggestion

numPartitions

Number of Spark partitions. A large value creates many tasks, which increases the scheduling time. A value that is too small may leave some nodes without tasks and increases the data volume processed by each partition, which in turn raises the memory usage of each agent node.

Perform a grid search over values from 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).

blockMaxRow

Row block size of the Gram matrix, which is related to the L1 cache size and affects the calculation performance of the local matrix. This is a newly added parameter and is set by the spark.boostkit.ALS.blockMaxRow parameter.

Positive Int. The default value is 16. You are advised to retain the default value.

unpersistCycle

Unpersist period for srcFactorRDD. When the number of iterations reaches the specified value, the accumulated srcFactorRDD is unpersisted to release the memory. A smaller value indicates that the memory is released more frequently. This is a newly added parameter and is set by the spark.boostkit.ALS.unpersistCycle parameter.

Positive Int. The default value is 300. You are advised to retain the default value.
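As a rough sketch of how the two ALS-specific settings might be passed on the command line, the snippet below renders them as `spark-submit` `--conf` arguments and rejects values that are not positive integers. The helper name `als_conf_args` is hypothetical; the configuration keys and default values are taken from the table above.

```python
def als_conf_args(block_max_row=16, unpersist_cycle=300):
    """Render the two ALS settings as spark-submit --conf arguments.

    Hypothetical helper for illustration; defaults mirror the documented defaults.
    """
    settings = {
        "spark.boostkit.ALS.blockMaxRow": block_max_row,
        "spark.boostkit.ALS.unpersistCycle": unpersist_cycle,
    }
    args = []
    for key, value in settings.items():
        # Both parameters are documented as positive integers (bool is excluded explicitly).
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{key} must be a positive integer, got {value!r}")
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(als_conf_args()))
# --conf spark.boostkit.ALS.blockMaxRow=16 --conf spark.boostkit.ALS.unpersistCycle=300
```

Because the recommendation for both parameters is to retain the defaults, explicit flags like these are mainly useful for recording the configuration used in a benchmark run.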