Recommendation and Pattern Mining

PrefixSpan

This part describes the impact of PrefixSpan algorithm parameters on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).
localTimeout	Timeout interval for local resolution, in seconds. This is a newly added parameter and is set by the spark.boostkit.ml.ps.localTimeout parameter.	If local resolution takes longer than the other phases, set this parameter to a smaller value. The recommended value is 300.
filterCandidates	Whether to filter the prefix candidate set. This is a newly added parameter and is set by the spark.boostkit.ml.ps.filterCandidates parameter. Enabling it reduces the communication volume but increases the calculation workload.	Boolean. The default value is false.
projDBStep	Attenuation rate of the projection data volume. This is a newly added parameter and is set by the spark.boostkit.ml.ps.projDBStep parameter. A larger value reduces the calculation workload of local processing.	Double. The default value is 10. You are advised to retain the default value.
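
The suggestions in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the library: the helper names partition_candidates and to_submit_args, the number of grid steps, and the example cluster size (8 cores per executor, 12 executors) are hypothetical. Only the spark.boostkit.ml.ps.* keys and their values come from the table, and --conf key=value is the standard spark-submit flag.

```python
def partition_candidates(executor_cores, num_executors, steps=5):
    """Return candidate numPartitions values from 0.5x to 1.5x the total cores."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    total_cores = executor_cores * num_executors
    low, high = 0.5 * total_cores, 1.5 * total_cores
    # Evenly spaced values, rounded to integers, deduplicated, never below 1.
    values = {max(1, round(low + i * (high - low) / (steps - 1))) for i in range(steps)}
    return sorted(values)


def to_submit_args(conf):
    """Turn a configuration dict into spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Values taken from the table: recommended timeout and documented defaults.
prefixspan_conf = {
    "spark.boostkit.ml.ps.localTimeout": "300",
    "spark.boostkit.ml.ps.filterCandidates": "false",
    "spark.boostkit.ml.ps.projDBStep": "10",
}

# Hypothetical cluster: 8 cores per executor, 12 executors (96 cores in total).
candidates = partition_candidates(executor_cores=8, num_executors=12)
submit_args = to_submit_args(prefixspan_conf)

print(candidates)
print(" ".join(submit_args))
```

Each candidate would then be used as numPartitions in a separate run, keeping the value with the shortest run time.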

ALS

This part describes the impact of ALS algorithm parameters on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).
blockMaxRow	Row block size of the Gram matrix, which is related to the L1 cache size and affects the calculation performance of the local matrix. This is a newly added parameter and is set by the spark.boostkit.ALS.blockMaxRow parameter.	Positive Int. The default value is 16. You are advised to retain the default value.
unpersistCycle	Unpersist period for srcFactorRDD. When the number of iterations reaches the specified value, the accumulated srcFactorRDD is unpersisted to release the memory. A smaller value indicates that the memory is released more frequently. This is a newly added parameter and is set by the spark.boostkit.ALS.unpersistCycle parameter.	Positive Int. The default value is 300. You are advised to retain the default value.
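
If the ALS defaults ever need to be overridden, one option is to place the two keys in spark-defaults.conf, which uses whitespace-separated key and value pairs. The following Python sketch renders such entries and rejects values that are not positive integers, as the table requires. The helper name render_spark_defaults and the override value 100 are hypothetical and serve only as an illustration; the keys and defaults come from the table.

```python
# ALS tuning keys named in the table, with their documented defaults.
ALS_DEFAULTS = {
    "spark.boostkit.ALS.blockMaxRow": 16,
    "spark.boostkit.ALS.unpersistCycle": 300,
}


def render_spark_defaults(overrides=None):
    """Return spark-defaults.conf lines; values must be positive integers."""
    settings = dict(ALS_DEFAULTS)
    settings.update(overrides or {})
    lines = []
    for key, value in settings.items():
        if key not in ALS_DEFAULTS:
            raise KeyError(f"unknown ALS parameter: {key}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{key} must be a positive integer, got {value!r}")
        lines.append(f"{key} {value}")
    return lines


# Hypothetical override: release srcFactorRDD more often than the default.
als_lines = render_spark_defaults({"spark.boostkit.ALS.unpersistCycle": 100})
print("\n".join(als_lines))
```

Because retaining the defaults is the advised setting, an override like this is worth trying only when memory pressure from the accumulated srcFactorRDD is actually observed.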
