Clustering

K-means

This section describes the impact of K-means algorithm parameters on model performance.

| Parameter | Description | Suggestion |
| --- | --- | --- |
| numPartitions | Number of Spark partitions. If the number is too large, many tasks are created and scheduling time increases. If it is too small, some nodes may receive no tasks, and each partition processes more data, which increases the memory usage on each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores × num_executor). |
| k | Number of cluster centers. | Try multiple values to obtain a better clustering result. |
| maxIter | Maximum number of iterations. | Use the default value for a convex dataset. On a non-convex dataset the algorithm may fail to converge; in that case, set the maximum number of iterations so that the algorithm exits the loop in a timely manner. |
| initSteps | Number of times the algorithm is run with different initial centroids. | Running the algorithm multiple times helps find a result with a better clustering effect. The default value is 10 and usually does not need to be changed; increase it if k is large. |
| optMethod | Whether to enable sampling. This newly added parameter is set through spark.boostkit.Kmeans.optMethod. The value can be default (sampling enabled) or allData (sampling disabled). | The default value is default. |
| sampleRate | Ratio of the data used in each iteration to the full dataset. This newly added parameter affects both computing efficiency and clustering error: a smaller value improves computing efficiency but may increase the clustering error. It is set through spark.boostkit.Kmeans.sampleRate. | The default value is 0.05. |
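The two spark.boostkit.Kmeans.* properties above are ordinary Spark configuration entries, so they can be passed at submission time with --conf. The invocation below is an illustrative sketch only: the class name, jar path, and resource sizes are placeholders, and only the two boostkit properties come from this table.

```shell
# Illustrative spark-submit invocation. Class name, jar, and resource
# sizes are hypothetical; the spark.boostkit.* properties are the
# K-means tuning parameters described above.
spark-submit \
  --master yarn \
  --num-executors 12 \
  --executor-cores 4 \
  --conf spark.boostkit.Kmeans.optMethod=default \
  --conf spark.boostkit.Kmeans.sampleRate=0.05 \
  --class com.example.KMeansJob \
  kmeans-job.jar
```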

LDA

This section describes the impact of LDA algorithm parameters on model performance.

| Parameter | Description | Suggestion |
| --- | --- | --- |
| numPartitions | Number of Spark partitions. If the number is too large, many tasks are created and scheduling time increases. If it is too small, some nodes may receive no tasks, and each partition processes more data, which increases the memory usage on each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores × num_executor). |
| spark.task.cpus | Number of CPU cores allocated to each task. | Set it to the same value as executor_cores. |
| spark.driver.cores | Number of CPU cores allocated to the driver process. | Set it based on the number of CPU cores actually available. |
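The numPartitions suggestion for both algorithms is the same 0.5x to 1.5x rule. The helper below is a hypothetical convenience function (not part of any Spark API) that makes the resulting search grid concrete:

```python
def partition_candidates(executor_cores: int, num_executor: int,
                         step: float = 0.25):
    """Return candidate numPartitions values for a grid search, spanning
    0.5x to 1.5x the total core count (executor_cores * num_executor)."""
    total_cores = executor_cores * num_executor
    factors = []
    f = 0.5
    while f <= 1.5 + 1e-9:  # tolerance for floating-point accumulation
        factors.append(f)
        f += step
    # Deduplicate and keep every candidate at least 1.
    return sorted({max(1, round(total_cores * factor)) for factor in factors})

# Example: 6 executors with 4 cores each -> 24 total cores.
print(partition_candidates(4, 6))  # [12, 18, 24, 30, 36]
```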

DBSCAN

This section describes the impact of DBSCAN algorithm parameters on model performance.

| Parameter | Description | Suggestion |
| --- | --- | --- |
| numPartitions | Number of Spark partitions. | Set numPartitions to the number of executors. (You can decrease the number of executors and increase the resources of each executor to improve performance.) |
| epsilon | Maximum distance two neighbors can be from one another while still belonging to the same cluster. | The value must be greater than 0.0. |
| minPoints | Minimum number of neighbors of a given point. | A positive integer. |
| sampleRate | Sampling rate of the input data. The sampled data is used to partition the space of the full input data. | The value range is (0.0, 1.0]. The default value is 1.0, meaning the full input data is used. |
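To make the roles of epsilon and minPoints concrete, here is a minimal pure-Python sketch of the core-point test that DBSCAN applies. It is illustrative only: the distributed implementation does not scan point by point, and conventions differ on whether a point counts as its own neighbor (this sketch counts it).

```python
import math

def is_core_point(point, data, epsilon, min_points):
    """A point is a core point if at least min_points points (counting
    itself) lie within distance epsilon of it."""
    neighbors = sum(1 for other in data if math.dist(point, other) <= epsilon)
    return neighbors >= min_points

# Tiny example: a dense cluster near the origin plus one outlier.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(is_core_point((0.0, 0.0), data, epsilon=0.5, min_points=4))  # True
print(is_core_point((5.0, 5.0), data, epsilon=0.5, min_points=4))  # False
```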