Clustering
K-means
This section describes how the K-means algorithm parameters affect model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. If the number is too large, many tasks are created and scheduling time increases. If it is too small, some nodes may receive no tasks, and the data volume per partition grows, increasing the memory usage on each node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores × num_executor). |
| k | Number of cluster centers. | Try multiple values to obtain a better clustering result. |
| maxIter | Maximum number of iterations. | Use the default value for a convex dataset. On a non-convex dataset the algorithm may fail to converge; in that case, set a maximum number of iterations so that the algorithm exits the loop in time. |
| initSteps | Number of times the algorithm is run with different initial centroids. | Running the algorithm several times helps find initial centroids with a better clustering result. The default value is 10 and usually does not need to be changed; increase it when k is large. |
| optMethod | Whether to enable sampling. This newly added parameter is set through spark.boostkit.Kmeans.optMethod. The value can be default (sampling enabled) or allData (sampling disabled). | The default value is default. |
| sampleRate | Ratio of the data used in each iteration to the full dataset. This newly added parameter affects both computing efficiency and clustering error: a smaller value speeds up computation but may increase clustering error. It is set through spark.boostkit.Kmeans.sampleRate. | The default value is 0.05. |
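To make the sampleRate trade-off concrete, here is a minimal plain-Python sketch of K-means in which each Lloyd iteration updates the centers from a random sample of the data. This is not the BoostKit or Spark implementation; all names and values are illustrative only.

```python
import random

def kmeans_sampled(points, k, max_iter=20, sample_rate=0.05, seed=42):
    """Toy 2-D K-means where each iteration uses only a random sample of the
    data, mimicking the idea behind spark.boostkit.Kmeans.sampleRate."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    n = max(k, int(len(points) * sample_rate))   # sample size per iteration
    for _ in range(max_iter):
        batch = rng.sample(points, n)            # the sampled data for this pass
        sums, counts = [[0.0, 0.0] for _ in range(k)], [0] * k
        for x, y in batch:
            i = min(range(k),
                    key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            sums[i][0] += x
            sums[i][1] += y
            counts[i] += 1
        # Move each center to the mean of its assigned sample points; a center
        # that received no points this round keeps its previous position.
        centers = [(sums[i][0] / counts[i], sums[i][1] / counts[i])
                   if counts[i] else centers[i] for i in range(k)]
    return centers

# Two well-separated groups; sample_rate=1.0 reduces to ordinary Lloyd iterations.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = sorted(kmeans_sampled(pts, k=2, sample_rate=1.0))
```

Decreasing sample_rate below 1.0 makes each iteration cheaper at the cost of noisier center updates, which is the same efficiency-versus-error trade-off described for spark.boostkit.Kmeans.sampleRate above.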
LDA
This section describes how the LDA algorithm parameters affect model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. If the number is too large, many tasks are created and scheduling time increases. If it is too small, some nodes may receive no tasks, and the data volume per partition grows, increasing the memory usage on each node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores × num_executor). |
| spark.task.cpus | Number of CPU cores allocated to each task. | Keep it the same as executor_cores. |
| spark.driver.cores | Number of CPU cores allocated to the driver process. | Set it based on the number of CPU cores actually available. |
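As a sketch of how the settings above fit together, the following spark-submit fragment keeps spark.task.cpus equal to the executor cores and sizes spark.driver.cores to the driver node. The script name and resource numbers are illustrative assumptions, not values from this document.

```shell
# Illustrative settings for an LDA job on a hypothetical cluster with
# 6 executors x 4 cores each (24 cores in total).
spark-submit \
  --num-executors 6 \
  --executor-cores 4 \
  --conf spark.task.cpus=4 \
  --conf spark.driver.cores=4 \
  lda_job.py

# With 6 x 4 = 24 total cores, numPartitions would then be grid-searched
# over roughly 0.5-1.5x that figure, i.e. about 12 to 36.
```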
DBSCAN
This section describes how the DBSCAN algorithm parameters affect model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. | Set numPartitions to the number of executors. (You can decrease the number of executors and increase the resources of each executor to improve performance.) |
| epsilon | Maximum distance between two points for them to be considered neighbors in the same cluster. | The value must be greater than 0.0. |
| minPoints | Minimum number of neighbors a point must have to be a core point. | A positive integer. |
| sampleRate | Sampling rate of the input data. The space of the full input data is partitioned based on the sampled data. | The value range is (0.0, 1.0]. The default value is 1.0, meaning the full input data is used. |
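The meaning of epsilon and minPoints can be illustrated with a minimal single-machine DBSCAN sketch. This is not the distributed Spark implementation (which partitions the space across executors); as an assumption of this sketch, a point counts as its own neighbor.

```python
from collections import deque

def dbscan(points, epsilon, min_points):
    """Naive 2-D DBSCAN illustrating the epsilon/minPoints semantics."""
    n = len(points)

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= epsilon ** 2]

    labels = [None] * n          # None = unvisited, -1 = noise, >= 0 = cluster id
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_points:
            labels[i] = -1       # not a core point: mark as noise for now
            continue
        labels[i] = cluster      # i is a core point: grow a new cluster from it
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:  # noise reachable from a core point: border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_points:   # j is itself a core point: keep expanding
                queue.extend(jn)
        cluster += 1
    return labels

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # dense group
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),   # second dense group
       (9.0, 0.0)]                            # isolated point
labels = dbscan(pts, epsilon=0.5, min_points=3)
```

With epsilon=0.5 and minPoints=3, the two dense groups form separate clusters and the isolated point is left as noise; shrinking epsilon or raising minPoints turns more points into noise, matching the parameter descriptions in the table above.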