Clustering
DBSCAN
This section describes how the DBSCAN algorithm parameters affect model performance. The default configuration file directory is $KAL_TEST/conf/ml/dbscan, in which $KAL_TEST/conf/ is the kal-test tool deployment directory.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. | It is recommended that the value of numPartitions be the same as the number of executors. (You can decrease the number of executors and increase the resource configuration of a single executor to improve the performance.) |
| epsilon | Maximum distance two neighbors can be from one another while still belonging to the same cluster. | The value is greater than 0.0. |
| minPoints | Minimum number of neighbors of a given point. | Positive integer. |
| sampleRate | Sampling rate of the input data. It is used to divide the space of the full input data based on the sampling data. | The value range is (0.0, 1.0]. The default value is 1.0, indicating that full input data is used by default. |
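
The constraints in the table can be checked before a job is submitted. The following is a minimal sketch in plain Python, not part of the kal-test tool: the `DbscanParams` class, the `validate` function, and the `num_executors` argument are hypothetical names introduced only for illustration. It assumes the parameter values have already been read from the configuration file.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DbscanParams:
    """Hypothetical container for the four parameters in the table."""
    num_partitions: int
    epsilon: float
    min_points: int
    sample_rate: float = 1.0  # table default: the full input data is used


def validate(params: DbscanParams, num_executors: int) -> List[str]:
    """Return a list of findings; an empty list means all checks passed."""
    findings = []
    # epsilon: the value must be greater than 0.0
    if params.epsilon <= 0.0:
        findings.append("epsilon must be greater than 0.0")
    # minPoints: positive integer
    if params.min_points < 1:
        findings.append("minPoints must be a positive integer")
    # sampleRate: value range is (0.0, 1.0]
    if not (0.0 < params.sample_rate <= 1.0):
        findings.append("sampleRate must be in the range (0.0, 1.0]")
    # numPartitions: a recommendation rather than a hard constraint
    if params.num_partitions != num_executors:
        findings.append(
            "numPartitions is recommended to equal the number of executors"
        )
    return findings


if __name__ == "__main__":
    # Example values are made up for illustration only.
    candidate = DbscanParams(num_partitions=8, epsilon=0.5, min_points=10)
    for finding in validate(candidate, num_executors=8):
        print(finding)
```

Returning a list of findings instead of raising on the first failure makes it possible to report every out-of-range value in one pass. The numPartitions check is reported the same way, although the table states it only as a recommendation.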