Clustering

DBSCAN

This section describes how the DBSCAN algorithm parameters affect model performance. The default configuration files are stored in $KAL_TEST/conf/ml/dbscan, where $KAL_TEST/conf/ is the kal-test tool deployment directory.

| Parameter | Description | Suggestion |
| --- | --- | --- |
| numPartitions | Number of Spark partitions. | Set numPartitions to the number of executors. (To improve performance, you can use fewer executors and allocate more resources to each one.) |
| epsilon | Maximum distance between two neighbors that still belong to the same cluster. | Must be greater than 0.0. |
| minPoints | Minimum number of neighbors of a given point. | Must be a positive integer. |
| sampleRate | Sampling rate of the input data. The sampled data is used to divide the space of the full input data. | The value range is (0.0, 1.0]. The default value is 1.0, which means the full input data is used. |
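
The value ranges listed for these parameters can be checked before a job is submitted. The following is a minimal Python sketch, not part of kal-test: the `DbscanParams` class, the `validate` function, and the `neighbor_count` helper are hypothetical names introduced here for illustration. The helper assumes Euclidean distance, an inclusive `<=` comparison against epsilon, and that a point is not counted as its own neighbor; the library may define these details differently.

```python
import math
from dataclasses import dataclass


@dataclass
class DbscanParams:
    # Hypothetical container; the fields mirror the documented parameter names.
    num_partitions: int
    epsilon: float
    min_points: int
    sample_rate: float = 1.0  # documented default: use the full input data


def validate(params: DbscanParams, num_executors: int) -> list[str]:
    """Return a list of problems; an empty list means every documented range holds."""
    problems = []
    if params.epsilon <= 0.0:
        problems.append("epsilon must be greater than 0.0")
    if not isinstance(params.min_points, int) or params.min_points <= 0:
        problems.append("minPoints must be a positive integer")
    if not (0.0 < params.sample_rate <= 1.0):
        problems.append("sampleRate must be in (0.0, 1.0]")
    if params.num_partitions != num_executors:
        # This is a tuning suggestion, not a hard constraint.
        problems.append("numPartitions should match the number of executors")
    return problems


def neighbor_count(points, index, epsilon):
    """Count the other points within epsilon of points[index] (assumed Euclidean)."""
    center = points[index]
    return sum(
        1
        for i, p in enumerate(points)
        if i != index and math.dist(center, p) <= epsilon
    )
```

For example, `validate(DbscanParams(num_partitions=4, epsilon=0.5, min_points=3), num_executors=4)` returns an empty list. Under the assumptions above, a point whose `neighbor_count` is at least minPoints would have enough neighbors to satisfy the minPoints condition.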