DBSCAN
| Model API Type | Function API |
|---|---|
| ML API | def fitPredict(dataset: Dataset[_]): DataFrame |
ML API
- Input and output
- Package name: org.apache.spark.ml.clustering
- Class name: DBSCAN
- Method name: fitPredict
- Input: training sample data (Dataset[_]). The following are mandatory fields.
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| featuresCol | Vector | features | Feature vector |
- Parameters optimized based on native algorithms
def setMinPoints(value: Int): DBSCAN.this.type
def setEpsilon(value: Double): DBSCAN.this.type
def setSampleRate(value: Double): DBSCAN.this.type
- epsilon indicates the maximum distance two neighbors can be from one another while still belonging to the same cluster. Its value must be greater than 0.0.
- minPoints indicates the minimum number of neighbors a point needs in order to count as a core sample. Its value must be greater than 1.
- sampleRate indicates the sampling rate of the input data. It is used to divide the space of the full input data based on the sampled data. The value range is (0.0, 1.0]. The default value is 1.0, indicating that the full input data is used by default.
Code API example:
val model = new DBSCAN()
  .setEpsilon(params.epsilon)
  .setMinPoints(params.minPoints)
  .setSampleRate(params.sampleRate)
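The value ranges listed above can be checked before the setters are called. The following is a minimal sketch in plain Scala; `DbscanParamCheck`, `Params`, and `violations` are hypothetical names introduced for illustration and are not part of the DBSCAN API. Whether the library itself rejects out-of-range values, and how, is not stated here.

```scala
// Hypothetical helper, not part of the DBSCAN API: it only encodes the
// documented ranges for epsilon, minPoints, and sampleRate.
object DbscanParamCheck {
  // sampleRate defaults to 1.0, matching the documented default.
  final case class Params(epsilon: Double, minPoints: Int, sampleRate: Double = 1.0)

  /** Returns one message per violated constraint; an empty list means all values are in range. */
  def violations(p: Params): List[String] = {
    val checks = List(
      (p.epsilon > 0.0) -> s"epsilon must be greater than 0.0, got ${p.epsilon}",
      (p.minPoints > 1) -> s"minPoints must be greater than 1, got ${p.minPoints}",
      (p.sampleRate > 0.0 && p.sampleRate <= 1.0) ->
        s"sampleRate must be in (0.0, 1.0], got ${p.sampleRate}"
    )
    // Keep only the messages whose condition failed.
    checks.collect { case (ok, msg) if !ok => msg }
  }
}
```

A caller could run `DbscanParamCheck.violations(...)` on user-supplied values and only build the `DBSCAN` instance when the returned list is empty.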
- Output: clustering result. The following table lists the output fields.
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| predictionCol | Int | prediction | Category of the sample: -1 for a noise sample, 0 for a core sample, 1 for a border sample. |
| labelCol | Int | label | Cluster ID of the sample. For noise samples, the cluster ID is -1 by default. For core and border samples, the cluster ID is greater than or equal to 0. |
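To illustrate how the two output fields fit together, here is a minimal sketch in plain Scala that summarizes a set of result rows. `OutputRow`, `Summary`, and `summarize` are hypothetical names; the sketch assumes the rows have already been collected from the result DataFrame into a local collection and relies only on the value encodings described above.

```scala
// Hypothetical types for illustration; the real output is a DataFrame with
// the columns "prediction" and "label".
object DbscanOutputSummary {
  final case class OutputRow(prediction: Int, label: Int)
  final case class Summary(clusters: Int, core: Int, border: Int, noise: Int)

  // Sample categories as documented for the prediction column.
  val Noise = -1
  val Core = 0
  val Border = 1

  def summarize(rows: Seq[OutputRow]): Summary = Summary(
    // Cluster IDs are >= 0; the noise label -1 is not counted as a cluster.
    clusters = rows.map(_.label).filter(_ >= 0).distinct.size,
    core = rows.count(_.prediction == Core),
    border = rows.count(_.prediction == Border),
    noise = rows.count(_.prediction == Noise)
  )
}
```

For example, three core samples and one border sample spread over cluster IDs 0 and 1, plus one noise sample, would be summarized as two clusters with counts 3, 1, and 1.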
- Example
val dbscan = new DBSCAN()
  .setEpsilon(0.2)
  .setMinPoints(3)
  .setSampleRate(1.0)
val result = dbscan.fitPredict(trainData)
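The example above assumes that `trainData` already exists. The following sketch shows one way such an input could be built with standard Spark APIs (`SparkSession`, `Vectors.dense`, `toDF`), producing a single Vector column named `features` as required by the input table. `DbscanInputSketch`, `buildInput`, and the sample points are hypothetical, and the sketch assumes Spark SQL and Spark MLlib are on the classpath.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper for illustration; only standard Spark calls are used.
object DbscanInputSketch {
  /** Builds a DataFrame with one Vector column named "features". */
  def buildInput(spark: SparkSession, points: Seq[Array[Double]]): DataFrame = {
    import spark.implicits._
    // Wrap each vector in Tuple1 so it becomes a single-column row.
    points.map(p => Tuple1(Vectors.dense(p))).toDF("features")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dbscan-input-sketch")
      .master("local[*]") // local mode, chosen only for this sketch
      .getOrCreate()

    // Made-up 2-D points: three close together and one far away.
    val trainData = buildInput(spark, Seq(
      Array(0.0, 0.0), Array(0.1, 0.1), Array(0.1, 0.0), Array(5.0, 5.0)
    ))
    trainData.printSchema()

    // trainData can then be passed to dbscan.fitPredict(trainData), provided the
    // library that supplies org.apache.spark.ml.clustering.DBSCAN is available.
    spark.stop()
  }
}
```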