
DBSCAN

Model API Type | Function API
ML API | def fitPredict(dataset: Dataset[_]): DataFrame

ML API

  • Function description

    Takes sample data as a Dataset, runs DBSCAN clustering through the fitPredict API, and returns the clustering result as a DataFrame.

  • Input and output
    1. Package name: org.apache.spark.ml.clustering
    2. Class name: DBSCAN
    3. Method name: fitPredict
    4. Input: training sample data (Dataset[_]). The following are mandatory fields.

      Parameter | Value Type | Default Value | Description
      featuresCol | Vector | features | Feature vector

    5. Parameters optimized based on native algorithms
      def setMinPoints(value: Int): DBSCAN.this.type
      def setEpsilon(value: Double): DBSCAN.this.type
      def setSampleRate(value: Double): DBSCAN.this.type
      
      1. epsilon is the maximum distance between two points for them to be treated as neighbors in the same cluster. Its value must be greater than 0.0.
      2. minPoints is the minimum number of neighbors a point needs within epsilon to count as a core sample. Its value must be greater than 1.
      3. sampleRate is the fraction of the input data that is sampled. The sampled data is used to partition the space of the full input data. The value range is (0.0, 1.0]. The default value is 1.0, which means the full input data is used.

        Code API example:

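        import org.apache.spark.ml.clustering.DBSCAN

        // params is a caller-supplied object that holds the three settings
        val model = new DBSCAN()
          .setEpsilon(params.epsilon)
          .setMinPoints(params.minPoints)
          .setSampleRate(params.sampleRate)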
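
The documented value ranges can be checked before the setters are called. The following is a minimal sketch in plain Scala; `DbscanParams` and `DbscanParamCheck` are hypothetical names introduced here for illustration and are not part of the DBSCAN API.

```scala
// Hypothetical container for the three documented settings (not a library class).
final case class DbscanParams(epsilon: Double, minPoints: Int, sampleRate: Double = 1.0)

object DbscanParamCheck {
  /** Returns Right(params) when every documented range holds, else Left(error messages). */
  def validate(p: DbscanParams): Either[List[String], DbscanParams] = {
    val errors: List[String] = List(
      // epsilon must be greater than 0.0
      if (p.epsilon > 0.0) None
      else Some(s"epsilon must be > 0.0, got ${p.epsilon}"),
      // minPoints must be greater than 1
      if (p.minPoints > 1) None
      else Some(s"minPoints must be > 1, got ${p.minPoints}"),
      // sampleRate must lie in (0.0, 1.0]
      if (p.sampleRate > 0.0 && p.sampleRate <= 1.0) None
      else Some(s"sampleRate must be in (0.0, 1.0], got ${p.sampleRate}")
    ).flatten
    if (errors.isEmpty) Right(p) else Left(errors)
  }
}
```

For instance, `DbscanParamCheck.validate(DbscanParams(0.2, 3))` yields a `Right`, while an epsilon of 0.0 yields a `Left` whose message names the violated range. Collecting all messages rather than stopping at the first one lets a caller report every bad setting at once.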
        
    6. Output: clustering result. The following table lists the output fields.

      Parameter | Value Type | Default Value | Description
      predictionCol | Int | prediction | Category of the sample: -1 (noise sample), 0 (core sample), or 1 (border sample).
      labelCol | Int | label | Cluster ID of the sample: -1 by default for noise samples; greater than or equal to 0 for core and border samples.
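
The two output columns can be decoded on the driver once a small result has been collected. The sketch below is plain Scala and assumes each row has already been reduced to a (prediction, label) pair; `SampleKind`, `DbscanOutput`, `kindOf`, and `clusterSizes` are hypothetical helpers introduced for illustration, not part of the library.

```scala
// Hypothetical names for the three documented sample categories.
sealed trait SampleKind
case object Noise extends SampleKind
case object Core extends SampleKind
case object Border extends SampleKind

object DbscanOutput {
  /** Decodes the documented prediction codes: -1 noise, 0 core, 1 border. */
  def kindOf(prediction: Int): Option[SampleKind] = prediction match {
    case -1 => Some(Noise)
    case 0  => Some(Core)
    case 1  => Some(Border)
    case _  => None // not a documented code
  }

  /** Counts samples per cluster ID; noise samples (label -1) are left out. */
  def clusterSizes(rows: Seq[(Int, Int)]): Map[Int, Int] =
    rows
      .collect { case (_, label) if label >= 0 => label }
      .groupBy(identity)
      .map { case (label, members) => label -> members.size }
}
```

With the rows `Seq((0, 0), (1, 0), (0, 1), (-1, -1))`, `clusterSizes` returns `Map(0 -> 2, 1 -> 1)`: two samples in cluster 0, one in cluster 1, and the noise sample excluded.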
  • Example
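    import org.apache.spark.ml.clustering.DBSCAN

    // trainData: Dataset[_] with a Vector column named "features"
    val dbscan = new DBSCAN()
      .setEpsilon(0.2)
      .setMinPoints(3)
      .setSampleRate(1.0)
    // result: DataFrame with the "prediction" and "label" columns
    val result = dbscan.fitPredict(trainData)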
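
To make the effect of epsilon = 0.2 and minPoints = 3 concrete, the following plain-Scala sketch classifies a few 2-D points by brute force. It is an illustration of the general DBSCAN idea, not the library's implementation. It rests on two assumptions the source does not confirm: distance is Euclidean, and a point's neighbors are the other points within epsilon (the point itself is not counted). `DbscanToy` and `classify` are hypothetical names.

```scala
object DbscanToy {
  type Point = (Double, Double)

  // Assumption: Euclidean distance; the source does not name the metric.
  def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  /** Returns one code per point, using the documented codes: 0 core, 1 border, -1 noise. */
  def classify(points: Vector[Point], epsilon: Double, minPoints: Int): Vector[Int] = {
    val indices = points.indices.toVector
    // Assumption: neighbors of i are the *other* points within epsilon of i.
    val neighbors: Vector[Vector[Int]] = indices.map { i =>
      indices.filter(j => j != i && dist(points(i), points(j)) <= epsilon)
    }
    // A point is core when it has at least minPoints neighbors.
    val isCore: Vector[Boolean] = neighbors.map(_.size >= minPoints)
    indices.map { i =>
      if (isCore(i)) 0                                   // core sample
      else if (neighbors(i).exists(j => isCore(j))) 1    // border: near a core sample
      else -1                                            // noise
    }
  }
}
```

For the points (0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.25, 0.1), and (1, 1), `DbscanToy.classify(points, 0.2, 3)` returns `Vector(0, 0, 0, 0, 1, -1)`: the four tightly packed points are core samples, (0.25, 0.1) has only two neighbors but lies near a core sample and so is a border sample, and (1, 1) is noise.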