DTB

The DTB (decision tree bucketizer) algorithm provides an ML API.

Model API Type | Function API
-------------- | ------------
ML DTB API     | def fit(dataset: Dataset[_]): DecisionTreeBucketModel
               | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[DecisionTreeBucketModel]
               | def fit(dataset: Dataset[_], paramMap: ParamMap): DecisionTreeBucketModel
               | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DecisionTreeBucketModel
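
The four fit overloads follow the usual Spark ML Estimator pattern. The sketch below shows how each might be called. It assumes that DecisionTreeBucketizer exposes maxDepth and maxBins as Param members (inferred from the setMaxDepth and setMaxBins setters, not confirmed here); the object name DtbFitSketch and the override values are illustrative only. It needs Spark and the BoostKit ML library on the classpath.

```scala
import org.apache.spark.ml.feature.DecisionTreeBucketizer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

object DtbFitSketch {
  // trainingData is assumed to hold a Double "label" column
  // and a Vector "features" column.
  def fitVariants(trainingData: Dataset[_]): Unit = {
    val dtb = new DecisionTreeBucketizer()
      .setLabelCol("label")
      .setFeaturesCol("features")

    // 1. Train with the parameters currently set on the estimator.
    val base = dtb.fit(trainingData)

    // 2. Train with per-call overrides collected in a ParamMap.
    //    dtb.maxDepth and dtb.maxBins are assumed Param members.
    val overrides = ParamMap(dtb.maxDepth -> 5, dtb.maxBins -> 32)
    val tuned = dtb.fit(trainingData, overrides)

    // 3. The same overrides passed directly as ParamPairs.
    val tunedByPairs = dtb.fit(trainingData, dtb.maxDepth -> 5, dtb.maxBins -> 32)

    // 4. One model per ParamMap, returned in the same order.
    val grid = Array(ParamMap(dtb.maxDepth -> 3), ParamMap(dtb.maxDepth -> 5))
    val models = dtb.fit(trainingData, grid)
  }
}
```

Overrides passed to fit apply to that call only; the values set on the estimator itself are left unchanged.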

ML DTB API

  • Function description

    Takes labeled sample data in Dataset format, runs training through the fit API, and outputs the DTB model (DecisionTreeBucketModel).

  • Input and output
    1. Package name: org.apache.spark.ml.feature
    2. Class name: DecisionTreeBucketizer
    3. Method name: fit
    4. Input: labeled sample data (Dataset[_]). The following fields are mandatory.

      Parameter   | Value Type | Default Value | Description
      ----------- | ---------- | ------------- | -----------
      labelCol    | Double     | "label"       | Column that contains the labels
      featuresCol | Vector     | "features"    | Column that contains the features

    5. Parameters optimized based on native algorithms
      def setBucketedFeaturesCol(value: String): DecisionTreeBucketizer.this.type
      def setCheckpointInterval(value: Int): DecisionTreeBucketizer.this.type
      def setFeaturesCol(value: String): DecisionTreeBucketizer.this.type
      def setImpurity(value: String): DecisionTreeBucketizer.this.type
      def setLabelCol(value: String): DecisionTreeBucketizer.this.type
      def setMaxBins(value: Int): DecisionTreeBucketizer.this.type
      def setMaxDepth(value: Int): DecisionTreeBucketizer.this.type
      def setMinInfoGain(value: Double): DecisionTreeBucketizer.this.type
      def setMinInstancesPerNode(value: Int): DecisionTreeBucketizer.this.type
      def setSeed(value: Long): DecisionTreeBucketizer.this.type

      You can change the value range of MaxBins to [0, 65535].

    6. Newly added parameters

      Parameter                   | Value Type                                                                                           | Description                                        | spark conf Parameter
      --------------------------- | ---------------------------------------------------------------------------------------------------- | -------------------------------------------------- | --------------------
      numTrainingDataCopies       | Integer. Must be greater than or equal to 1. Default: 1.                                             | Number of training data copies                     | spark.boostkit.ml.rf.numTrainingDataCopies
      broadcastVariables          | Boolean. Default: false.                                                                             | Whether to broadcast variables with large storage space | spark.boostkit.ml.rf.broadcastVariables
      numPartsPerTrainingDataCopy | Integer. Must be greater than or equal to 0. Default: 0, indicating that re-partitioning is not performed. | Number of partitions of a single training data copy | spark.boostkit.ml.rf.numPartsPerTrainingDataCopy
      binnedFeaturesDataType      | String. The value can be array (default) or fasthashmap.                                             | Storage format of features in training sample data | spark.boostkit.ml.rf.binnedFeaturesDataType

      Multiple training data copies (spark.boostkit.ml.rf.numTrainingDataCopies greater than 1) and cacheNodeIds cannot be enabled at the same time. If numTrainingDataCopies is greater than 1 and cacheNodeIds is set to true, cacheNodeIds is forcibly changed to false and a warning is generated.

    7. Output: decision tree binning model (DecisionTreeBucketModel). The following are fields output in model inference:

      Parameter           | Value Type | Default Value      | Description
      ------------------- | ---------- | ------------------ | -----------
      bucketedFeaturesCol | Vector     | "bucketedFeatures" | Column that contains the features after binning
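
The newly added parameters in item 6 are set through Spark configuration keys rather than setters. As a rough sketch, the helper below checks the documented constraints and assembles the four key/value pairs. DtbConfSketch and its methods are hypothetical names introduced for illustration and are not part of BoostKit; effectiveCacheNodeIds only mirrors the documented rule about cacheNodeIds and does not call the library.

```scala
// Hypothetical helper (not part of BoostKit): builds the documented
// spark.boostkit.ml.rf.* settings after checking the documented constraints.
object DtbConfSketch {
  private val Prefix = "spark.boostkit.ml.rf."

  def build(
      numTrainingDataCopies: Int = 1,
      broadcastVariables: Boolean = false,
      numPartsPerTrainingDataCopy: Int = 0,
      binnedFeaturesDataType: String = "array"): Map[String, String] = {
    require(numTrainingDataCopies >= 1, "numTrainingDataCopies must be >= 1")
    require(numPartsPerTrainingDataCopy >= 0, "numPartsPerTrainingDataCopy must be >= 0")
    require(
      Set("array", "fasthashmap").contains(binnedFeaturesDataType),
      "binnedFeaturesDataType must be array or fasthashmap")
    Map(
      Prefix + "numTrainingDataCopies" -> numTrainingDataCopies.toString,
      Prefix + "broadcastVariables" -> broadcastVariables.toString,
      Prefix + "numPartsPerTrainingDataCopy" -> numPartsPerTrainingDataCopy.toString,
      Prefix + "binnedFeaturesDataType" -> binnedFeaturesDataType)
  }

  // Mirrors the documented rule: with more than one training data copy,
  // cacheNodeIds is forced to false.
  def effectiveCacheNodeIds(numTrainingDataCopies: Int, cacheNodeIds: Boolean): Boolean =
    cacheNodeIds && numTrainingDataCopies <= 1

  // Renders the settings as spark-submit arguments, sorted by key.
  def toSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

For instance, the map returned by DtbConfSketch.build(numTrainingDataCopies = 2) could be applied with SparkConf.setAll before the session is created, or rendered with toSubmitArgs and passed to spark-submit.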

  • Example
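    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.DecisionTreeBucketizer

    // maxBins, maxDepth, and trainingData are assumed to be defined by the caller.
    val dtb = new DecisionTreeBucketizer()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxBins(maxBins)
      .setMaxDepth(maxDepth)
    // Wrap the bucketizer in a single-stage pipeline and train it.
    val pipeline = new Pipeline().setStages(Array(dtb))
    val model = pipeline.fit(trainingData)
    // Append the bucketedFeatures column to the input data.
    val bucketedData = model.transform(trainingData)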
    
  • Result
    +-----+--------------------------------+---------------------+
    |label|features                        |bucketedFeatures     |
    +-----+--------------------------------+---------------------+
    |1    |(1000,[0.3,0.2,1.3,30.0,12.0,...|(1000,[0,11,7,3,2,...|
    +-----+--------------------------------+---------------------+
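
In the result above, each continuous feature value is replaced by an integer bucket index. One way to picture that mapping is sketched below, assuming each feature has its own list of split thresholds. The thresholds, the names BucketSketch, bucketOf, and bucketize, and the boundary convention (a value equal to a threshold stays in the lower bucket) are illustrative assumptions, not BoostKit's implementation; how the model derives its split points is not covered here.

```scala
object BucketSketch {
  // Bucket index = number of split thresholds strictly below the value,
  // so a value equal to a threshold stays in the lower bucket (assumed convention).
  def bucketOf(value: Double, thresholds: Seq[Double]): Int =
    thresholds.count(_ < value)

  // Buckets each feature with its own list of thresholds.
  def bucketize(features: Seq[Double], splits: Seq[Seq[Double]]): Seq[Int] = {
    require(features.length == splits.length, "one threshold list per feature")
    features.zip(splits).map { case (v, t) => bucketOf(v, t) }
  }
}
```

With hypothetical thresholds Seq(0.5, 1.0, 10.0), a value of 0.3 falls into bucket 0 and a value of 30.0 into bucket 3.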