DTB

The DTB algorithm provides ML APIs.

Model API Type	Function API
ML DTB API	def fit(dataset: Dataset[_]): DecisionTreeBucketModel
	def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[DecisionTreeBucketModel]
	def fit(dataset: Dataset[_], paramMap: ParamMap): DecisionTreeBucketModel
	def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DecisionTreeBucketModel

ML DTB API

Function description
Takes labeled sample data in Dataset format as input, trains on it through the fit API, and outputs a DTB model.
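
The fit overloads that accept parameter maps can be used to train several models in one call. The following is a minimal sketch, not a definitive usage pattern: the helper name fitPerDepth is hypothetical, and it assumes that DecisionTreeBucketizer exposes maxDepth as a Param object in the same way Spark's decision tree estimators do.

```scala
import org.apache.spark.ml.feature.DecisionTreeBucketizer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Hypothetical helper, not part of the library: trains one model per candidate depth.
// Assumption: DecisionTreeBucketizer exposes `maxDepth` as a Param,
// as Spark's decision tree estimators do.
def fitPerDepth(trainingData: Dataset[_], depths: Seq[Int]) = {
  val dtb = new DecisionTreeBucketizer()
    .setLabelCol("label")
    .setFeaturesCol("features")
  // One ParamMap per candidate; each overrides maxDepth for a single fit.
  val paramMaps = depths.map(d => ParamMap(dtb.maxDepth -> d)).toArray
  // fit(dataset, paramMaps) returns one model per ParamMap.
  dtb.fit(trainingData, paramMaps)
}
```

Calling fitPerDepth(trainingData, Seq(3, 5)) would then return two models, which can be compared on held-out data before one is chosen.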

Input and output

Package name: org.apache.spark.ml.feature
Class name: DecisionTreeBucketizer
Method name: fit
Input: labeled sample data (Dataset[_]). The following fields are mandatory.
Parameter	Value Type	Default Value	Description
labelCol	Double	"label"	Column that contains the labels
featuresCol	Vector	"features"	Column that contains the features

Parameters optimized based on native algorithms

def setBucketedFeaturesCol(value: String): DecisionTreeBucketizer.this.type
def setCheckpointInterval(value: Int): DecisionTreeBucketizer.this.type
def setFeaturesCol(value: String): DecisionTreeBucketizer.this.type
def setImpurity(value: String): DecisionTreeBucketizer.this.type
def setLabelCol(value: String): DecisionTreeBucketizer.this.type
def setMaxBins(value: Int): DecisionTreeBucketizer.this.type
def setMaxDepth(value: Int): DecisionTreeBucketizer.this.type
def setMinInfoGain(value: Double): DecisionTreeBucketizer.this.type
def setMinInstancesPerNode(value: Int): DecisionTreeBucketizer.this.type
def setSeed(value: Long): DecisionTreeBucketizer.this.type

You can change the value range of MaxBins to [0, 65535].

Newly added parameters

Parameter	Value Type	Description	spark conf Parameter
numTrainingDataCopies	Integer type. The value must be greater than or equal to 1. The default value is 1.	Number of training data copies	spark.boostkit.ml.rf.numTrainingDataCopies
broadcastVariables	Boolean type. The default value is false.	Whether to broadcast variables with large storage space	spark.boostkit.ml.rf.broadcastVariables
numPartsPerTrainingDataCopy	Integer type. The value must be greater than or equal to 0. The default value is 0, indicating that re-partitioning is not performed.	Number of partitions of a single training data copy	spark.boostkit.ml.rf.numPartsPerTrainingDataCopy
binnedFeaturesDataType	String type. The value can be array (default) or fasthashmap.	Storage format of features in training sample data	spark.boostkit.ml.rf.binnedFeaturesDataType
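
The spark conf parameters in the table above can be supplied when the Spark configuration is built. The following is a minimal sketch under stated assumptions: the application name is hypothetical, and the values are illustrative choices within the documented ranges rather than recommended settings.

```scala
import org.apache.spark.SparkConf

// Illustrative values only; each key comes from the table of newly added parameters.
val conf = new SparkConf()
  .setAppName("dtb-example") // hypothetical application name
  .set("spark.boostkit.ml.rf.numTrainingDataCopies", "2")       // must be >= 1, default 1
  .set("spark.boostkit.ml.rf.broadcastVariables", "true")       // default false
  .set("spark.boostkit.ml.rf.numPartsPerTrainingDataCopy", "0") // 0 = no re-partitioning
  .set("spark.boostkit.ml.rf.binnedFeaturesDataType", "array")  // array or fasthashmap
```

The same keys can also be passed on the command line with the --conf option of spark-submit.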

Multiple training data copies (spark.boostkit.ml.rf.numTrainingDataCopies) and cacheNodeIds cannot be enabled at the same time. If numTrainingDataCopies is greater than 1 and cacheNodeIds is set to true, cacheNodeIds is forcibly reset to false and a warning is generated.
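
The interaction between the two settings can be written out as a small function. This is only an illustration of the documented rule: effectiveCacheNodeIds is a hypothetical helper, not a library API, and the warning text is invented for the sketch.

```scala
// Hypothetical helper mirroring the documented rule; it is not a library API.
def effectiveCacheNodeIds(numTrainingDataCopies: Int, cacheNodeIds: Boolean): Boolean = {
  // Documented range: numTrainingDataCopies must be greater than or equal to 1.
  require(numTrainingDataCopies >= 1, "numTrainingDataCopies must be >= 1")
  if (numTrainingDataCopies > 1 && cacheNodeIds) {
    // The library is documented to generate a warning here; this sketch just prints one.
    Console.err.println("cacheNodeIds is reset to false because numTrainingDataCopies > 1")
    false
  } else {
    cacheNodeIds
  }
}
```

With a single training data copy the requested cacheNodeIds value is kept; with two or more copies it always resolves to false.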

Output: decision tree binning model (DecisionTreeBucketModel). The following field is output during model inference:
Parameter	Value Type	Default Value	Description
bucketedFeaturesCol	Vector	"bucketedFeatures"	Column that contains the features after binning
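
Because the binned features are stored as a Vector column, they can be read back with the usual Spark row accessors. The following is a minimal sketch: firstRowBins is a hypothetical helper, and it assumes the column holds org.apache.spark.ml.linalg.Vector values under the default name "bucketedFeatures".

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame

// Hypothetical helper: returns the bin indices of the first row as a plain array.
// Assumption: the output column holds ml.linalg.Vector values.
def firstRowBins(bucketedData: DataFrame, colName: String = "bucketedFeatures"): Array[Double] =
  bucketedData.select(colName).head().getAs[Vector](0).toArray
```

If setBucketedFeaturesCol was used to rename the output column, pass that name as the second argument.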

Example

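import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.DecisionTreeBucketizer

// Example values only; tune maxBins and maxDepth for your data.
val maxBins = 32
val maxDepth = 5

// trainingData is a Dataset that contains the "label" and "features" columns.
val dtb = new DecisionTreeBucketizer()
   .setLabelCol("label")
   .setFeaturesCol("features")
   .setMaxBins(maxBins)
   .setMaxDepth(maxDepth)
// Fit the bucketizer as a single-stage pipeline, then bin the features.
val pipeline = new Pipeline().setStages(Array(dtb))
val model = pipeline.fit(trainingData)
val bucketedData = model.transform(trainingData)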

Result

+-----+--------------------------------+---------------------+
|label|features                        |bucketedFeatures     |
+-----+--------------------------------+---------------------+
|    1|(1000,[0.3,0.2,1.3,30.0,12.0,...|(1000,[0,11,7,3,2,...|
+-----+--------------------------------+---------------------+
