DTB
The DTB algorithm provides ML APIs.
| Model API Type | Function API |
|---|---|
| ML DTB API | def fit(dataset: Dataset[_]): DecisionTreeBucketModel |
| | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[DecisionTreeBucketModel] |
| | def fit(dataset: Dataset[_], paramMap: ParamMap): DecisionTreeBucketModel |
| | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DecisionTreeBucketModel |
ML DTB API
- Input and output
- Package name: package org.apache.spark.ml.feature
- Class name: DecisionTreeBucketizer
- Method name: fit
- Input: document data (Dataset[_]). The following are mandatory fields.
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| labelCol | Double | "label" | Column that contains the labels |
| featuresCol | Vector | "features" | Column that contains the features |
- Parameters optimized based on native algorithms
def setBucketedFeaturesCol(value: String): DecisionTreeBucketizer.this.type
def setCheckpointInterval(value: Int): DecisionTreeBucketizer.this.type
def setFeaturesCol(value: String): DecisionTreeBucketizer.this.type
def setImpurity(value: String): DecisionTreeBucketizer.this.type
def setLabelCol(value: String): DecisionTreeBucketizer.this.type
def setMaxBins(value: Int): DecisionTreeBucketizer.this.type
def setMaxDepth(value: Int): DecisionTreeBucketizer.this.type
def setMinInfoGain(value: Double): DecisionTreeBucketizer.this.type
def setMinInstancesPerNode(value: Int): DecisionTreeBucketizer.this.type
def setSeed(value: Long): DecisionTreeBucketizer.this.type
You can change the value range of MaxBins to [0, 65535].
- Newly added parameters
| Parameter | Value Type | Description | spark conf Parameter |
|---|---|---|---|
| numTrainingDataCopies | Integer type. The value must be greater than or equal to 1. The default value is 1. | Number of training data copies | spark.boostkit.ml.rf.numTrainingDataCopies |
| broadcastVariables | Boolean type. The default value is false. | Whether to broadcast variables with large storage space | spark.boostkit.ml.rf.broadcastVariables |
| numPartsPerTrainingDataCopy | Integer type. The value must be greater than or equal to 0. The default value is 0, indicating that re-partitioning is not performed. | Number of partitions of a single training data copy | spark.boostkit.ml.rf.numPartsPerTrainingDataCopy |
| binnedFeaturesDataType | String type. The value can be array (default) or fasthashmap. | Storage format of features in training sample data | spark.boostkit.ml.rf.binnedFeaturesDataType |
numTrainingDataCopies (spark.boostkit.ml.rf.numTrainingDataCopies) and cacheNodeIds cannot be enabled at the same time. If numTrainingDataCopies is greater than 1 and cacheNodeIds is set to true, cacheNodeIds is forcibly changed to false and a warning is generated.
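The value ranges and the cacheNodeIds rule described above can be sketched in plain Scala. This is a minimal illustration, not the library's implementation: the `DtbConfSketch` object, the `DtbConf` case class, and both helper methods are hypothetical names introduced here, and only the parameter names, defaults, ranges, and spark conf keys come from the text above.

```scala
// Hypothetical helper, not part of the DTB API: it only mirrors the
// constraints and defaults listed for the newly added parameters.
object DtbConfSketch {
  final case class DtbConf(
      numTrainingDataCopies: Int = 1,
      broadcastVariables: Boolean = false,
      numPartsPerTrainingDataCopy: Int = 0,
      binnedFeaturesDataType: String = "array",
      cacheNodeIds: Boolean = false)

  private val Prefix = "spark.boostkit.ml.rf."

  /** Checks the documented value ranges, then resolves the cacheNodeIds conflict.
    * Returns the effective settings plus an optional warning message. */
  def resolve(conf: DtbConf): (DtbConf, Option[String]) = {
    require(conf.numTrainingDataCopies >= 1, "numTrainingDataCopies must be >= 1")
    require(conf.numPartsPerTrainingDataCopy >= 0, "numPartsPerTrainingDataCopy must be >= 0")
    require(Set("array", "fasthashmap").contains(conf.binnedFeaturesDataType),
      "binnedFeaturesDataType must be array or fasthashmap")
    if (conf.numTrainingDataCopies > 1 && conf.cacheNodeIds) {
      val warning = "cacheNodeIds forced to false because numTrainingDataCopies > 1"
      (conf.copy(cacheNodeIds = false), Some(warning))
    } else {
      (conf, None)
    }
  }

  /** Renders the four new parameters as spark conf key/value pairs. */
  def toSparkConfPairs(conf: DtbConf): Seq[(String, String)] = Seq(
    s"${Prefix}numTrainingDataCopies" -> conf.numTrainingDataCopies.toString,
    s"${Prefix}broadcastVariables" -> conf.broadcastVariables.toString,
    s"${Prefix}numPartsPerTrainingDataCopy" -> conf.numPartsPerTrainingDataCopy.toString,
    s"${Prefix}binnedFeaturesDataType" -> conf.binnedFeaturesDataType)
}
```

The pairs returned by `toSparkConfPairs` could then be passed to the Spark job, for example as `--conf key=value` options at submission time.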
- Output: decision tree binning model (DecisionTreeBucketModel). The following are fields output in model inference:
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| bucketedFeaturesCol | Vector | "bucketedFeatures" | Column that contains the features after binning |
- Example
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.DecisionTreeBucketizer

// maxBins, maxDepth, and trainingData are assumed to be defined by the caller.
val dtb = new DecisionTreeBucketizer()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxBins(maxBins)
  .setMaxDepth(maxDepth)

val pipeline = new Pipeline().setStages(Array(dtb))
val model = pipeline.fit(trainingData)
val bucketedData = model.transform(trainingData)
- Result
+----------+----------+------------+---------------+
|label | features | bucketedFeatures |
+----------+----------+------------+---------------+
| 1|(1000,[0.3,0.2,1.3,30.0,12.0,...|(1000,[0,11,7,3,2,...|
+----------+----------+------------+---------------+
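As a rough intuition for how a continuous feature value could become a bucket index like those in bucketedFeatures, here is a minimal sketch. It rests on assumptions that this document does not confirm: that each feature has its own ascending list of split thresholds learned by a decision tree, and that a value equal to a threshold falls into the lower bucket. The `BucketSketch` object and its methods are hypothetical and are not part of DecisionTreeBucketModel.

```scala
// Illustrative only: the thresholds and the boundary rule are assumptions,
// not values or behavior taken from DecisionTreeBucketModel.
object BucketSketch {
  /** Bucket index of `value`: the number of thresholds strictly below it. */
  def bucketIndex(value: Double, thresholds: Array[Double]): Int =
    thresholds.count(t => value > t)

  /** Bins each feature with its own threshold array. */
  def bucketize(features: Array[Double], splits: Array[Array[Double]]): Array[Int] = {
    require(features.length == splits.length, "one threshold array per feature")
    features.zip(splits).map { case (v, ts) => bucketIndex(v, ts) }
  }
}
```

With the made-up thresholds `Array(0.5, 1.0)` and `Array(10.0, 20.0)`, the features `Array(0.3, 30.0)` would map to buckets 0 and 2 under this rule.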