Classification and Regression

GBDT

Sample data differs across application scenarios, so an algorithm with the same parameters can perform differently in different scenarios. Performance is mainly reflected in prediction precision and convergence speed. Some parameters can be adjusted to accelerate convergence and thus improve model precision and overall algorithm performance. The following uses the GBDT algorithm as an example to describe the impact of each parameter on model performance.

Parameter

Description

Suggestion

numPartitions

Number of Spark partitions. A large value creates many tasks and increases the scheduling time. If the number of partitions is too small, tasks may not be allocated to some nodes, and the data volume processed by each partition increases, which raises the memory usage of each agent node.

Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).

maxIter

Maximum number of iterations, that is, the number of subtrees. A small value may cause underfitting; a large value may cause overfitting and slow down model convergence.

Search for the parameter within [80, 120]. The default value 100 is recommended.

stepSize

Learning rate, that is, the step by which the subtree weight is updated in each iteration. If the value is too large, the convergence oscillates and may fail; if the value is too small, convergence is slow and the result may be only locally optimal.

Search for the parameter within [0, 1]. The value 0.1 is recommended.

maxDepth

Maximum depth of each subtree, which depends on the number of sample features. Do not set this parameter to an excessively large value even if there are many features; otherwise, convergence slows down and overfitting may occur.

[3, 100]. The value 5 is recommended.

maxBins

Maximum number of features considered when splitting each subtree node, which depends on the total number (N) of features. If the value is too large, subtree generation takes longer.

The recommended value depends on the total number (N) of features.

doUseAcc

Indicates whether to enable the feature parallel training mode. The value can be true (feature parallel mode) or false (sample parallel mode).

Boolean. The default value is true.
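The numPartitions grid search above is simple arithmetic over the total core count. A minimal sketch; the helper name and the example resource sizes are illustrative, not part of any library:

```python
def partition_grid(executor_cores, num_executor, low=0.5, high=1.5, steps=5):
    """Candidate numPartitions values spanning 0.5x to 1.5x the total core count."""
    total_cores = executor_cores * num_executor
    span = high - low
    return sorted({max(1, round(total_cores * (low + span * i / (steps - 1))))
                   for i in range(steps)})

# e.g. 12 cores per executor across 8 executors -> 96 cores in total
print(partition_grid(12, 8))  # [48, 72, 96, 120, 144]
```

Each candidate would then be tried as the numPartitions value, keeping the one with the best training time.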

Random Forest

This part describes the impact of Random Forest algorithm parameters on the model performance.

Parameter

Description

Suggestion

genericPt

Number of re-partitions during data reading. A large value creates many tasks and increases the scheduling time. If the number of partitions is too small, tasks may not be allocated to some nodes, and the data volume processed by each partition increases, which raises the memory usage of each agent node.

Perform a grid search over 1 to 2 times the total number of cores (executor_cores multiplied by num_executor).

numCopiesInput

Number of training data copies. This is a newly added parameter and is specified by the spark.boostkit.ml.rf.numTrainingDataCopies parameter. Increasing the value increases the degree of parallelism and memory overhead.

The default value is 1.

pt

Number of partitions used in the training phase of the optimization algorithm. This parameter is specified by the spark.boostkit.ml.rf.numPartsPerTrainingDataCopy parameter.

The recommended value is the total number of cores divided by the number of copies, rounded down.

featuresType

Storage format of features in the training sample data. This is a newly added parameter and is specified by the spark.boostkit.ml.rf.binnedFeaturesDataType parameter. The value is of the enumerated type and can be array or fasthashmap. The default value is array.

The value fasthashmap is recommended if the number of dimensions is greater than 5000. The value array is recommended if the number of dimensions is less than 5000.

broadcastVariables

Indicates whether to broadcast variables with large storage space. This is a newly added parameter and is set by the spark.boostkit.ml.rf.broadcastVariables parameter. The default value is false.

Set this parameter to true if the number of dimensions is greater than 10000.

maxDepth

Maximum depth of each subtree, which depends on the number of sample features. Do not set this parameter to an excessively large value even if there are many features; otherwise, convergence slows down and overfitting may occur.

[3, 100]. The recommended value range is 11 to 15.

maxBins

Maximum number of features considered when splitting each subtree node. A larger value yields a more accurate solution, but an excessively large value prolongs subtree generation.

The value 128 is recommended.

maxMemoryInMB

Maximum size of the memory for storing statistics. A larger value allows more node statistics to be aggregated per pass, so fewer passes over the data are needed and the training speed improves. However, increasing the value also increases the communication overhead of each iteration.

The value 2048 or 4096 is recommended. For high-dimensional datasets, increase the value, for example to 10240.
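The pt suggestion above (partitions per copy = total cores divided by the number of copies, rounded down) is plain integer division. A tiny hypothetical helper:

```python
def parts_per_copy(executor_cores, num_executor, num_copies):
    """Recommended spark.boostkit.ml.rf.numPartsPerTrainingDataCopy value:
    total cores divided by numCopiesInput, rounded down (at least 1)."""
    total_cores = executor_cores * num_executor
    return max(1, total_cores // num_copies)

print(parts_per_copy(12, 8, 5))  # 96 cores over 5 copies -> 19 partitions per copy
```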

SVM

This part describes the impact of SVM algorithm parameters on the model performance.

Parameter

Description

Suggestion

numPartitions

Number of Spark partitions. A large value creates many tasks and increases the scheduling time. If the number of partitions is too small, tasks may not be allocated to some nodes, and the data volume processed by each partition increases, which raises the memory usage of each agent node.

Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).

maxIter

Maximum number of iterations. If the value is too large, training takes too long and the model may overfit, reducing accuracy. If the value is too small, the model cannot converge to the optimum and accuracy is low.

Search for the parameter within [50, 150]. The default value 100 is recommended. Reduce the number of iterations for a dataset with a small number of features.

inertiaCoefficient

Weight of the historical direction information in momentum calculation. This newly added parameter is set by the spark.boostkit.LinearSVC.inertiaCoefficient parameter. It is a positive double-precision real number used to optimize accuracy.

The default value is 0.5.
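The inertiaCoefficient weights the historical update direction in a momentum-style step. A minimal sketch of such an update, assuming a plain gradient step; the function and its exact update form are illustrative, not the library's actual implementation:

```python
def momentum_step(w, grad, velocity, step_size=0.1, inertia=0.5):
    """One momentum update: blend the previous direction (scaled by the
    inertia coefficient) with the new gradient step."""
    new_velocity = [inertia * v - step_size * g for v, g in zip(velocity, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_velocity)]
    return new_w, new_velocity

# first step is a pure gradient step, since the velocity starts at zero
w, v = momentum_step([1.0, 2.0], [0.5, -0.5], [0.0, 0.0])
print(w)
```

A larger inertia value makes the trajectory smoother but slower to react; the default 0.5 balances the two.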

Decision Tree

This part describes the impact of Decision Tree algorithm parameters on the model performance.

Parameter

Description

Suggestion

genericPt

Number of re-partitions during data reading. A large value creates many tasks and increases the scheduling time. If the number of partitions is too small, tasks may not be allocated to some nodes, and the data volume processed by each partition increases, which raises the memory usage of each agent node.

Perform a grid search over 1 to 2 times the total number of cores (executor_cores multiplied by num_executor).

numCopiesInput

Number of training data copies. This is a newly added parameter related to the implementation of the optimization algorithm. It is set by the spark.boostkit.ml.rf.numTrainingDataCopies parameter.

[5, 10]

pt

Number of partitions used in the training phase of the optimization algorithm. This parameter is specified by the spark.boostkit.ml.rf.numPartsPerTrainingDataCopy parameter.

The recommended value is the total number of cores divided by the number of copies, rounded down.

featuresType

Storage format of features in the training sample data. This is a newly added parameter related to the implementation of the optimization algorithm. This parameter is set by the spark.boostkit.ml.rf.binnedFeaturesDataType parameter.

String. The value can be array (default) or fasthashmap. You are advised to set this parameter to fasthashmap when the dimension is high.

broadcastVariables

Indicates whether to broadcast variables with large storage space. This is a newly added parameter related to the implementation of the optimization algorithm. This parameter is set by the spark.boostkit.ml.rf.broadcastVariables parameter.

Boolean. The default value is false. You are advised to set this parameter to true when the dimension is high.

copyStrategy

Copy allocation policy, set by the spark.boostkit.ml.rf.copyStrategy parameter. The value can be normal or plus; the default value is normal. If one task in the training phase runs much longer than the other tasks, set this parameter to plus.

The default value is normal.

numFeaturesOptFindSplits

Threshold of the feature dimension, specified by the spark.boostkit.ml.rf.numFeaturesOptimizeFindSplitsThreshold parameter. When the feature dimension of a dataset exceeds this value, search optimization for high-dimensional feature split points is triggered. The default value is 8196.

If the stage that searches for feature split points accounts for a large share of the total duration, decrease this threshold so that the high-dimensional optimization is triggered earlier.

maxDepth

Maximum depth of each subtree, which depends on the number of sample features. Do not set this parameter to an excessively large value even if there are many features; otherwise, convergence slows down and overfitting may occur.

[3, 100]. The recommended value range is 11 to 15.

maxBins

Maximum number of features considered when splitting each subtree node. A larger value yields a more accurate solution, but an excessively large value prolongs subtree generation.

The value 128 is recommended.

maxMemoryInMB

Maximum size of the memory for storing statistics. A larger value allows more node statistics to be aggregated per pass, so fewer passes over the data are needed and the training speed improves. However, increasing the value also increases the communication overhead of each iteration.

The value 2048 or 4096 is recommended. For high-dimensional datasets, increase the value, for example to 10240.
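The spark.boostkit.ml.rf.* properties above are ordinary Spark configuration entries, so they can be passed with --conf at submission time. A sketch; the application jar name, resource flags, and the chosen values are placeholders to be tuned per the table:

```shell
spark-submit \
  --master yarn \
  --conf spark.boostkit.ml.rf.numTrainingDataCopies=5 \
  --conf spark.boostkit.ml.rf.numPartsPerTrainingDataCopy=19 \
  --conf spark.boostkit.ml.rf.binnedFeaturesDataType=fasthashmap \
  --conf spark.boostkit.ml.rf.broadcastVariables=true \
  --conf spark.boostkit.ml.rf.copyStrategy=normal \
  --conf spark.boostkit.ml.rf.numFeaturesOptimizeFindSplitsThreshold=8196 \
  my-decision-tree-app.jar
```

Here 19 partitions per copy follows the pt suggestion for a hypothetical 96-core cluster with 5 data copies (96 / 5 rounded down).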

Logistic Regression

This part describes the impact of Logistic Regression algorithm parameters on the model performance.

Parameter

Description

Suggestion

numPartitions

Number of Spark partitions. A large value creates many tasks and increases the scheduling time. If the number of partitions is too small, tasks may not be allocated to some nodes, and the data volume processed by each partition increases, which raises the memory usage of each agent node.

Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).

Linear Regression

This part describes the impact of Linear Regression algorithm parameters on the model performance.

Parameter

Description

Suggestion

numPartitions

Number of Spark partitions. A large value creates many tasks and increases the scheduling time. If the number of partitions is too small, tasks may not be allocated to some nodes, and the data volume processed by each partition increases, which raises the memory usage of each agent node.

Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).

XGBoost

This part describes the impact of XGBoost algorithm parameters on the model performance.

Parameter

Description

Suggestion

spark.task.cpus

Number of CPU cores allocated to each task.

Keep the same as the value of executor_cores.

max_depth

Maximum depth of each subtree, which depends on the number of sample features. Do not set this parameter to an excessively large value even if there are many features; otherwise, convergence slows down and overfitting may occur.

[3, 100]. The default value is 6.

enable_bbgen

Indicates whether to use the batch Bernoulli bit generation algorithm. Setting this parameter to true improves the sampling and training performance.

The recommended value is true.

rabit_enable_tcp_no_delay

Controls the communication policy in the Rabit engine. Usually this parameter is set to true to improve the training performance.

The recommended value is true.

num_workers

Total number of tasks when the XGBoost algorithm is executed.

Keep the same as the value of num-executors. If the value of this parameter exceeds that of num-executors, the algorithm may fail to be executed.

nthread

Number of concurrent threads for each task when the XGBoost algorithm is used.

Keep the same as the value of executor_cores.

grow_policy

Controls how new nodes are added to the tree; the depthwiselossltd option is added. This parameter takes effect only when tree_method is set to hist, and the best choice depends on the training data. Generally, depthwise brings higher precision but increases the training duration; lossguide behaves in the opposite way; depthwiselossltd falls between the two and can be tuned through configuration.

String. The default value is depthwise. The options are depthwise, lossguide, and depthwiselossltd.

min_loss_ratio

Controls the pruning degree of tree nodes during training. This parameter takes effect only when grow_policy is set to depthwiselossltd. A larger value indicates more pruning operations, faster speed, and lower precision.

Double. Default value: 0. Value range: [0, 1).

sampling_strategy

Controls the sampling policy during training. The sampling frequency in descending order is eachTree > eachIteration > multiIteration > alliteration. A lower sampling frequency means less sampling time overhead and lower accuracy. gossStyle is gradient-based sampling, which has a higher cost and higher precision.

String. The default value is eachTree. The options are eachTree, eachIteration, alliteration, multiIteration, and gossStyle.

sampling_step

Controls the number of sampling rounds. This parameter is valid only when sampling_strategy is set to multiIteration. A larger interval means a lower sampling frequency, less sampling overhead, and lower accuracy.

Int. Default value: 1. Value range: [1, +∞).

auto_subsample

Indicates whether to automatically reduce the sampling rate. When this function is enabled, the system attempts to use a smaller sampling rate. The sampling-rate search itself incurs time overhead, but a suitably small sampling rate reduces the overall training time.

Boolean. The value can be true or false. The default value is false.

auto_k

Controls the number of rounds in the automatic sampling rate reduction policy. This parameter is valid only when auto_subsample is set to true. A larger value indicates longer sampling rate search duration but more accurate search result.

Int. Default value: 1. Value range: [1, +∞).

auto_subsample_ratio

Ratio array for automatic sampling-rate reduction, with elements sorted in ascending order. The more elements in the array, the more sampling rates the system tries, which may increase the time overhead but make the search result more accurate. The smaller the elements, the smaller the candidate sampling rates.

Array[Double]. Default value: Array(0.05,0.1,0.2,0.4,0.8,1.0). Value range: (0, 1].

auto_r

Controls the error-rate increase allowed when the sampling rate is automatically reduced. A smaller value permits a higher error rate.

Double. Default value: 0.95. Value range: (0, 1].

random_split_denom

Controls the proportion of candidate split points. A larger value indicates a shorter training duration and a larger error.

Int. Default value: 1. Value range: [1, +∞).

default_direction

Controls the default direction for missing values. The default value is learn. If left or right is selected, the training duration and accuracy may decrease.

String. The value can be left, right, or learn.
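Several of the parameters above only take effect in combination (grow_policy=depthwiselossltd requires tree_method=hist, min_loss_ratio applies only to that grow_policy, and sampling_step only to multiIteration). A hedged sketch of a parameter map with a sanity check for those interactions; the map style is illustrative, and the exact accepted keys should be verified against the XGBoost build you use:

```python
params = {
    "tree_method": "hist",              # required for the grow_policy variants below
    "grow_policy": "depthwiselossltd",  # between depthwise and lossguide
    "min_loss_ratio": 0.1,              # only used with depthwiselossltd; range [0, 1)
    "sampling_strategy": "multiIteration",
    "sampling_step": 2,                 # only used with multiIteration
    "enable_bbgen": True,
    "rabit_enable_tcp_no_delay": True,
}

def validate(p):
    """Sanity-check the parameter interactions described in the table."""
    if p.get("min_loss_ratio", 0) > 0:
        assert p.get("grow_policy") == "depthwiselossltd", \
            "min_loss_ratio only applies to depthwiselossltd"
        assert p.get("tree_method") == "hist", \
            "grow_policy variants require tree_method=hist"
    if p.get("sampling_step", 1) > 1:
        assert p.get("sampling_strategy") == "multiIteration", \
            "sampling_step only applies to multiIteration"
    return True

print(validate(params))
```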

KNN

This part describes the impact of KNN algorithm parameters on the model performance.

Parameter

Description

Suggestion

numPartitions

Number of Spark partitions. A large value creates many tasks and increases the scheduling time. If the number of partitions is too small, tasks may not be allocated to some nodes, and the data volume processed by each partition increases, which raises the memory usage of each agent node.

Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).

testBatchSize

Number of samples computed at a time in the inference phase. A larger value uses more memory. This is a newly added parameter; use the KNNModel.setTestBatchSize() method to pass it in the transform phase.

Positive Int. The default value is 1024.
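testBatchSize trades memory for fewer inference rounds: the number of batches is a ceiling division of the sample count by the batch size. A hypothetical helper to make the trade-off concrete:

```python
def num_inference_batches(num_samples, test_batch_size=1024):
    """Number of batches the inference phase needs at a given testBatchSize."""
    return -(-num_samples // test_batch_size)  # ceiling division

print(num_inference_batches(10000))        # 10 batches at the default 1024
print(num_inference_batches(10000, 4096))  # 3 larger batches: fewer rounds, more memory
```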