Algorithm Parameter Tuning
GBDT
Sample data differs across application scenarios, so an algorithm with the same parameters can perform differently in different scenarios. Performance is mainly reflected in prediction precision and convergence speed. Adjusting some parameters can accelerate convergence and thus improve model precision and overall algorithm performance. The following uses the GBDT algorithm as an example to describe the impact of each parameter on model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
| maxIter | Maximum number of iterations (subtrees). A value that is too small may cause underfitting; a value that is too large may cause overfitting and slow model convergence. | Search within [80, 120]. The default value 100 is recommended. |
| stepSize | Learning rate, that is, the update step of the subtree weight in each iteration. A value that is too large causes convergence oscillation or failure to converge; a value that is too small slows convergence and may yield a locally optimal result. | Search within [0, 1]. The value 0.1 is recommended. |
| maxDepth | Maximum depth of each subtree, which depends on the number of sample features. Do not set this parameter to an excessively large value even if there are many features; otherwise, convergence slows and overfitting may occur. | [3, 100]. The value 5 is recommended. |
| maxBins | Maximum number of bins used when splitting each subtree node, which depends on the total number (N) of features. If the value is too large, subtree generation slows. | The value of |
| doUseAcc | Whether to enable the feature-parallel training mode. The value can be True (feature-parallel mode) or False (sample-parallel mode). | Boolean type. The default value is True. |
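The numPartitions suggestion above can be mechanized. The helper below is a hypothetical sketch (the function name and factor grid are assumptions, not part of the product) that enumerates grid-search candidates between 0.5 and 1.5 times the total core count:

```python
# Hypothetical helper: candidate numPartitions values for the grid search,
# taken as 0.5x to 1.5x the total core count (executor_cores * num_executor).
def partition_candidates(executor_cores, num_executors,
                         factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    total_cores = executor_cores * num_executors
    # Deduplicate the candidates and keep at least one partition each.
    return sorted({max(1, round(total_cores * f)) for f in factors})

print(partition_candidates(4, 6))  # with 24 total cores: [12, 18, 24, 30, 36]
```

Each candidate would then be passed as the algorithm's numPartitions value in a separate training run, keeping the best-performing one.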
RF
This section describes the impact of RF algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| genericPt | Number of re-partitions during data reading. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 1 to 2 times the total number of cores (the product of executor_cores and num_executor). |
| numCopiesInput | Number of training data copies. This is a newly added parameter, set by spark.boostkit.ml.rf.numTrainingDataCopies. Increasing the value increases the degree of parallelism but also the memory usage and overhead. | The default value is 1. |
| pt | Number of partitions used in the training phase of the optimization algorithm, set by spark.boostkit.ml.rf.numPartsPerTrainingDataCopy. | The recommended value is the total number of cores divided by the number of copies, rounded down. |
| featuresType | Storage format of features in the training sample data. This is a newly added parameter, set by spark.boostkit.ml.rf.binnedFeaturesDataType. The value is an enumerated type: array or fasthashmap. The default value is array. | fasthashmap is recommended if the number of dimensions is greater than 5000; array is recommended if it is less than 5000. |
| broadcastVariables | Whether to broadcast variables with large storage space. This is a newly added parameter, set by spark.boostkit.ml.rf.broadcastVariables. The default value is false. | Set this parameter to true if the number of dimensions is greater than 10000. |
| maxDepth | Maximum depth of each subtree, which depends on the number of sample features. Do not set this parameter to an excessively large value even if there are many features; otherwise, convergence slows and overfitting may occur. | [3, 100]. The recommended value range is 11 to 15. |
| maxBins | Maximum number of bins used when splitting each subtree node. A larger value yields a more accurate split search, but an excessively large value slows subtree generation. | The value 128 is recommended. |
| maxMemoryInMB | Maximum size of the memory for storing statistics. A larger value reduces the number of data traversals and improves the training speed, but also increases the communication overhead of each iteration. | The value 2048 or 4096 is recommended. Increase the value for high-dimensional datasets, for example, to 10240. |
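As a sketch, the RF-specific knobs above can be collected into Spark configuration pairs. The configuration keys come from the table; the helper function, its name, and the single high_dimensional switch (standing in for the 5000/10000-dimension thresholds) are illustrative assumptions:

```python
# Sketch: BoostKit RF tuning knobs as Spark conf key/value pairs.
# Keys are taken from the table; the helper itself is hypothetical.
def rf_boostkit_conf(total_cores, num_copies=1, high_dimensional=False):
    return {
        "spark.boostkit.ml.rf.numTrainingDataCopies": str(num_copies),
        # Recommended pt: total cores divided by the number of copies, rounded down.
        "spark.boostkit.ml.rf.numPartsPerTrainingDataCopy": str(total_cores // num_copies),
        # fasthashmap is suggested above 5000 dimensions, array below.
        "spark.boostkit.ml.rf.binnedFeaturesDataType":
            "fasthashmap" if high_dimensional else "array",
        # Broadcasting large variables is suggested above 10000 dimensions.
        "spark.boostkit.ml.rf.broadcastVariables":
            "true" if high_dimensional else "false",
    }
```

The resulting dictionary would typically be applied with `SparkConf.setAll` (or equivalent `--conf` flags) before training.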
SVM
This section describes the impact of SVM algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
| maxIter | Maximum number of iterations. A value that is too large lengthens training and may overfit the model, reducing precision; a value that is too small prevents the model from converging to the optimum, resulting in low precision. | Search within [50, 150]. The default value 100 is recommended. Reduce the number of iterations for a dataset with few features. |
| inertiaCoefficient | Weight of the historical direction information in momentum calculation. This is a newly added parameter, set by spark.boostkit.LinearSVC.inertiaCoefficient. It is a positive double-precision real number used to optimize precision. | The default value is 0.5. |
K-means
This section describes the impact of K-means algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
| k | Number of clustering centers. | Try more values to obtain a better clustering effect. |
| maxIter | Maximum number of iterations. | Use the default value for a convex dataset. For a non-convex dataset, the algorithm may have difficulty converging; in this case, specify the maximum number of iterations so that the algorithm exits the loop in a timely manner. |
| initSteps | Number of times the algorithm is run with different initial centroids. | Run the algorithm multiple times to find a value with a better clustering effect. The default value is 10 and generally does not need to be changed. If k is large, increase this value. |
| optMethod | Whether to trigger sampling. This is a newly added parameter, set by spark.boostkit.Kmeans.optMethod. The value can be default (trigger sampling) or allData (do not trigger sampling). | The default value is default. |
| sampleRate | Ratio of the data used in each iteration to the full dataset. This newly added parameter affects computing efficiency and clustering error: decreasing it increases computing efficiency but may also increase clustering error. It is set by spark.boostkit.Kmeans.sampleRate. | The default value is 0.05. |
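The two BoostKit-specific K-means options above can be sketched as a small conf builder. The configuration keys and defaults are from the table; the helper function and its argument names are assumptions:

```python
# Sketch: K-means BoostKit sampling options as Spark conf pairs.
# Keys come from the table above; the wrapper function is hypothetical.
def kmeans_boostkit_conf(use_sampling=True, sample_rate=0.05):
    if not use_sampling:
        # allData disables sampling entirely.
        return {"spark.boostkit.Kmeans.optMethod": "allData"}
    return {
        "spark.boostkit.Kmeans.optMethod": "default",
        # Lower rates compute faster but may increase clustering error.
        "spark.boostkit.Kmeans.sampleRate": str(sample_rate),
    }
```

For example, `kmeans_boostkit_conf(sample_rate=0.1)` trades a little efficiency for a potentially lower clustering error than the 0.05 default.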
DecisionTree
This section describes the impact of DecisionTree algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| genericPt | Number of re-partitions during data reading. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 1 to 2 times the total number of cores (the product of executor_cores and num_executor). |
| numCopiesInput | Number of training data copies. This is a newly added parameter related to the implementation of the optimization algorithm, set by spark.boostkit.ml.rf.numTrainingDataCopies. | [5, 10]. |
| pt | Number of partitions used in the training phase of the optimization algorithm, set by spark.boostkit.ml.rf.numPartsPerTrainingDataCopy. | The recommended value is the total number of cores divided by the number of copies, rounded down. |
| featuresType | Storage format of features in the training sample data. This is a newly added parameter related to the implementation of the optimization algorithm, set by spark.boostkit.ml.rf.binnedFeaturesDataType. | String. The value can be array (default) or fasthashmap. You are advised to set this parameter to fasthashmap when the dimension is high. |
| broadcastVariables | Whether to broadcast variables with large storage space. This is a newly added parameter related to the implementation of the optimization algorithm, set by spark.boostkit.ml.rf.broadcastVariables. | Boolean. The default value is false. You are advised to set this parameter to true when the dimension is high. |
| copyStrategy | Copy allocation policy, set by spark.boostkit.ml.rf.copyStrategy. The value can be normal or plus. If the running time of a task in the training phase is much longer than that of the other tasks, set this parameter to plus. | The default value is normal. |
| numFeaturesOptFindSplits | Threshold of the feature dimension, set by spark.boostkit.ml.rf.numFeaturesOptimizeFindSplitsThreshold. When the feature dimension of a dataset exceeds this value, the optimized search for high-dimensional feature split points is triggered. The default value is 8196. | If the split-point search stage accounts for a large proportion of the total duration, decrease the threshold to trigger the high-dimensional optimization earlier. |
| maxDepth | Maximum depth of each subtree, which depends on the number of sample features. Do not set this parameter to an excessively large value even if there are many features; otherwise, convergence slows and overfitting may occur. | [3, 100]. The recommended value range is 11 to 15. |
| maxBins | Maximum number of bins used when splitting each subtree node. A larger value yields a more accurate split search, but an excessively large value slows subtree generation. | The value 128 is recommended. |
| maxMemoryInMB | Maximum size of the memory for storing statistics. A larger value reduces the number of data traversals and improves the training speed, but also increases the communication overhead of each iteration. | The value 2048 or 4096 is recommended. Increase the value for high-dimensional datasets, for example, to 10240. |
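DecisionTree reuses the rf.* configuration keys and adds the copy-strategy and split-search-threshold knobs. The sketch below combines them; the keys and defaults come from the table, while the helper function and its flags are illustrative assumptions:

```python
# Sketch: DecisionTree BoostKit conf pairs, reusing the rf.* keys from the table.
# The wrapper and its argument names are hypothetical.
def dtree_boostkit_conf(total_cores, num_copies=5,
                        uneven_tasks=False, split_threshold=8196):
    return {
        "spark.boostkit.ml.rf.numTrainingDataCopies": str(num_copies),
        # pt: total cores divided by the number of copies, rounded down.
        "spark.boostkit.ml.rf.numPartsPerTrainingDataCopy": str(total_cores // num_copies),
        # plus rebalances copies when one training task runs far longer than the rest.
        "spark.boostkit.ml.rf.copyStrategy": "plus" if uneven_tasks else "normal",
        # Lowering the threshold triggers the high-dimensional split search earlier.
        "spark.boostkit.ml.rf.numFeaturesOptimizeFindSplitsThreshold": str(split_threshold),
    }
```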
LogisticRegression
This section describes the impact of LogisticRegression algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
LinearRegression
This section describes the impact of LinearRegression algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
PCA
This section describes the impact of PCA algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
SVD
This section describes the impact of SVD algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
LDA
This section describes the impact of LDA algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
| spark.task.cpus | Number of CPU cores allocated to each task. | Keep the same as the value of executor_cores. |
| spark.driver.cores | Number of CPU cores allocated to the driver process. | Set the parameter based on the actual number of CPU cores. |
PrefixSpan
This section describes the impact of PrefixSpan algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
| localTimeout | Timeout interval for local resolution, in seconds. This is a newly added parameter, set by spark.boostkit.ml.ps.localTimeout. | If local resolution takes longer than the other phases, set this parameter to a smaller value. The recommended value is 300. |
| filterCandidates | Whether to filter the prefix candidate set. This is a newly added parameter, set by spark.boostkit.ml.ps.filterCandidates. If enabled, the communication volume decreases but the computation increases. | Boolean. The default value is false. |
| projDBStep | Attenuation rate of the projected data volume. This is a newly added parameter, set by spark.boostkit.ml.ps.projDBStep. Generally, the default value is used. The larger the value, the less the computation of the local resolution. | Double. The default value is 10. |
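The three PrefixSpan-specific options above, written out with their recommended or default values as a plain configuration map (the keys are from the table; the literal values shown are the table's own recommendations):

```python
# Sketch: PrefixSpan BoostKit options from the table, at their suggested values.
prefixspan_conf = {
    "spark.boostkit.ml.ps.localTimeout": "300",       # seconds; lower it if local resolution dominates
    "spark.boostkit.ml.ps.filterCandidates": "false", # true trades extra computation for less communication
    "spark.boostkit.ml.ps.projDBStep": "10",          # larger value -> less local-resolution computation
}
```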
ALS
This section describes the impact of ALS algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
| blockMaxRow | Row block size of the Gram matrix, which is related to the L1 cache size and affects the computation performance of the local matrix. This is a newly added parameter, set by spark.boostkit.ALS.blockMaxRow. | Positive integer. The default value is 16. You are advised to retain the default value. |
| unpersistCycle | Unpersist period for srcFactorRDD. When the number of iterations reaches this value, the accumulated srcFactorRDD is unpersisted and the memory is released. A smaller value means the memory is released more frequently. This is a newly added parameter, set by spark.boostkit.ALS.unpersistCycle. | Positive integer. The default value is 300. You are advised to retain the default value. |
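Since both ALS options above should normally keep their defaults, a minimal configuration map (keys from the table, values the documented defaults) is enough as a starting point:

```python
# Sketch: ALS BoostKit options at their documented defaults.
als_conf = {
    "spark.boostkit.ALS.blockMaxRow": "16",      # row block size of the Gram matrix (L1-cache related)
    "spark.boostkit.ALS.unpersistCycle": "300",  # iterations between srcFactorRDD unpersists
}
```

Lowering unpersistCycle frees memory more often at the cost of recomputation, so change it only under memory pressure.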
KNN
This section describes the impact of KNN algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
| testBatchSize | Number of samples computed at a time in the inference phase. A larger value means higher memory usage. This is a newly added parameter; pass it with the KNNModel.setTestBatchSize() method in the transform phase. | Positive integer. The default value is 1024. |
Covariance
This section describes the impact of Covariance algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
DBSCAN
This section describes the impact of DBSCAN algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. | It is recommended that numPartitions be the same as the number of executors. (You can decrease the number of executors and increase the resource configuration of a single executor to improve performance.) |
| epsilon | Nearest-neighbor distance of the DBSCAN algorithm. | The value is greater than 0.0. |
| minPoints | Threshold of the DBSCAN algorithm that defines the number of neighboring points required for a core point. | Positive integer. |
| sampleRate | Sampling rate of the input data, used to partition the space of the full input data based on the sampled data. | The value range is (0.0, 1.0]. The default value is 1.0, indicating that the full input data is used. |
Pearson
This section describes the impact of Pearson algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
Spearman
This section describes the impact of Spearman algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time; too few may leave some nodes without tasks and increase the data volume processed by each partition, raising the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
XGBoost
This section describes the impact of XGBoost algorithm parameters on the model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| spark.task.cpus | Number of CPU cores allocated to each task. | Keep the same as the value of executor_cores. |
| max_depth | Maximum depth of each subtree, which depends on the number of sample features. Do not set this parameter to an excessively large value even if there are many features; otherwise, convergence slows and overfitting may occur. | The value ranges from 3 to 100. The default value is 6. |
| enable_bbgen | Whether to use the batch Bernoulli bit generation algorithm. Setting this parameter to true improves the sampling performance and therefore the training performance. | The recommended value is true. |
| rabit_enable_tcp_no_delay | Controls the communication policy of the Rabit engine. This parameter is usually set to true to improve the training performance. | The recommended value is true. |
| num_workers | Total number of tasks when the XGBoost algorithm is executed. | Set this parameter to the value of num-executors. If it exceeds num-executors, the algorithm may fail to execute. |
| nthread | Number of concurrent threads for each task when the XGBoost algorithm is used. | Set this parameter to the value of executor-cores. |
| grow_policy | Controls how new nodes are added to the tree; the depthwiselossltd option is newly added. This parameter takes effect only when tree_method is set to hist, and the best value depends on the training data. Generally, depthwise brings higher precision at the cost of a longer training duration; lossguide is the opposite; depthwiselossltd falls between the two. | String. The default value is depthwise. The options are depthwise, lossguide, and depthwiselossltd. |
| min_loss_ratio | Controls the degree of tree-node pruning during training. This parameter takes effect only when grow_policy is set to depthwiselossltd. A larger value means more pruning, faster training, and lower precision. | Double. Default value: 0. Value range: [0, 1). |
| sampling_strategy | Controls the sampling policy during training. The sampling frequency in descending order is eachTree > eachIteration > multiIteration > allIteration. A lower sampling frequency means less sampling time overhead but lower precision. gossStyle is gradient-based sampling, which has higher cost and precision. | String. The default value is eachTree. The options are eachTree, eachIteration, allIteration, multiIteration, and gossStyle. |
| sampling_step | Controls the interval, in rounds, between samplings. This parameter is valid only when sampling_strategy is set to multiIteration. A larger interval means a lower sampling frequency, less sampling overhead, and lower precision. | Int. Default value: 1. Value range: [1, +∞). |
| auto_subsample | Whether to use the policy of automatically reducing the sampling rate. When enabled, the system automatically attempts smaller sampling rates. The search itself costs time, but if a suitable small rate is found, the overall training time decreases. | Boolean. The value can be true or false. The default value is false. |
| auto_k | Controls the number of rounds in the automatic sampling-rate reduction policy. This parameter is valid only when auto_subsample is set to true. A larger value lengthens the sampling-rate search but makes the result more accurate. | Int. Default value: 1. Value range: [1, +∞). |
| auto_subsample_ratio | Candidate ratios for the automatic sampling-rate decrease, given as an array whose elements are sorted in ascending order. More elements mean more search attempts, which may increase the time overhead but yield a more accurate result; smaller elements mean smaller candidate sampling rates. | Array[Double]. Default value: Array(0.05, 0.1, 0.2, 0.4, 0.8, 1.0). Value range: (0, 1]. |
| auto_r | Controls the allowed error-rate increase caused by automatically reducing the sampling rate. A smaller value allows a higher error rate. | Double. Default value: 0.95. Value range: (0, 1]. |
| random_split_denom | Controls the proportion of candidate split points. A larger value means a shorter training duration but a larger error. | Int. Default value: 1. Value range: [1, +∞). |
| default_direction | Controls the default direction of missing values. The default value is learn. If left or right is selected, the training duration and precision may decrease. | String. The value can be left, right, or learn. |
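To tie the cluster-sizing rows together: num_workers and nthread mirror the spark-submit resources. The sketch below builds a parameter map from those two values plus the table's defaults; the helper function itself is an illustrative assumption, not part of the XGBoost API:

```python
# Sketch: an XGBoost-on-Spark parameter map following the table's suggestions.
# The wrapper is hypothetical; the keys and defaults come from the table.
def xgb_params(num_executors, executor_cores):
    return {
        "max_depth": 6,                       # default; search 3-100 if needed
        "enable_bbgen": True,                 # batch Bernoulli bit generation
        "rabit_enable_tcp_no_delay": True,    # recommended Rabit communication policy
        "num_workers": num_executors,         # must not exceed num-executors
        "nthread": executor_cores,            # match executor-cores
        "grow_policy": "depthwise",           # default; see the table for alternatives
    }
```

Passing a num_workers larger than the actual num-executors risks the execution failure noted in the table, so derive both values from the same spark-submit settings.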
