Algorithm Parameter Tuning

GBDT

Samples differ across application scenarios, so an algorithm with identical parameters can perform differently from one scenario to another. Performance here mainly means prediction precision and convergence speed. Tuning some parameters can accelerate convergence and improve model precision, and therefore the overall algorithm performance. The following uses the GBDT algorithm as an example to describe the impact of each parameter on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).
maxIter	Maximum number of iterations, that is, the number of subtrees. A value that is too small may cause underfitting. A value that is too large may cause overfitting and slows down model convergence.	Search within [80, 120]. The default value 100 is recommended.
stepSize	Learning rate, that is, how much the subtree weight is updated in each iteration. If the value is too large, training oscillates and fails to converge. If the value is too small, convergence is slow and a locally optimal result might be returned.	Search within (0, 1]. The value 0.1 is recommended.
maxDepth	Maximum depth of each subtree, which depends on the number of sample features. Do not set this parameter too high even if there are many features. Otherwise, convergence slows down and overfitting may occur.	Value range: [3, 100]. The value 5 is recommended.
maxBins	Maximum number of features considered during the division of each subtree, depending on the total number (N) of features. If the value is too large, the subtree generation time is affected.	The value is recommended.
doUseAcc	Whether to enable the feature parallel training mode. The value can be True (feature parallel mode) or False (sample parallel mode).	Boolean type. The default value is True.
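
The suggested search ranges can be turned into an explicit candidate grid before any training job is launched. The following is a minimal sketch in plain Python, not part of the library: the function names, the number of grid steps, and the candidate lists for maxIter, stepSize, and maxDepth are illustrative assumptions chosen inside the ranges given in the table.

```python
from itertools import product


def partition_grid(executor_cores, num_executors, low=0.5, high=1.5, steps=5):
    """Return candidate numPartitions values from low to high times the total cores."""
    total_cores = executor_cores * num_executors
    candidates = set()
    for i in range(steps):
        factor = low + (high - low) * i / (steps - 1)
        candidates.add(max(1, round(total_cores * factor)))
    # A set removes duplicates that appear on very small clusters.
    return sorted(candidates)


def gbdt_grid(executor_cores, num_executors):
    """Enumerate parameter combinations; the value lists are illustrative picks."""
    space = {
        "numPartitions": partition_grid(executor_cores, num_executors),
        "maxIter": [80, 100, 120],     # within the suggested [80, 120]
        "stepSize": [0.05, 0.1, 0.2],  # around the recommended 0.1
        "maxDepth": [3, 5, 7],         # around the recommended 5
    }
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


grid = gbdt_grid(executor_cores=4, num_executors=8)
print(partition_grid(4, 8))  # [16, 24, 32, 40, 48]
print(len(grid))             # 135
```

Each dictionary in the grid is one training configuration to evaluate. A full grid grows quickly, so in practice it may be cheaper to tune numPartitions first and the model parameters afterwards.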

RF

This section describes the impact of RF algorithm parameters on the model performance.

Parameter	Description	Suggestion
genericPt	Number of re-partitions during data reading. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 1 to 2 times the total number of cores (executor_cores multiplied by num_executor).
numCopiesInput	Number of training data copies. This is a newly added parameter and is specified by the spark.sophon.ml.rf.numTrainingDataCopies parameter. Increasing the value increases the degree of parallelism, but also increases the memory usage and overhead.	The default value is 1.
pt	Number of partitions used in the training phase of the optimization algorithm. This parameter is set by the spark.sophon.ml.rf.numPartsPerTrainingDataCopy parameter.	The recommended value is the round-down value of the total number of cores divided by the number of copies.
featuresType	Storage format of features in the training sample data. This is a newly added parameter and is specified by the spark.sophon.ml.rf.binnedFeaturesDataType parameter. The value is of the enumerated type and can be array or fasthashmap. The default value is array.	The value fasthashmap is recommended if the number of dimensions is greater than 5000. The value array is recommended if the number of dimensions is less than 5000.
broadcastVariables	Whether to broadcast variables with large storage space. This is a newly added parameter and is specified by the spark.sophon.ml.rf.broadcastVariables parameter. The default value is false.	Set this parameter to true if the number of dimensions is greater than 10000.
maxDepth	Maximum depth of each subtree, which depends on the number of sample features. Do not set this parameter too high even if there are many features. Otherwise, convergence slows down and overfitting may occur.	Value range: [3, 100]. The recommended value range is 11 to 15.
maxBins	Maximum number of bins used to discretize continuous features when each subtree node is split. A larger value allows finer split candidates and a more accurate solution. However, if the value is too large, the subtree generation time increases.	The value 128 is recommended.
maxMemoryInMB	Maximum amount of memory, in MB, for storing statistics. A larger value means fewer passes over the data and therefore faster training. However, increasing the value also increases the communication overhead of each iteration.	The value 2048 or 4096 is recommended. Increase the value for high-dimensional datasets. For example, set this parameter to 10240 for these datasets.
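
The dimension-based suggestions in this table can be collected into a set of Spark configuration entries. The following sketch only assembles strings. The configuration keys and thresholds come from the table, whereas the function names and the handling of exactly 5000 dimensions (kept at the default array) are assumptions.

```python
def rf_conf(executor_cores, num_executors, num_features, num_copies=1):
    """Build RF configuration entries from the cluster size and feature count."""
    total_cores = executor_cores * num_executors
    return {
        "spark.sophon.ml.rf.numTrainingDataCopies": str(num_copies),
        # pt: total number of cores divided by the number of copies, rounded down.
        "spark.sophon.ml.rf.numPartsPerTrainingDataCopy": str(total_cores // num_copies),
        # fasthashmap above 5000 dimensions, otherwise the default array.
        "spark.sophon.ml.rf.binnedFeaturesDataType":
            "fasthashmap" if num_features > 5000 else "array",
        # Broadcast large variables only above 10000 dimensions.
        "spark.sophon.ml.rf.broadcastVariables":
            "true" if num_features > 10000 else "false",
    }


def to_submit_args(conf):
    """Render the entries as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


conf = rf_conf(executor_cores=4, num_executors=8, num_features=20000, num_copies=3)
print(" ".join(to_submit_args(conf)))
```

With 32 cores and 3 copies, the sketch sets 10 partitions per training data copy, which leaves 2 cores idle during training. Choosing a copy count that divides the total number of cores avoids this.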

SVM

This section describes the impact of SVM algorithm parameters on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).
maxIter	Maximum number of iterations. If the value is too large, training takes too long and the model may overfit, which reduces the accuracy. If the value is too small, the model cannot converge to the optimal value and the accuracy is low.	Search within [50, 150]. The default value 100 is recommended. Reduce the number of iterations for a dataset with a small number of features.
inertiaCoefficient	Weight of the historical direction information in momentum calculation. This is a newly added parameter and is set by the spark.sophon.LinearSVC.inertiaCoefficient parameter. This parameter is a positive real number of the double-precision type and is used to optimize the accuracy.	The default value is 0.5.
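
To illustrate what a weight on the historical direction means, the following sketch shows a generic momentum (heavy-ball) update. It is a textbook formulation used as an assumption for illustration and is not the update rule implemented by the library. The function name and arguments are hypothetical.

```python
def momentum_step(weights, gradient, direction, step_size, inertia=0.5):
    """One generic momentum update (illustrative, not the library's rule).

    The new search direction mixes the previous direction, weighted by the
    inertia coefficient, with the current gradient.
    """
    new_direction = [inertia * d + g for d, g in zip(direction, gradient)]
    new_weights = [w - step_size * d for w, d in zip(weights, new_direction)]
    return new_weights, new_direction


weights, direction = momentum_step(
    weights=[1.0], gradient=[2.0], direction=[4.0], step_size=0.1, inertia=0.5
)
print(direction)  # [4.0], that is, 0.5 * 4.0 + 2.0
```

In this formulation, a coefficient of 0 discards the history and reduces to plain gradient descent, and a larger coefficient makes the search direction change more slowly between iterations.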

K-means

This section describes the impact of K-means algorithm parameters on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).
k	Number of cluster centers	Try several values to find the one with the best clustering effect.
maxIter	Maximum number of iterations	Use the default value for a convex dataset. On a non-convex dataset, the algorithm has difficulty converging. In this case, specify the maximum number of iterations so that the algorithm exits the loop in a timely manner.
initSteps	Number of times the algorithm is run with different initialized centroids	Run the algorithm multiple times to find a value with a better clustering effect. The default value is 10. Generally, you do not need to change the value. If the value of the k parameter is large, increase this value.
optMethod	Whether to trigger sampling. This is a newly added parameter and is specified by the spark.sophon.Kmeans.optMethod parameter. The value can be default (sampling is triggered) or allData (sampling is not triggered).	The default value is default.
sampleRate	Ratio of the data used in each iteration to the full data set. This is a newly added parameter which can affect the calculation efficiency and clustering error. Decreasing the value increases the computing efficiency but may also increase the clustering error. This parameter is set by the spark.sophon.Kmeans.sampleRate parameter.	The default value is 0.05.
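
The trade-off controlled by sampleRate can be estimated before a run. The following sketch computes the approximate number of rows that each iteration processes and assembles the two sampling-related configuration entries. The keys come from the table. The function names and the accepted range (0, 1] for the rate are assumptions.

```python
def rows_per_iteration(total_rows, sample_rate):
    """Approximate number of rows used in each iteration when sampling is on."""
    return int(total_rows * sample_rate)


def kmeans_sampling_conf(sample_rate=0.05, use_sampling=True):
    """Build the sampling-related K-means configuration entries."""
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate is assumed to lie in (0, 1]")
    conf = {"spark.sophon.Kmeans.optMethod": "default" if use_sampling else "allData"}
    if use_sampling:
        conf["spark.sophon.Kmeans.sampleRate"] = str(sample_rate)
    return conf


print(rows_per_iteration(10_000_000, 0.05))  # 500000
print(kmeans_sampling_conf(0.05))
```

With the default rate of 0.05, each iteration touches roughly one twentieth of the data. If the clustering error is too high, a larger rate or the allData mode can be tried at the cost of more computation per iteration.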

DecisionTree

This section describes the impact of DecisionTree algorithm parameters on the model performance.

Parameter	Description	Suggestion
genericPt	Number of re-partitions during data reading. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 1 to 2 times the total number of cores (executor_cores multiplied by num_executor).
numCopiesInput	Number of training data copies. This is a newly added parameter related to the implementation of the optimization algorithm. It is set by the spark.sophon.ml.rf.numTrainingDataCopies parameter.	[5, 10].
pt	Number of partitions used in the training phase of the optimization algorithm. This parameter is set by the spark.sophon.ml.rf.numPartsPerTrainingDataCopy parameter.	The recommended value is the round-down value of the total number of cores divided by the number of copies.
featuresType	Storage format of features in the training sample data. This is a newly added parameter related to the implementation of the optimization algorithm. This parameter is set by the spark.sophon.ml.rf.binnedFeaturesDataType parameter.	String. The value can be array (default) or fasthashmap. You are advised to set this parameter to fasthashmap when the dimension is high.
broadcastVariables	Whether to broadcast variables with large storage space. This parameter is a newly added parameter related to the implementation of the optimization algorithm. This parameter is set by the spark.sophon.ml.rf.broadcastVariables parameter.	Boolean type. The default value is false. You are advised to set this parameter to true when the dimension is high.
copyStrategy	Copy allocation policy. The value can be normal or plus. The default value is normal. If the running time of a task in the training phase is much longer than that of other tasks, set this parameter to plus by the spark.sophon.ml.rf.copyStrategy parameter.	The default value is normal.
numFeaturesOptFindSplits	Threshold of the feature dimension. When the feature dimension of a dataset is greater than the value of this parameter, the search optimization of the high-dimensional feature segmentation point is triggered. The default value is 8196, which is specified by the spark.sophon.ml.rf.numFeaturesOptimizeFindSplitsThreshold parameter.	If the proportion of the stage for feature segmentation point search to the total duration is large, decrease the threshold to trigger high-dimensional optimization in advance.
maxDepth	Maximum depth of each subtree, which depends on the number of sample features. Do not set this parameter too high even if there are many features. Otherwise, convergence slows down and overfitting may occur.	Value range: [3, 100]. The recommended value range is 11 to 15.
maxBins	Maximum number of bins used to discretize continuous features when each subtree node is split. A larger value allows finer split candidates and a more accurate solution. However, if the value is too large, the subtree generation time increases.	The recommended value is 128.
maxMemoryInMB	Maximum amount of memory, in MB, for storing statistics. A larger value means fewer passes over the data and therefore faster training. However, increasing the value also increases the communication overhead of each iteration.	The value 2048 or 4096 is recommended. Increase the value for high-dimensional datasets. For example, set this parameter to 10240 for these datasets.
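
Because pt depends on numCopiesInput, the two parameters are best searched together. The following sketch enumerates the suggested copy counts from 5 to 10 and derives the matching pt value for each. The function name is hypothetical, and the rule of skipping copy counts that would leave no partition per copy is an assumption.

```python
def copy_plans(executor_cores, num_executors, copies_range=range(5, 11)):
    """Pair each candidate numCopiesInput with pt = total cores // copies."""
    total_cores = executor_cores * num_executors
    plans = []
    for copies in copies_range:
        parts = total_cores // copies
        if parts >= 1:  # skip copy counts that leave no partition per copy
            plans.append({"numCopiesInput": copies, "pt": parts})
    return plans


for plan in copy_plans(executor_cores=4, num_executors=8):
    print(plan)
```

For 32 cores, the candidates range from 5 copies with 6 partitions each to 10 copies with 3 partitions each. Each pair can then be timed on a representative dataset.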

LogisticRegression

This section describes the impact of LogisticRegression algorithm parameters on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).

PCA

This section describes the impact of PCA algorithm parameters on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).

SVD

This section describes the impact of SVD algorithm parameters on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).

LDA

This section describes the impact of LDA algorithm parameters on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).
spark.task.cpus	Number of CPU cores allocated to each task	Set this parameter to the same value as executor_cores.
spark.driver.cores	Number of CPU cores allocated to the driver process	Set the parameter based on the actual number of CPU cores.
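
Setting spark.task.cpus changes how many tasks run at the same time, because Spark schedules as many concurrent tasks on an executor as its cores divided by the cores per task allow. The following sketch shows this arithmetic. The function name is hypothetical.

```python
def concurrent_tasks(executor_cores, num_executors, task_cpus):
    """Number of tasks that can run at once across the cluster."""
    tasks_per_executor = executor_cores // task_cpus
    return tasks_per_executor * num_executors


# spark.task.cpus equal to executor_cores: one task per executor at a time.
print(concurrent_tasks(executor_cores=8, num_executors=4, task_cpus=8))  # 4
# Spark's default of one core per task.
print(concurrent_tasks(executor_cores=8, num_executors=4, task_cpus=1))  # 32
```

With the suggested setting, each executor runs a single task that has all of the executor's cores available to it.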

PrefixSpan

This section describes the impact of PrefixSpan algorithm parameters on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).
localTimeout	Timeout interval for local processing, in seconds. This is a newly added parameter and is set by the spark.sophon.ml.ps.localTimeout parameter.	If local resolution takes longer than other phases, set this parameter to a smaller value. The recommended value is 300.
filterCandidates	Whether to filter the prefix candidate set. This is a newly added parameter and is set by the spark.sophon.ml.ps.filterCandidates parameter. If this parameter is enabled, the communication volume decreases and the calculation workload increases.	Boolean type. The default value is false.
projDBStep	Attenuation rate of the projection data volume. This is a newly added parameter and is set by the spark.sophon.ml.ps.projDBStep parameter. Retain the default value. A larger value means less calculation workload of the local processing.	Double type. The default value is 10.

ALS

This section describes the impact of ALS algorithm parameters on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).
blockMaxRow	Row block size of a Gram matrix, which is related to the L1 cache size and affects the calculation performance of the local matrix. This is a newly added parameter and is set by the spark.sophon.ALS.blockMaxRow parameter.	Positive integer. The default value is 16. You are advised to retain the default value.
unpersistCycle	Unpersist period for srcFactorRDD. When the number of iterations reaches the value of this parameter, the accumulated srcFactorRDD is unpersisted and the memory is released. A smaller value means that the memory is released more frequently. This is a newly added parameter and is set by the spark.sophon.ALS.unpersistCycle parameter.	Positive integer. The default value is 300. You are advised to retain the default value.

KNN

This section describes the impact of KNN algorithm parameters on the model performance.

Parameter	Description	Suggestion
numPartitions	Number of Spark partitions. Too many partitions create a large number of tasks and increase the scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume processed by each partition, which raises the memory usage of each agent node.	Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).
testBatchSize	Number of samples calculated at a time in the inference phase. A larger value indicates higher memory usage. This is a newly added parameter. You can use the KNNModel.setTestBatchSize() method to transfer parameters in the transform phase.	Positive integer. The default value is 1024.
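
The effect of testBatchSize can be estimated from the size of the inference set. The following sketch counts how many batches the inference phase would process for a given batch size. The function name is hypothetical, and the assumption is that samples are processed in consecutive batches with a smaller final batch.

```python
def num_batches(num_test_samples, test_batch_size=1024):
    """Number of inference batches, counting a final partial batch."""
    if test_batch_size <= 0:
        raise ValueError("test_batch_size must be a positive integer")
    # Ceiling division using integers only.
    return -(-num_test_samples // test_batch_size)


print(num_batches(10_000))        # 10
print(num_batches(10_000, 4096))  # 3
```

A larger batch size reduces the number of batches but raises the memory needed for each one, so the value is best increased only while executor memory allows it.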
