Feature Engineering
DTB
This section describes how the DTB algorithm parameters affect model performance. The default configuration file directory is $KAL_TEST/conf/ml/dtb, where $KAL_TEST/conf/ is the kal-test tool deployment directory.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. A large value creates many tasks and increases scheduling time. A small value may leave some nodes without tasks and increases the data volume processed by each partition, which raises the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores, where the total is executor_cores multiplied by num_executor. |
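
The suggested search range can be turned into a concrete candidate list. The following is a minimal sketch, not part of kal-test: the function name `partition_candidates`, the step factors, and the example cluster size (4 cores per executor, 12 executors) are assumptions chosen for illustration.

```python
def partition_candidates(executor_cores, num_executor,
                         factors=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Return numPartitions values to try, from 0.5x to 1.5x the total cores.

    The factor steps are an illustrative assumption; only the 0.5 to 1.5
    range comes from the tuning suggestion.
    """
    total_cores = executor_cores * num_executor
    # Round to whole partitions, keep at least 1, and drop duplicates.
    return sorted({max(1, round(total_cores * f)) for f in factors})


# Hypothetical cluster: 4 cores per executor, 12 executors (48 cores in total).
print(partition_candidates(4, 12))  # [24, 36, 48, 60, 72]
```

Each value in the list would then be written to the numPartitions entry of the configuration file and benchmarked in turn, keeping the value that gives the best run time.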
Word2Vec
This section describes how the Word2Vec algorithm parameters affect model performance. The default configuration file directory is $KAL_TEST/conf/ml/word2vec, where $KAL_TEST/conf/ is the kal-test tool deployment directory.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. A large value creates many tasks and increases scheduling time. A small value may leave some nodes without tasks and increases the data volume processed by each partition, which raises the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores, where the total is executor_cores multiplied by num_executor. |
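
The trade-off in the numPartitions description can be made concrete with a rough estimate. This sketch is an illustration only: the function `partition_profile` is hypothetical, and it assumes that data is spread evenly across partitions and that each core runs one task at a time, which real Spark jobs only approximate.

```python
import math


def partition_profile(num_partitions, total_cores, dataset_rows):
    """Estimate scheduling load and per-partition data for one setting.

    Assumes evenly sized partitions and one task per core at a time.
    """
    return {
        # Rounds of tasks needed before every partition has been processed.
        "task_waves": math.ceil(num_partitions / total_cores),
        # Cores that receive no task when there are fewer partitions than cores.
        "idle_cores": max(0, total_cores - num_partitions),
        # Upper bound on rows held by one partition, a proxy for memory use.
        "rows_per_partition": math.ceil(dataset_rows / num_partitions),
    }


# Hypothetical job: 48 cores and 1,200,000 input rows.
for n in (24, 48, 72):
    print(n, partition_profile(n, total_cores=48, dataset_rows=1_200_000))
```

With these example numbers, 24 partitions leave half of the cores idle and double the rows per partition compared with 48, while 72 partitions need a second wave of tasks. This is why the search range is centered on the total number of cores.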