Feature Engineering

DTB

This section describes how DTB algorithm parameters affect model performance. The default configuration file directory is $KAL_TEST/conf/ml/dtb, where $KAL_TEST/conf/ is the kal-test tool deployment directory.

Parameter: numPartitions

Description: Number of Spark partitions. A large value creates many tasks, which increases scheduling time. A value that is too small can leave some nodes without tasks and increases the data volume that each partition processes, which raises the memory usage of each agent node.

Suggestion: Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).
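The suggested search range can be sketched as a small helper that lists candidate numPartitions values. This is a minimal illustration, not part of kal-test: the function name num_partitions_candidates and the steps argument are hypothetical, and the evenly spaced grid is an assumption, because the guidance above specifies only the 0.5 to 1.5 range.

```python
def num_partitions_candidates(executor_cores, num_executor, steps=5):
    """Return candidate numPartitions values between 0.5x and 1.5x the total cores.

    executor_cores and num_executor are the values named in the suggestion;
    steps (hypothetical) is the number of evenly spaced grid points.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    total_cores = executor_cores * num_executor
    low, high = 0.5, 1.5
    # Evenly spaced multipliers from 0.5 to 1.5 inclusive (an assumed grid shape).
    factors = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    # Round to whole partitions, keep at least 1, and drop duplicates.
    return sorted({max(1, round(f * total_cores)) for f in factors})


# Example: 12 executors with 8 cores each gives 96 cores in total.
print(num_partitions_candidates(executor_cores=8, num_executor=12))
# [48, 72, 96, 120, 144]
```

Each candidate would then be written to the numPartitions parameter in the algorithm's configuration file and benchmarked in turn.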

Word2Vec

This section describes how Word2Vec algorithm parameters affect model performance. The default configuration file directory is $KAL_TEST/conf/ml/word2vec, where $KAL_TEST/conf/ is the kal-test tool deployment directory.

Parameter: numPartitions

Description: Number of Spark partitions. A large value creates many tasks, which increases scheduling time. A value that is too small can leave some nodes without tasks and increases the data volume that each partition processes, which raises the memory usage of each agent node.

Suggestion: Perform a grid search over 0.5 to 1.5 times the total number of cores (executor_cores multiplied by num_executor).
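The trade-off in the description can be made concrete with a rough back-of-the-envelope estimate. The sketch below is illustrative only: partition_profile and its arguments are hypothetical names, it assumes one task per core at a time (the Spark default when spark.task.cpus is 1), and it assumes the rows are spread evenly across partitions.

```python
import math


def partition_profile(num_partitions, total_cores, num_rows):
    """Estimate the scheduling and memory pressure of one numPartitions value.

    Assumes one running task per core and evenly distributed rows.
    """
    return {
        # Rounds of tasks needed before every partition has been processed.
        "task_waves": math.ceil(num_partitions / total_cores),
        # Cores that receive no task because there are too few partitions.
        "idle_cores": max(0, total_cores - num_partitions),
        # Approximate rows each task must hold, a proxy for per-task memory.
        "rows_per_partition": math.ceil(num_rows / num_partitions),
    }


# Compare the low, middle, and high ends of the suggested range for 96 cores.
for n in (48, 96, 144):
    print(n, partition_profile(n, total_cores=96, num_rows=9_600_000))
```

Under these assumptions, the low end of the range leaves half of the cores idle and doubles the rows per partition, while the high end needs a second wave of tasks. This is why the best value is found by measurement rather than fixed in advance.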