Feature Engineering
PCA
This section describes how the PCA algorithm parameters affect model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume each partition processes, which raises the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
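The suggested search range can be worked out with a little arithmetic. The following is a minimal sketch, not part of the original guidance: the function name `num_partitions_candidates` and the `steps` parameter (how many evenly spaced values to try) are assumptions made for illustration, as is the choice to round each value to the nearest integer.

```python
def num_partitions_candidates(executor_cores, num_executor, steps=5):
    """Return evenly spaced numPartitions candidates between 0.5x and 1.5x
    the total number of cores (executor_cores * num_executor).

    `steps` is a hypothetical knob for how many grid points to generate.
    """
    total_cores = executor_cores * num_executor
    low, high = 0.5 * total_cores, 1.5 * total_cores
    if steps < 2:
        # With a single grid point, fall back to the total core count itself.
        return [total_cores]
    # Round to whole partitions, never go below 1, and drop duplicates.
    values = {
        max(1, round(low + i * (high - low) / (steps - 1)))
        for i in range(steps)
    }
    return sorted(values)


# Example: 12 executors with 4 cores each give 48 cores in total.
print(num_partitions_candidates(executor_cores=4, num_executor=12))
# → [24, 36, 48, 60, 72]
```

Each candidate would then be tried as the numPartitions value in a separate run, keeping the one with the shortest run time.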
SVD
This section describes how the SVD algorithm parameters affect model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume each partition processes, which raises the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
Covariance
This section describes how the Covariance algorithm parameters affect model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume each partition processes, which raises the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
Pearson
This section describes how the Pearson algorithm parameters affect model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume each partition processes, which raises the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
Spearman
This section describes how the Spearman algorithm parameters affect model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume each partition processes, which raises the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
DTB
This section describes how the DTB algorithm parameters affect model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume each partition processes, which raises the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
Word2Vec
This section describes how the Word2Vec algorithm parameters affect model performance.
| Parameter | Description | Suggestion |
|---|---|---|
| numPartitions | Number of Spark partitions. Too many partitions create a large number of tasks and increase scheduling time. Too few partitions may leave some nodes without tasks and increase the data volume each partition processes, which raises the memory usage of each agent node. | Perform a grid search over 0.5 to 1.5 times the total number of cores (the product of executor_cores and num_executor). |
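For any of the algorithms in this section, the grid search itself amounts to timing one run per candidate numPartitions value and keeping the fastest. The sketch below shows one possible way to organize that loop. It is an illustration only: `pick_fastest` and `run_job` are hypothetical names, and `run_job` stands in for whatever code submits the job with a given numPartitions value. The `clock` parameter is also an assumption, added so the timer can be swapped out.

```python
import time


def pick_fastest(candidates, run_job, clock=time.perf_counter):
    """Time run_job(num_partitions) once per candidate.

    Returns (best_num_partitions, timings), where timings maps each
    candidate to its elapsed time as measured by `clock`.
    `run_job` is a hypothetical callable supplied by the caller.
    """
    timings = {}
    for num_partitions in candidates:
        start = clock()
        run_job(num_partitions)
        timings[num_partitions] = clock() - start
    # The candidate with the smallest elapsed time wins.
    best = min(timings, key=timings.get)
    return best, timings
```

A single timed run per candidate can be noisy on a shared cluster, so repeating each run and comparing averages may give a more reliable choice.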