Other Tuning Items

Purpose

Adjust the number of partitions based on the number of cores so that each core processes roughly the same amount of data. This minimizes data skew and prevents a single core from spending excessive time on processing.

Procedure

In this scenario, you can set the number of partitions and the parallelism to three to five times the total number of CPU cores. This reduces the amount of data processed by each task and improves performance. You can use the following partition settings:
spark.sql.shuffle.partitions 1000
spark.default.parallelism 2000
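
The three-to-five-times rule can be turned into a quick calculation. The following Python sketch is illustrative only: the function name `partition_range` and the example core count of 200 are assumptions, not part of Spark or HiBench.

```python
def partition_range(total_cores, low_factor=3, high_factor=5):
    """Return the (minimum, maximum) partition count for a total core count.

    The default factors follow the three-to-five-times guideline.
    """
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return total_cores * low_factor, total_cores * high_factor


# Hypothetical cluster with 200 CPU cores available to Spark executors.
low, high = partition_range(200)
print(f"Suggested partition count: between {low} and {high}")
```

Treat the result as a starting range, then tune the values against measured run times in your own environment.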
Based on the actual environment, adjust the number of executor cores and the memory size that HiBench specifies in its configuration file to achieve optimal performance. For example, the following executor parameters are recommended for TeraSort on the Kunpeng 920 processor:
yarn.executor.num 27
yarn.executor.cores 7
spark.executor.memory 25G
spark.driver.memory 36G
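
Before applying executor settings like these, it can help to check what they add up to across the cluster. The sketch below is a minimal illustration: the helper name `executor_footprint` is hypothetical, and the calculation deliberately ignores the driver and any per-executor memory overhead that YARN adds on top of the executor heap.

```python
def executor_footprint(num_executors, cores_per_executor, executor_memory_gb):
    """Return (total cores, total executor heap memory in GB).

    Excludes driver memory and YARN memory overhead, so the real
    cluster requirement is somewhat higher than the value returned.
    """
    total_cores = num_executors * cores_per_executor
    total_memory_gb = num_executors * executor_memory_gb
    return total_cores, total_memory_gb


# Values taken from the recommended TeraSort settings above.
cores, memory_gb = executor_footprint(27, 7, 25)
print(f"Total executor cores: {cores}")            # 27 * 7 = 189
print(f"Total executor memory: {memory_gb} GB")    # 27 * 25 = 675
```

Compare both totals with the cores and memory that YARN can actually allocate. If the settings request more than the cluster offers, some executors will not start.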
