Other Tuning Items
Purpose
Set the number of partitions according to the number of CPU cores so that each core processes roughly the same volume of data. This reduces data skew and keeps any single core from running much longer than the others.
Procedure
- For this workload, set the number of shuffle partitions and the default parallelism to three to five times the total number of CPU cores. More partitions mean less data per task, which improves performance. For example, use the following partition settings:
  spark.sql.shuffle.partitions 1000
  spark.default.parallelism 2000
- Adjust the number of executor cores and the memory size that HiBench specifies in its configuration file to suit the actual environment and achieve optimal performance. For example, on the Kunpeng 920 5220 processor, the following executor parameters are recommended for TeraSort:
  yarn.executor.num 27
  yarn.executor.cores 7
  spark.executor.memory 25G
  spark.driver.memory 36G
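The "three to five times the total number of CPU cores" rule can be worked through with a small calculation. The following Python sketch is illustrative only: the helper name `partition_range` is hypothetical, and it assumes that the total core count equals the number of executors multiplied by the cores per executor, which the guide does not state explicitly. Treat the result as a starting range for tuning, not as a replacement for the values recommended above.

```python
def partition_range(num_executors, cores_per_executor, low=3, high=5):
    """Return (min, max) partition counts as multiples of the total core count.

    Assumption: total cores = executors * cores per executor.
    low/high are the "three to five times" multipliers from the procedure.
    """
    total_cores = num_executors * cores_per_executor
    return total_cores * low, total_cores * high


if __name__ == "__main__":
    # Executor settings taken from the TeraSort example: 27 executors, 7 cores each.
    lower, upper = partition_range(27, 7)
    print(f"total cores: {27 * 7}")
    print(f"suggested partition range: {lower} to {upper}")
```

With 27 executors of 7 cores each, the sketch gives 189 cores and a range of 567 to 945 partitions. Measure actual job time when choosing a value, because the best setting depends on the data size and the cluster.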