Tuning HiBench

Purpose

Properly configure HiBench test parameters and cluster resources to fully leverage the Kunpeng multi-core architecture. Increasing task parallelism raises overall throughput and improves execution efficiency across HiBench scenarios.

Procedure

In HiBench tests, optimization begins with the data import phase, which significantly impacts overall performance. Key parameters include hibench.default.map.parallelism and hibench.default.shuffle.parallelism in the conf/hibench.conf file. These two parameters determine the number of data partitions, which in turn governs the parallelism of the Mapper and Reducer phases. A higher partition count produces a larger number of smaller data files, facilitating more efficient parallel processing. Following Spark's official recommendations for YARN-managed clusters, set each parallelism parameter to 2 to 3 times the total number of CPU cores, so that each core handles 2–3 tasks.
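As a quick sanity check, the recommended range can be computed directly from the cluster size. A minimal shell sketch; the node and core counts are the example values used later in this guide, not probed from a live cluster:

```shell
#!/bin/sh
# Derive the recommended HiBench parallelism range from cluster size.
# NODES and CORES_PER_NODE are illustrative values; substitute your own.
NODES=10
CORES_PER_NODE=32
TOTAL_CORES=$((NODES * CORES_PER_NODE))
# Spark's guidance for YARN-managed clusters: 2-3 tasks per core.
MIN_PARALLELISM=$((TOTAL_CORES * 2))
MAX_PARALLELISM=$((TOTAL_CORES * 3))
echo "Total cores: ${TOTAL_CORES}"
echo "Recommended parallelism: ${MIN_PARALLELISM} to ${MAX_PARALLELISM}"
```

For a 10-node cluster with 32 cores per node, this prints a total of 320 cores and a recommended parallelism range of 640 to 960.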

  1. Optimize during the data import phase.
    1. Verify cluster resources to provide a basis for subsequent parameter settings.
      1. Identify the number of cluster nodes and CPU cores per node.
      2. Calculate total cluster cores = Number of nodes × CPU cores per node.

        Example: If a cluster has 10 nodes with 32 cores each, the total core count is 320. The recommended initial parallelism is 640 to 960 (2 to 3 times the total cores).

    2. Set the parallelism parameters in the conf/hibench.conf file to the following values:
      hibench.default.map.parallelism=640
      hibench.default.shuffle.parallelism=640

      In scenarios with large volumes of intermediate data (for example, in TeraSort), each parallelism parameter can be increased to 3 to 5 times the total cores to reduce the input volume per task and improve processing efficiency.

    3. Execute the data import, then verify that the configured parameters took effect and that the data partitioning meets expectations.
      1. Run the HiBench data import command (for example, terasort or wordcount).
      2. After completion, access Utilities > Browse the file system on the HDFS NameNode web UI.
      3. Check the number of files in the corresponding HiBench data directory; it should match the configured parallelism value.
      4. Observe the file size: More partitions result in smaller individual files, which facilitates parallel task processing.
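    The expected file count and per-file size can be estimated before inspecting HDFS. A hedged sketch, assuming an illustrative 100 GiB dataset (the dataset size is an assumption, not a value from this guide):

```shell
#!/bin/sh
# Estimate the file layout expected in the HiBench data directory.
# DATASET_BYTES is an assumed example size; PARALLELISM matches the
# configuration set earlier. To count actual files on a live cluster,
# something like `hdfs dfs -ls <data-dir> | wc -l` can be used.
DATASET_BYTES=$((100 * 1024 * 1024 * 1024))   # assumed 100 GiB input
PARALLELISM=640
BYTES_PER_FILE=$((DATASET_BYTES / PARALLELISM))
echo "Expected files: ${PARALLELISM}"
echo "Approx. size per file: $((BYTES_PER_FILE / 1024 / 1024)) MiB"
```

    Raising the parallelism shrinks each file proportionally, which matches the observation that more partitions yield smaller individual files.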
  2. Select advantageous scenarios.

    In HiBench testing, compute-intensive scenarios such as WordCount, TeraSort, Bayes (naive Bayesian classification), and K-means are better suited to Kunpeng clusters. Compared with SQL-based scenarios, these workloads involve more CPU-intensive computation stages under similar data I/O conditions, better demonstrating the advantages of multi-core parallelism.

  3. To maximize hardware performance, check that the hardware environment is properly configured and deployed based on actual conditions (for example, RAID controller cache policies, network interface configurations, and block device settings).
  4. Optimize the software configuration.
    1. Foundational configuration: Maintain consistent foundational configuration standards between the client and the server.
    2. Executor parameters: Derive the optimal number and configuration of executors based on the total cluster cores using theoretical formulas to ensure full resource utilization.
    3. Affinity configuration: Perform CPU affinity tuning on the Kunpeng platform to enhance task scheduling efficiency and execution performance.
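    For deriving executor parameters, one widely used rule of thumb (not a formula prescribed by this guide) reserves a core per node for the OS and Hadoop daemons, fixes the cores per executor, and divides. The node memory and per-executor core count below are assumptions:

```shell
#!/bin/sh
# Rule-of-thumb executor sizing for a YARN cluster (assumed values).
NODES=10
CORES_PER_NODE=32
MEM_PER_NODE_GB=128                 # assumed memory per node
CORES_PER_EXECUTOR=5                # common convention for HDFS throughput
USABLE_CORES=$((NODES * (CORES_PER_NODE - 1)))              # 1 core/node reserved
TOTAL_EXECUTORS=$((USABLE_CORES / CORES_PER_EXECUTOR - 1))  # -1 for the YARN AM
EXECUTORS_PER_NODE=$((USABLE_CORES / CORES_PER_EXECUTOR / NODES))
MEM_PER_EXECUTOR_GB=$(( (MEM_PER_NODE_GB - 8) / EXECUTORS_PER_NODE ))  # OS headroom
echo "spark.executor.instances=${TOTAL_EXECUTORS}"
echo "spark.executor.cores=${CORES_PER_EXECUTOR}"
echo "spark.executor.memory=${MEM_PER_EXECUTOR_GB}g"
```

    Treat the output as a starting point and validate it against the actual YARN container limits on your cluster.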
  5. Optimize the data partitioning strategy: The goal is to reduce file granularity and increase parallelism, preventing performance bottlenecks caused by a single task processing excessive data.
    1. General scenarios: Set the number of data partitions to 2 to 3 times the total cluster cores.
    2. Scenarios such as TeraSort with significant intermediate data: It is advised to set partitions to 3 to 5 times the core count to reduce the data volume per task and improve execution efficiency.
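    The partitioning strategy above can be sketched as a small helper that picks the multiplier by workload type. The workload variable and the choice of 4 (midpoint of the 3 to 5 range) are illustrative assumptions:

```shell
#!/bin/sh
# Pick hibench.conf parallelism from total cores and workload type.
# WORKLOAD and the chosen multipliers are illustrative assumptions.
TOTAL_CORES=320
WORKLOAD=terasort
case "$WORKLOAD" in
  terasort) MULTIPLIER=4 ;;   # shuffle-heavy: 3-5x total cores
  *)        MULTIPLIER=2 ;;   # general scenarios: 2-3x total cores
esac
PARTITIONS=$((TOTAL_CORES * MULTIPLIER))
echo "hibench.default.map.parallelism=${PARTITIONS}"
echo "hibench.default.shuffle.parallelism=${PARTITIONS}"
```

    For a 320-core cluster running TeraSort, this yields 1280 partitions, within the advised 3 to 5 times band.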