K-means (CPU-intensive)
Purpose
K-means is a CPU-intensive computing task. You can tune I/O parameters and Spark executor parameters for optimal performance.
Procedure
- Adjust the Spark executor parameters to proper values. In this scenario, you can use the following partition settings:
spark.sql.shuffle.partitions 1000
spark.default.parallelism 2500
- Set the disk read-ahead size with the following kernel parameter, where $i is the device letter of each disk (for example, a for sda):
echo 4096 > /sys/block/sd$i/queue/read_ahead_kb
- Based on the actual environment, adjust the number of running cores and memory size specified by HiBench in the configuration file to achieve the optimal performance. For example, for the Kunpeng 920 5220 processor, the following executor parameters are recommended for K-means.
yarn.executor.num 42
yarn.executor.cores 6
spark.executor.memory 15G
spark.driver.memory 36G
spark.locality.wait 10s
- Adjust the JDK parameters by adding the following configurations to the spark-defaults.conf file:
spark.executor.extraJavaOptions -XX:+UseNUMA -XX:BoxTypeCachedMax=100000 -XX:ParScavengePerStrideChunk=8192
spark.yarn.am.extraJavaOptions -XX:+UseNUMA -XX:BoxTypeCachedMax=100000 -XX:ParScavengePerStrideChunk=8192
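The read_ahead_kb command in the procedure above uses $i as a placeholder, so it has to be run once per disk. A minimal sketch of doing that in a loop follows. The function name set_read_ahead, the sysfs-root argument (included only so the loop can be tried against a scratch directory first), and the example device letters are assumptions for illustration, not part of the original procedure. Writing under /sys requires root, and the setting does not persist across reboots.

```shell
#!/bin/sh
# Hypothetical helper: write a read-ahead size (in KB) for each listed sd device.
# Usage: set_read_ahead <kb> <sysfs_root> <device_letter>...
set_read_ahead() {
    kb=$1
    root=$2
    shift 2
    for i in "$@"; do
        f="$root/block/sd$i/queue/read_ahead_kb"
        if [ -w "$f" ]; then
            echo "$kb" > "$f"
        else
            # Device missing or insufficient privileges: report and continue.
            echo "skip: $f is not writable" >&2
        fi
    done
}

# Example (as root); the device letters b, c, and d are assumed:
# set_read_ahead 4096 /sys b c d
```

Devices that do not exist are skipped with a message on stderr rather than aborting the loop, so the same call can be reused on nodes with different disk counts.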
BiSheng JDK provides specific optimizations for K-means. You can use BiSheng JDK to accelerate the execution. Perform the following steps to replace the original JDK with BiSheng JDK:
- Download BiSheng JDK.
- Replace the existing JDK with BiSheng JDK.
- Stop the cluster service to prevent service exceptions caused by the JDK switch.
- Extract the BiSheng JDK package and move it to /usr/local/.
tar -zxvf bisheng-jdk-8u262-linux-aarch64.tar.gz
mv bisheng-jdk1.8.0_262 /usr/local/
- Rename the original JDK directory, put BiSheng JDK in its place, and set the directory permissions.
Rename the original JDK directory, for example, rename jdk8u222-b10/ to jdk8u222-b10-openjdk/. Then rename the BiSheng JDK directory to the same directory name as the original JDK directory, for example, jdk8u222-b10/.
mv jdk8u222-b10/ jdk8u222-b10-openjdk/
mv bisheng-jdk1.8.0_262/ jdk8u222-b10/
chmod -R 755 jdk8u222-b10/
- Restart the cluster service to ensure that the JDK changes take effect.
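The rename-and-replace step above can be wrapped in a small function that refuses to run if a directory is missing or the backup name is already taken, which avoids overwriting an earlier backup. This is a sketch under assumptions: the function name swap_jdk and its arguments are hypothetical, and the base directory is passed in explicitly because the commands above do not state the working directory. Stop the cluster service before calling it.

```shell
#!/bin/sh
# Hypothetical helper: back up the original JDK directory and move the new JDK into its place.
# Usage: swap_jdk <base_dir> <original_jdk_dir> <new_jdk_dir>
swap_jdk() {
    base=$1
    orig=$2
    new=$3
    backup="${orig}-openjdk"   # same backup naming as the example above

    [ -d "$base/$orig" ] || { echo "missing: $base/$orig" >&2; return 1; }
    [ -d "$base/$new" ] || { echo "missing: $base/$new" >&2; return 1; }
    [ ! -e "$base/$backup" ] || { echo "backup already exists: $base/$backup" >&2; return 1; }

    # Each step runs only if the previous one succeeded.
    mv "$base/$orig" "$base/$backup" &&
    mv "$base/$new" "$base/$orig" &&
    chmod -R 755 "$base/$orig"
}

# Example with the directory names used above; /usr/local as the base is an assumption:
# swap_jdk /usr/local jdk8u222-b10 bisheng-jdk1.8.0_262
```

To roll back, stop the cluster service again and move the -openjdk directory back to its original name.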
- Modify the Spark configuration to make full use of the optimization features of BiSheng JDK.
- Open the Spark configuration file.
vi /opt/HiBench-HiBench-7.0/conf/spark.conf
- Press i to enter the insert mode. Change the value of spark.executor.extraJavaOptions.
spark.executor.extraJavaOptions -XX:+UnlockExperimentalVMOptions -XX:+EnableIntrinsicExternal -XX:+UseF2jBLASIntrinsics -Xms43g -XX:ParallelGCThreads=8
Adjust the value of -Xms (43g in this example) according to the value of spark.executor.memory in /opt/HiBench-HiBench-7.0/conf/spark.conf. You are advised to set -Xms to the value of spark.executor.memory minus 1 GB; for example, -Xms43g corresponds to a spark.executor.memory value of 44G.
- Press Esc, type :wq!, and press Enter to save the file and exit.
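The -Xms rule above (spark.executor.memory minus 1) can be derived from the configuration file instead of being computed by hand. The following is a minimal sketch, assuming the memory value is written in whole gigabytes with a G or g suffix; the function name xms_from_conf is hypothetical, and other units are rejected rather than converted.

```shell
#!/bin/sh
# Hypothetical helper: print an -Xms flag one gigabyte below spark.executor.memory.
# Assumes the value is written in whole gigabytes, such as 44G or 44g.
# Usage: xms_from_conf <path_to_spark_conf>
xms_from_conf() {
    conf=$1
    # Take the last spark.executor.memory entry; commented-out lines do not match.
    mem=$(awk '$1 == "spark.executor.memory" { v = $2 } END { print v }' "$conf")
    case $mem in
        *[Gg]) gb=${mem%[Gg]} ;;
        *) echo "unsupported spark.executor.memory value: '$mem'" >&2; return 1 ;;
    esac
    case $gb in
        ''|*[!0-9]*) echo "not a whole number of gigabytes: '$mem'" >&2; return 1 ;;
    esac
    [ "$gb" -gt 1 ] || { echo "value too small: '$mem'" >&2; return 1; }
    echo "-Xms$((gb - 1))g"
}

# Example:
# xms_from_conf /opt/HiBench-HiBench-7.0/conf/spark.conf
```

With spark.executor.memory set to 44G, the function prints -Xms43g, which matches the example value used in spark.executor.extraJavaOptions.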