Application Acceleration with the Kunpeng BoostKit

Kunpeng BoostKit enables high-performance data access and processing and an optimal cloud phone experience.

Figure 1 Basic acceleration capabilities of Kunpeng BoostKit for each scenario

Figure 2 Application acceleration capabilities of Kunpeng BoostKit for each scenario

Cloud Phone

The Kunpeng BoostKit for Cloud Phone solution leverages homogeneous Arm instruction sets to support lossless migration of mobile apps to the cloud. It delivers a cloud phone Turbo toolkit by incorporating the Kbox cloud phone container, instruction stream engine, video stream engine, and other capabilities, increasing the server deployment density and reducing the cost per cloud phone. The host OS can be Ubuntu or openEuler and the guest OS can be android-9.0.0_r55 or android-11.0.0_r48. The cloud phone Turbo toolkit facilitates secondary development, simplifying cloud phone application development and improving service experience.

Figure 3 Overall architecture of cloud phone development with Kunpeng BoostKit for Cloud Phone

ExaGear AArch32 instruction translation software
The ExaGear AArch32 instruction translation software provides AArch32 features for Kunpeng servers. This software is fully compatible with AArch32 applications in cloud phone scenarios, ensuring that AArch32 v8.0 instructions run smoothly on Kunpeng servers. In addition, this software can shorten the application startup time by using the pre-translator feature.
Kbox cloud phone container
The Kbox cloud phone container provides basic capabilities for software-defined mobile phones. It leverages the Docker container technology and Android Open Source Project (AOSP) to implement a lightweight emulation architecture solution that directly connects GPUs to containers. It provides an Android-based cloud phone container reference solution on Kunpeng servers. This solution features high density and broad compatibility. A single Kunpeng server supports the concurrent running of up to 100 x 720p@30 fps Kbox cloud phones in the hosting scenario (the concurrent number of cloud phones varies with the actual application scenario).
Video stream engine
This video stream-based device-cloud synergy engine provides low-latency image synchronization for cloud phones. It supports H.264 and H.265 encoding; at the same image quality, H.265 reduces bandwidth consumption by more than 30%. The engine uses the powerful capabilities on the cloud to run and render applications and games, compress and stream videos, and display the videos on terminals. It supports core functions such as video encoding/decoding/playback, cloud phone screenshot, touch control, and audio capture/playback. You can perform secondary development based on these functions to control applications and games on mobile terminals. Professional graphics cards are used for rendering on the cloud, so terminals need only basic video decoding capability to display high-quality images. Unified APIs simplify secondary development and facilitate integration.
Instruction stream engine
The instruction stream engine uses device-cloud separated rendering, a technology unique in the industry. No GPU needs to be deployed on the cloud, decreasing the overall hardware cost by 10%. Drawing on the extensive computing power on the cloud, the instruction stream cloud phone solution uses the engine to copy the rendering instructions of applications and games on the cloud, and to compress and stream the rendering instructions and texture data. On the device, the GPUs of mobile phones render the instructions into images. The instruction stream engine supports core functions such as separate rendering of instructions, video streaming of texture data, touch control, and audio capture and playback. You can perform secondary development based on these functions to control applications and games on mobile terminals. The instruction stream engine can render the entire cloud phone system and provide near-lossless image quality at 1080p, 2K, or 4K resolution without affecting the transmission bandwidth. In addition, the resource buffering technology reduces bandwidth consumption by more than 50%. The instruction stream engine breaks through the GPU capability limitation on the cloud, enables a GPU-free high-density running mechanism, and reduces the hardware cost per channel by 40%. Moreover, it supports local execution and remote synchronization of the graphics rendering state machine and delivers 1080p@30 fps at low latency.

Big data

Kunpeng BoostKit for Big Data accelerates big data analytics to address efficiency and performance issues, with a series of features including OmniRuntime, machine learning and graph analysis algorithm libraries, as well as open source enablement and optimization of big data components.

Figure 4 Acceleration packages for big data applications

openEuler and BiSheng JDK performance optimization
For the core big data components Hive (2.X/3.X) and Spark (2.X), openEuler improves big data computing performance through optimization of disk I/O and network I/O scheduling policies and through NEON instruction optimization. BiSheng JDK, incubated in the open source JDK community built around Kunpeng processors, improves the computing performance of Hive and Spark through AppCDS, GC algorithm optimization, and compilation optimization. It increases Hive performance by 2% to 25% and Spark performance by 3% to 25%.
OmniRuntime
OmniRuntime consists of a series of application acceleration features provided by Kunpeng BoostKit for Big Data. It aims to improve the big data analysis performance using plugins throughout data loading, computing, and exchange. The features include OmniData (operator pushdown), OmniOperator (operator acceleration), OmniShuffle (shuffle acceleration), OmniMV (materialized views), OmniAdvisor (parameter tuning), and OmniHBaseGSI (global secondary indexes). Spark leverages OmniRuntime to perform SQL computations, resulting in a 20% to 40% performance improvement compared to computations without OmniRuntime. The following components are included:
- OmniData
  OmniData is a great choice for big data scenarios where storage and compute are decoupled or coupled at scale. It supports Spark 3.0.0, Spark 3.1.1 and Hive 3.1.0 (Tez 0.10.0). It pushes operators of a big data engine to storage or offload nodes to implement near-data computing. This reduces network bandwidth consumption and improves the query performance of the query engine. According to a TPC-H test with OmniData enabled, the performance of Spark executing 12 SQL statements is improved by an average of 40%, and the performance of Hive executing 4 SQL statements is improved by an average of 20%.
- OmniOperator
  OmniOperator is suited for virtualization scenarios and supports Spark 3.1.1, Spark 3.3.1, Spark 3.4.3, Spark 3.5.2, and Hive 3.1.0. It uses native code (C/C++) to implement big data SQL operators to improve query performance. With columnar storage and vectorized execution technologies as well as the Kunpeng BoostKit Library, OmniOperator improves operator execution efficiency and query performance of the query engine. According to a TPC-DS test with OmniOperator enabled, the performance of Spark executing 99 SQL statements is improved by 30%.
- OmniShuffle
  OmniShuffle is suited for virtualization scenarios. It supports Remote Direct Memory Access (RDMA) and TCP network modes and Spark 3.1.1, Spark 3.3.1, and Hive 3.1.0. Based on network media such as TCP and RDMA, OmniShuffle optimizes the cross-node data write, transmission, and read processes during data analysis to improve the data shuffle and analysis performance. According to a TPC-DS test with OmniShuffle enabled, the performance of Spark executing 99 SQL statements is improved by 40%.
- OmniMV
  OmniMV is suited for virtualization scenarios and supports Spark 3.1.1, Spark 3.4.3, and ClickHouse 22.3.6.5. It uses AI algorithms to recommend the optimal materialized view from historical SQL queries, automatically matches SQL statements with a materialized view in Spark or ClickHouse, and replaces part of the SQL statements in an execution plan with the matched materialized view. OmniMV greatly reduces repeated calculations and increases query efficiency. According to a TPC-DS test with OmniMV enabled, the Spark computing performance is improved by 30%, and according to a Star Schema Benchmark test with OmniMV enabled, the ClickHouse computing performance is improved by several times.
- OmniAdvisor
  OmniAdvisor is suited for VM scenarios and supports Spark 3.1.1 and Hive 3.1.0 (Hive on Tez mode). The Spark and Hive big data engines have a large number of parameters that require tuning to get the optimal performance. However, manual tuning is often far from being efficient. OmniAdvisor aims to improve the tuning efficiency through AI-based automatic parameter recommendation. According to a TPC-DS test with OmniAdvisor enabled, the performance of Spark executing 10 SQL statements is improved by 10%.
- OmniHBaseGSI
  OmniHBaseGSI is suited for VM scenarios and supports HBase 2.4.14. Open source HBase has a built-in primary key index. However, if a non-row key is used for query, the entire table must be scanned, which consumes a large number of resources and also prolongs the query duration. OmniHBaseGSI, the HBase global secondary index feature, creates global secondary indexes for non-row key columns to accelerate queries on these columns. OmniHBaseGSI ensures an average latency of less than 30 ms and P99 latency of less than 300 ms in the case of 100 concurrent connections.
- OmniScheduler
  In a Hadoop (Hadoop 3.3.4 supported) cluster with unbalanced load between nodes, the OmniScheduler Yarn load scheduling algorithm optimizes the open source Capacity Scheduler to schedule resources based on the weight calculation and sorting results of cluster nodes' physical resources. OmniScheduler enables balanced resource configuration and efficient resource utilization.
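The global secondary index idea described for OmniHBaseGSI above can be illustrated with a minimal sketch. This is not OmniHBaseGSI code or the HBase API; the `Table` class and its methods are hypothetical and only contrast a full scan on a non-row-key column with a lookup through a secondary index.

```python
class Table:
    """Toy key-value table: row key -> dict of column values (hypothetical)."""

    def __init__(self):
        self.rows = {}
        self.index = {}              # (column, value) -> set of row keys
        self.indexed_columns = set()

    def create_index(self, column):
        """Build a secondary index over rows that already exist."""
        self.indexed_columns.add(column)
        for key, row in self.rows.items():
            if column in row:
                self.index.setdefault((column, row[column]), set()).add(key)

    def put(self, key, row):
        """Insert or overwrite a row and keep every index in sync."""
        old = self.rows.get(key, {})
        for column in self.indexed_columns:
            if column in old:
                self.index.get((column, old[column]), set()).discard(key)
            if column in row:
                self.index.setdefault((column, row[column]), set()).add(key)
        self.rows[key] = row

    def scan(self, column, value):
        """Query without an index: every row must be inspected."""
        return sorted(k for k, r in self.rows.items() if r.get(column) == value)

    def lookup(self, column, value):
        """Query through the index: only matching row keys are touched."""
        if column not in self.indexed_columns:
            raise KeyError(f"no index on column {column!r}")
        return sorted(self.index.get((column, value), set()))
```

The trade-off shown here is the usual one for secondary indexes: `put` does extra work to maintain the index so that `lookup` can avoid the full scan.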
Machine learning and graph analysis algorithms
The machine learning and graph analysis algorithm libraries are compatible with the APIs of open source Spark machine learning and graph analysis algorithms and support Spark 2.3.2, 2.4.5, and 2.4.6. Some of the algorithms also support Spark 3.1.1 and Spark 3.3.1. The algorithms are optimized for Kunpeng processors based on algorithm principles, greatly improving performance over the native algorithms. Compared with open source Spark's native MLlib and GraphX algorithms, the Kunpeng-based libraries deliver over 20% higher computing performance at the same precision level.

SDS

Kunpeng BoostKit for SDS addresses the issues of low performance and high costs in open source Ceph storage. It gives full play to the computing power of Kunpeng with a wealth of features including storage acceleration algorithms and Ceph acceleration libraries.

Compression algorithm
This data compression algorithm is suited for block and object storage services. Compared with mainstream open-source compression algorithms, it increases the compression ratio by 25% and bandwidth performance by 10%.
KSAL
The Kunpeng Storage Acceleration Library (KSAL) supports Ceph 14.2.8. It uses Kunpeng-optimized algorithms to replace open source algorithms, improving storage performance. The KSAL includes the erasure code (EC) algorithm, CRC16 T10DIF algorithm, and CRC32C algorithm.
- EC algorithm
  Based on the Huawei-developed vectorized EC encoding and decoding solution, the EC algorithm replaces the high-order finite field GF(2^w) multiplication required in erasure coding with binary matrix multiplication through isomorphism mapping. This allows exclusive OR (XOR) operations to replace complex finite field multiplication implemented through table lookup. In addition, the EC algorithm uses an encoding orchestration algorithm to reuse intermediate results in the parity block calculation, which reduces the number of XOR operands, and works with Kunpeng vectorized instructions to accelerate encoding. The KSAL EC algorithm delivers 2x the encoding performance of mainstream open source EC.
- CRC16 T10DIF and CRC32C algorithms
  CRC16 T10DIF and CRC32C use a modulo algorithm for large numbers and Kunpeng vectorized instructions to accelerate encoding. Compared with mainstream open source algorithms, the 4 KB verification performance of CRC16 T10DIF is increased by 130% and that of CRC32C is increased by 30%.
- memcpy algorithm
  The memcpy algorithm uses CPU prefetch and Kunpeng vectorized instruction acceleration. It improves the 4 KB performance by 30% compared with the built-in memcpy algorithm of glibc.
- DAS smart prefetch
  The DAS smart prefetch algorithm analyzes I/O information and prefetches data to the read cache in advance, doubling the read performance of 4 KB sequential streams.
- zstd algorithm for compressing the metadata of 10 billion objects in Ceph object storage
  The zstd algorithm reduces Ceph RocksDB metadata while keeping the performance impact below 50% in object storage at the scale of 10 billion objects.
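The mapping from GF(2^w) multiplication to binary matrix multiplication mentioned for the EC algorithm above can be sketched as follows. This is a generic illustration, not KSAL code: it assumes w = 8 and the reduction polynomial 0x11D (a common choice in erasure coding libraries, not stated in this document), and all function names are hypothetical. Multiplying by a constant c is linear over GF(2), so it can be written as a w x w bit matrix whose column j is c * x^j, and applying the matrix needs only bit tests and XOR.

```python
W = 8
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1; assumed here for illustration


def gf_mul(a, b):
    """Reference GF(2^8) multiplication using shift-and-reduce."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & (1 << W):
            a ^= POLY  # reduce modulo the field polynomial
        b >>= 1
    return result


def bit_matrix(c):
    """Columns of the binary matrix for 'multiply by c'.

    Column j is the field product c * x^j, stored as an integer bit mask.
    """
    return [gf_mul(c, 1 << j) for j in range(W)]


def mul_by_matrix(columns, a):
    """Multiply by the constant using only bit tests and XOR."""
    result = 0
    for j, column in enumerate(columns):
        if (a >> j) & 1:
            result ^= column
    return result
```

Because the matrix for each coding coefficient is fixed, it can be computed once and then applied to whole blocks of data, which is what makes XOR-based encoding a good fit for vector instructions.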
KSML
The Kunpeng Storage Maintenance Library (KSML) is a Huawei-developed storage maintenance library that provides HDD/SSD fault prediction and slow HDD/SSD detection. Based on machine learning algorithms, this library collects SMART data to train models, predicts and identifies potential faulty drives in storage clusters, and collects the svctm information of system drives to detect slow drives.
KAE-enabled SPDK
As the virtual device layer, the SPDK block device (bdev) interconnects with underlying virtual and physical devices. By enabling compression, encryption, and decryption in the bdev, all SPDK devices can be supported. The KAE performs compression, encryption, and decryption using Zlib and OpenSSL. Specifically, the KAE is enabled for the bdev to implement hardware offload.
EC Turbo
The EC Turbo feature supports Ceph 14.2.8 and is suitable for block and object storage services. It does not work on bcache. It optimizes the EC process of the open source Ceph solution to decrease the I/O amplification ratio in the data read/write process. In balanced configuration, the performance of EC Turbo can reach over 80% of the x86 three-copy mode while reducing the storage cost by 50% for the block storage service over the open source EC technology. For the object storage service, the performance of EC Turbo (4+2) can reach over 80% of the x86 three-copy mode while reducing the storage cost by 50% in large I/Os and maintaining the equivalent cost in small I/Os.
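One way to read the 50% figure for EC Turbo (4+2) against three-copy mode is in terms of raw capacity per unit of usable data. The sketch below works through that arithmetic; treating raw capacity as a proxy for storage cost is an assumption made here, and the helper names are hypothetical.

```python
from fractions import Fraction


def raw_per_usable_replica(copies):
    """Raw capacity consumed per unit of usable data with N full copies."""
    return Fraction(copies)


def raw_per_usable_ec(k, m):
    """Raw capacity per unit of usable data with k data and m parity chunks."""
    return Fraction(k + m, k)


def saving(baseline, candidate):
    """Fraction of raw capacity saved by the candidate over the baseline."""
    return 1 - candidate / baseline


three_copy = raw_per_usable_replica(3)   # 3
ec_4_2 = raw_per_usable_ec(4, 2)         # 3/2
print(saving(three_copy, ec_4_2))        # fraction of raw capacity saved
```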
Smart write cache
The smart write cache feature supports Ceph 14.2.8 and is suited for data writes in block and object storage services. It uses I/O passthrough, QoS policies, writeback policies, and GC policies to improve the write performance of Ceph clusters where bcache is enabled. In block storage random write scenarios, this feature increases the IOPS performance by more than 20%.
I/O passthrough
The I/O passthrough tool is an I/O process optimization tool for Ceph clusters in balanced configuration. It automatically improves Ceph cluster performance. In balanced configuration, I/O passthrough can increase the storage performance by more than 15%.
Data compaction
The data compaction algorithm eliminates data waste caused by zero padding and combines with functions including data encapsulation, space allocation based on block counting, granularity-based traffic diversion, batch submission, and batch callback to improve the data reduction ratio and overall system IOPS. This reduces costs and improves performance. The data compression ratio is increased by more than 20% without hindering system performance.
Metadata acceleration
Based on RocksDB, the metadata acceleration feature uses a Huawei-developed algorithm to enable Kunpeng acceleration for better storage performance. Compared with open source RocksDB, this feature improves the performance in mixed read/write scenarios by more than 30%.
Ucache smart read cache
The Ucache smart read cache feature uses smart I/O prefetch to accurately identify hotspot requests, prefetch I/Os that follow sequential, interval, and other patterns, and load them into the read cache in advance. In addition, the read cache uses the LRU algorithm to evict cold data, increasing the I/O hit ratio and therefore read performance. It doubles the performance of hotspot, sequential, and interval I/Os.
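The combination of prefetch and LRU eviction described above can be sketched in a few lines. This is not Ucache code; the `ReadCache` class, its `backend` callable, and the simple "previous block plus one" detector are all assumptions chosen to keep the example small. It covers only the sequential pattern.

```python
from collections import OrderedDict


class ReadCache:
    """Toy LRU read cache with sequential prefetch (hypothetical)."""

    def __init__(self, backend, capacity, prefetch_depth=2):
        self.backend = backend            # callable: block number -> data
        self.capacity = capacity
        self.prefetch_depth = prefetch_depth
        self.cache = OrderedDict()        # oldest entry first
        self.last_block = None
        self.hits = 0
        self.misses = 0

    def _insert(self, block, data):
        self.cache[block] = data
        self.cache.move_to_end(block)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used

    def read(self, block):
        if block in self.cache:
            self.hits += 1
            self.cache.move_to_end(block)   # mark as recently used
            data = self.cache[block]
        else:
            self.misses += 1
            data = self.backend(block)
            self._insert(block, data)
        # Sequential pattern: this read directly follows the previous one,
        # so load the next blocks before they are requested.
        if self.last_block is not None and block == self.last_block + 1:
            for nxt in range(block + 1, block + 1 + self.prefetch_depth):
                if nxt not in self.cache:
                    self._insert(nxt, self.backend(nxt))
        self.last_block = block
        return data
```

With a sequential stream, only the first two reads miss; later reads are served from blocks that were prefetched while the pattern was being followed.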
BoostIO
In the decoupled storage and compute architecture, BoostIO uses memory and drive resources on the compute side to build a distributed multi-tier cache. The write cache uses RDMA high-speed communication, cache affinity, data replication, and linear layout characteristics to improve service write performance and data reliability. The read cache pre-loads hotspot data to cache drives through data prefetch and leverages the LRU and cold and hot data identification algorithms to improve the read cache hit ratio, thereby improving the read performance.
RDMA network acceleration
A plugin is applied to the Ceph network framework AsyncMessenger to support the UCX network framework, which enables full RDMA in Ceph all-flash storage. The UCX communication processing layer adapts Ceph and UCX interfaces and implements zero copy based on rendezvous (RNDV) protocol characteristics to improve large-block write performance.

Confidential computing

The Kunpeng BoostKit for Confidential Computing TrustZone Kit is an ARM TrustZone–based software kit, including the Huawei-developed TEE secure OS iTrustee, iBMC and BIOS of the Kunpeng server, and open source OS driver and SDK. It helps to build confidential computing solutions and aims to provide integrity, confidentiality protection, and trusted use for your key data.

The TrustZone kit is not a mandatory component of Kunpeng servers. If you need the TrustZone kit, specify that you want the TEE function when purchasing a Kunpeng server. A Kunpeng server with the TEE function comes preconfigured with the TrustZone kit.

Based on TrustZone, iTrustee offers a complete security solution, including a client application (CA) in normal mode, a trusted application (TA) in secure mode, and a trusted OS in secure mode.

iTrustee is suited for financial data mining and protects data confidentiality during data processing. It ensures trusted data transactions in all-in-one data centers. It is also a trusted identity authentication solution in privacy computing that prevents privacy leakage during computing.

iTrustee is implemented based on the Huawei-developed microkernel. This secure OS has been put into commercial use on mobile phones for nearly 10 years, and its user base has exceeded 100 million. iTrustee has been granted the CC EAL4+ security certification and the GlobalPlatform compatibility certification. The secure memory in the TEE can be configured on demand. A maximum of 512 GB secure memory can be configured to support large applications such as big data and AI.

Figure 5 Confidential computing

TEE SDK: It provides the rich execution environment (REE) and TEE APIs, TA/CA encryption and signature tools, reference code, and API description for developers to quickly build applications.
REE patch: An OS driver that includes the kernel module and a user interface library.
TEE OS: A Huawei-developed secure OS that provides services such as encryption, decryption, and secure storage for trusted applications (TAs) and ensures the integrity and confidentiality of TAs.
BIOS: It completes TEE OS decryption and verification to ensure the confidentiality and integrity of the TEE OS.
BMC: It manages and upgrades the TEE OS.

The Kunpeng BoostKit for Confidential Computing TEE Kit enables confidential virtual machines (cVMs) in the trusted execution environment (TEE) using the Secure Exception Level 2 (S-EL2) feature. With the TEE Kit, software stacks in common VMs can be migrated to a confidential environment without adaptation. The TEE Kit consists of the Kernel-based Virtual Machine (KVM), TrustZone Management Interface (TMI), and TrustZone Management Monitor (TMM). Figure 6 illustrates the overall system architecture.

Figure 6 TEE Kit

**Table 1** TEE Kit description
Category	Subcategory	Description
Industrial customers	Host/Guest OS	Customers can choose a Linux OS that supports Virtualized Arm Confidential Compute Architecture (virtCCA) for the host and guest OSs. The host and guest OSs are open-sourced in the openEuler community.
	Libvirt and QEMU	Used to deploy and manage cVMs.
	KVM	Runs in the normal world to schedule tasks, allocate resources, and manage the lifecycle of all cVMs.
	TMI	TrustZone Management Interface. The KVM communicates with the TMM through the TMI.
Huawei deliverables	TMM	This virtualization component runs in the TEE and manages CPU and memory resources of cVMs. Generally, a Kunpeng server that supports the TEE Kit is equipped with the TMM (upgradeable) in the hardware platform.
	Hardware firmware	To support the TEE Kit, the hardware firmware is adapted as follows. BIOS: supports TMM decryption, secure boot, and function configuration. BMC: manages and upgrades the TMM. Firmware that supports the TEE Kit is pre-installed with the hardware on the production line. You need to obtain the latest firmware version.
	TEE Kit SDK	To let customers integrate the remote attestation and key derivation functions of the TEE Kit into their applications, the TEE Kit provides the following software development kits (SDKs): remote attestation library, sealing key library, and RATS-TLS library.

Measured boot
The measurement process in the TEE Kit for Kunpeng confidential computing is as follows:
- The Kunpeng Hardware Security Module (HSM) is used as the root of trust (RoT). The TEE-related firmware is measured when Kunpeng devices are starting up, and the measurement result is stored as a platform measurement report into the SRAM of the HSM.
- The kernel and startup parameters during VM startup are measured to generate a VM measurement report.
- The reports are packaged to form a complete measurement token, which provides the remote attestation capability.
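The three measurement steps above can be sketched as follows. This is an illustration of the general measure-then-compare pattern, not the TEE Kit token format: the use of SHA-256, the dictionary layout, and all function names are assumptions, and a real token would also be signed by the root of trust, which is omitted here.

```python
import hashlib


def measure(*components: bytes) -> str:
    """Hash one or more components into a single measurement."""
    digest = hashlib.sha256()
    for component in components:
        # A length prefix keeps (b"ab", b"c") distinct from (b"a", b"bc").
        digest.update(len(component).to_bytes(8, "big"))
        digest.update(component)
    return digest.hexdigest()


def build_token(firmware: bytes, kernel: bytes, boot_params: bytes) -> dict:
    """Package the platform and VM measurements into one token."""
    return {
        "platform": measure(firmware),          # measured at device startup
        "vm": measure(kernel, boot_params),     # measured at VM startup
    }


def verify_token(token: dict, reference: dict) -> bool:
    """A verifier accepts the token only if it matches reference values."""
    return token == reference
```

Any change to the firmware, kernel, or startup parameters produces a different measurement, which is what lets a remote verifier detect tampering.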
Remote attestation
The remote attestation feature of the Kunpeng BoostKit for Confidential Computing TEE Kit aims to prove that cVMs and the confidential computing platform are trustworthy. It verifies the following:
- Whether cVMs are running in a real confidential computing environment
- Whether cVM parameters or code has been tampered with
Sealing key
cVMs enabled by the TEE Kit support sealing keys. A cVM can generate an associated key, which remains unchanged even after the cVM is restarted.
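The property that a sealing key stays the same across restarts follows naturally if the key is derived from stable inputs rather than generated randomly. The sketch below shows that idea only; this document does not describe how the TEE Kit derives sealing keys, so the inputs (`platform_secret`, `vm_measurement`), the label, and the use of HMAC-SHA-256 are all assumptions.

```python
import hashlib
import hmac


def derive_sealing_key(platform_secret: bytes,
                       vm_measurement: bytes,
                       label: bytes = b"sealing-key") -> bytes:
    """Derive a 32-byte key deterministically from stable inputs.

    The same secret and measurement always yield the same key, so a
    restarted VM with an unchanged measurement recovers its key, while a
    VM with a different measurement gets an unrelated one.
    """
    message = label + b"\x00" + vm_measurement
    return hmac.new(platform_secret, message, hashlib.sha256).digest()
```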
Secure storage
Confidential computing protects data in use. Users can encrypt stored images and leverage remote attestation to implement secure storage.
Confidential containers
Confidential containers leverage core capabilities of cVMs and basic functions provided by the Kata/Coco community, such as encryption, decryption, signature verification, Nydus image acceleration, and remote attestation, to protect containers from end to end. Confidential containers are built on a software stack that combines Kubernetes, containerd, Kata, QEMU, and KVM with the Key Broker Service (KBS) and Attestation Service (AS) provided by the Coco community. The management plane of confidential containers still uses Kubernetes and containerd, so you can use confidential containers in much the same way as common containers.
Device passthrough
Device passthrough utilizes the PCIe protection controller (PCIPC) embedded in the PCIe root complex of the Kunpeng processor. A selector is added to the PCIe bus to regulate communication between the processor and peripherals. Operating through the system memory management unit (SMMU), this selector controls both inbound and outbound traffic. In confidential computing scenarios, PCIe devices can be directly connected to the TEE, eliminating data forwarding or copying operations to protect the entire data link. Because of this, Kunpeng supports heterogeneous confidential computing without requiring any device reconstruction.
Encryption with SM algorithms
Hardware-based acceleration for SM algorithms is powered by the Kunpeng processor, utilizing the Kunpeng Accelerator Engine capabilities in the TEE. It employs the openEuler UADK user-mode accelerator framework to enhance SM algorithm performance and enable algorithm offloading within cVMs.

Database

Kunpeng BoostKit for Database addresses MySQL performance issues in OLAP query acceleration and OLTP lock tuning. It provides a collection of acceleration features including the MySQL pluggable Kunpeng Online Vectorized Analysis Engine (KOVAE), MySQL lock-free tuning, MySQL pluggable thread pool, and MySQL CRC32 instruction optimization. These features improve OLAP query and analysis efficiency as well as OLTP performance and, together with optimization features designed for the Milvus vector database, unleash the performance of multi-core computing power. Best practices for mainstream open source and commercial databases are provided to help developers efficiently port and tune open source components.

NVMe SSD atomic write
The SSD atomic write feature can be used for all MySQL versions. It eliminates the doublewrite redundancy to increase database performance, approximately by 15% in write-intensive scenarios.
Gazelle network optimization
The Gazelle network optimization feature can be used for all MySQL versions. It directly reads and writes NIC packets in user mode based on the Data Plane Development Kit (DPDK), transmits packets through the shared hugepage memory, and enables the LwIP protocol stack. Gazelle network optimization greatly increases the network I/O throughput of applications. It is expected to increase the TPC-C comprehensive performance by 10%.
Kunpeng GCC CFGO
The Kunpeng GCC Continuous Feature Guided Optimization (CFGO) feature can be used for all MySQL versions. This feature provides continuous optimization for multi-modal objects (source code, assembly code, and binary) in more lifecycle phases (compile time, link time, and post-link time), to generate target programs with higher performance. It increases the comprehensive TPC-C performance of databases by 10%.
KAEzip compression and decompression optimization
This feature applies to Greenplum versions that support compression and decompression with zlib. It uses the Kunpeng hardware acceleration module to implement compression and decompression algorithms and works with a lossless userspace driver framework to improve query performance. KAEzip increases end-to-end performance by 10% in heavy I/O scenarios where only one request is processed at a time before the hardware bottleneck is reached.
MySQL parallel query optimization
This feature applies to MySQL 8.0.20 and MySQL 8.0.25. By default, MySQL schedules only one thread for a single SQL query and therefore cannot use multiple CPU cores for that query. As a result, single-query performance is poor and does not meet the requirements of query scenarios. Parallel query addresses this and more than doubles query performance. (The actual improvement depends on the degree of parallelism.)
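The general shape of a parallel query (partition the data, compute partial results on several workers, then merge) can be sketched as follows. This is not the MySQL implementation; the function names and the `degree` parameter are hypothetical. Note also that Python threads illustrate only the structure here, because the interpreter's global lock prevents CPU-bound threads from running truly in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, parts):
    """Split rows into at most `parts` contiguous chunks."""
    size = max(1, -(-len(rows) // parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_sum(chunk, column, predicate):
    """Work done by one worker: filter and aggregate its own chunk."""
    return sum(row[column] for row in chunk if predicate(row))


def parallel_sum(rows, column, predicate, degree=4):
    """SUM(column) WHERE predicate, evaluated by `degree` workers."""
    chunks = split(rows, degree)
    with ThreadPoolExecutor(max_workers=degree) as pool:
        partials = pool.map(lambda c: partial_sum(c, column, predicate), chunks)
        return sum(partials)  # merge step
```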
MySQL lock-free optimization
This feature applies to MySQL 8.0.20. In MySQL OLTP applications, a large number of DML statements (INSERT, UPDATE, and DELETE) are concurrently executed on key data structures in the trx_sys global structure, causing resource competition and synchronization bottlenecks in the critical section. When this feature is enabled, the lock-free hash table is used to maintain transaction units to reduce lock conflicts and improve write concurrency. A 20% increase in write performance is achieved for Sysbench.
MySQL fine-grained lock optimization
This feature applies to MySQL 8.0.20. In MySQL OLTP applications, a large number of concurrently executed DML statements (INSERT, UPDATE, and DELETE) contend for the lock_sys->mutex global lock, which protects key data structures, causing severe lock competition and performance deterioration. The original lock is replaced with fine-grained hash bucket locks to reduce lock conflicts and improve concurrency. The TPC-C comprehensive performance is expected to increase by 10%.
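Replacing one global lock with per-bucket locks is a general technique often called lock striping. The sketch below shows the idea on a simple counter table; it is not MySQL code, and the `StripedLockTable` class and its bucket count are hypothetical.

```python
import threading


class StripedLockTable:
    """Hash table guarded by one lock per bucket instead of a global lock."""

    def __init__(self, buckets=16):
        self.locks = [threading.Lock() for _ in range(buckets)]
        self.buckets = [dict() for _ in range(buckets)]

    def _slot(self, key):
        return hash(key) % len(self.locks)

    def add(self, key, amount=1):
        slot = self._slot(key)
        # Only threads whose keys hash to the same bucket contend here.
        with self.locks[slot]:
            bucket = self.buckets[slot]
            bucket[key] = bucket.get(key, 0) + amount

    def get(self, key):
        slot = self._slot(key)
        with self.locks[slot]:
            return self.buckets[slot].get(key, 0)
```

Threads that touch keys in different buckets never wait on each other, whereas with a single global lock every update would be serialized.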
MySQL NUMA scheduling optimization
This feature applies to MySQL 8.0.20 and MySQL 8.0.25. In high-concurrency MySQL OLTP applications, the default thread scheduling of the system causes frequent cross-NUMA access of threads. In this case, the CPU overhead increases and the performance deteriorates. Therefore, the foreground threads need to be dynamically bound to fixed NUMA CPUs to reduce cross-NUMA access and ensure that the CPU access load is balanced. Background threads need to be statically bound to fixed NUMA CPUs to reduce cross-NUMA access and improve background thread efficiency. The performance is increased by 10% in OLTP scenarios.
MySQL pluggable thread pool
This feature applies to MySQL 5.7.27, 8.0.20, 8.0.25, 8.0.30, and 8.0.35. Only the patches based on MySQL 8.0.25, 8.0.30, and 8.0.35 are pluggable and can be dynamically loaded. In high-concurrency MySQL OLTP applications, there are many threads, and CPU time is consumed by resource contention and frequent thread switching. With the thread pool enabled, all tasks are queued for execution based on the system execution capability. The number of tasks processed by each CPU at a time is limited (2 to 5 is optimal) to ensure stable service processing capabilities. In the OLTP TPC-C scenario with 10,000 concurrent tasks, MySQL performance is only about 10% of the optimal before the thread pool is enabled and is maintained at 85% of the optimal after it is enabled.
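The queueing behavior described above, where many requests are admitted but only a bounded number execute at once, can be sketched with a standard thread pool. This is not the MySQL thread pool plugin; `run_requests`, `ConcurrencyProbe`, and the trivial request handler are hypothetical stand-ins.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyProbe:
    """Records the highest number of requests running at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1


def run_requests(num_requests, pool_size):
    """Queue all requests, but execute at most `pool_size` concurrently."""
    probe = ConcurrencyProbe()

    def handle(request_id):
        with probe:
            return request_id * 2  # stand-in for executing a query

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(handle, range(num_requests)))
    return results, probe.peak
```

However many requests arrive, the observed peak concurrency never exceeds the pool size, which is how a thread pool keeps contention and switching overhead bounded under heavy load.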
CRC32 instruction optimization
This feature is provided as a patch package for MySQL 8.0.25. It uses Kunpeng CRC32 hardware instructions to replace the software implementation of the CRC32 algorithm, thereby improving system performance. It improves the MySQL sysbench write performance by 5%.
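To show what kind of software routine a hardware CRC32 instruction replaces, the following is a minimal table-driven CRC-32 sketch. It is not MySQL code. It assumes the common reflected polynomial 0xEDB88320 (the one used by zlib); this document does not state which CRC32 variant the patch targets.

```python
import zlib  # used only to cross-check the software routine


def make_table(poly=0xEDB88320):
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


TABLE = make_table()


def crc32_software(data: bytes, crc: int = 0) -> int:
    """Byte-at-a-time CRC-32: one table lookup, shift, and XOR per byte."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def matches_zlib(data: bytes) -> bool:
    """Check the sketch against the zlib reference implementation."""
    return crc32_software(data) == zlib.crc32(data)
```

The per-byte lookup loop is the part that dedicated CRC instructions shorten, since the hardware can fold several input bytes into the checksum in a single instruction.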
MySQL pluggable online vectorized analysis engine
This feature applies to MySQL 8.0.25. It is a lightweight implementation of the MySQL reserved API (secondary engine) and can be dynamically loaded as a plugin. It exploits the multi-core advantages of the Kunpeng processor through parallel computing of execution plans. This parallel acceleration technique improves the OLAP query performance by more than three times.
Milvus KScaNN optimization
This feature supports Milvus 2.4.5. It connects to KScaNN, a self-developed recall algorithm of Kunpeng, through the reserved Milvus interface. By utilizing the advantages of Kunpeng, the feature improves the query performance (QPS) by over 30% with a high recall rate (0.95).
Milvus KBest optimization
This feature supports Milvus 2.4.5. It connects to KBest, a self-developed recall algorithm of Kunpeng, through the reserved Milvus interface. By utilizing the advantages of Kunpeng, the feature improves the query performance (QPS) by over 30% with a high recall rate (0.99).
Milvus vector instruction optimization
This feature supports Milvus 2.4.5. It is implemented using the SVE instruction set as well as hardware and software prefetch. It reduces the overhead of distance function calculation to improve the Milvus query performance (QPS) by 20% with a high recall rate (0.99).

Virtualization

Kunpeng BoostKit for Virtualization addresses issues such as low light-load virtualization performance, heavy network loss, severe resource fragmentation, and open source ecosystem availability. It provides features such as Open vSwitch (OVS) flow table NIC acceleration to improve system performance, giving full play to the computing power of Kunpeng based on the multi-core architecture and inter-core isolation.

OVS flow table NIC acceleration
In virtualization scenarios, the OVS forwarding flow table is offloaded to the NIC hardware, and the table lookup capability of the hardware is used to accelerate flow table lookup and improve the processing capability of the virtualized network. The forwarding performance of the virtualized network is increased by 10 times.
Virtualization scheduling optimization
Kunpeng BoostKit for Virtualization accelerates CPU scheduling for applications on VMs based on software-hardware collaboration.
- The CPU topology structure is directly passed to the VM through the NUMA awareness and cluster awareness features. The VM OS kernel utilizes the cluster task scheduling optimization to accelerate multi-thread/process calls.
- The lock mechanism during preemption is optimized to improve VM performance in overcommitment scenarios.
- The hardware deadlock mechanism is introduced to prevent VM suspensions and recovery failures caused by hardware deadlocks.
KAE accelerated live migration
This feature is used for VM live migration. The KAE compression module provides the standard zlib interface KAEZlib. This module can replace the native zlib library to accelerate VM live migration. Compared with the native zlib library, KAE can significantly save CPU resources during compression and decompression for VM live migration. With the same CPU resource consumption, KAE can significantly increase VM live migration speed.
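Because the compression module exposes the standard zlib interface, code written against zlib does not need to change when the underlying library is replaced. The following minimal sketch uses Python's built-in `zlib` module as a stand-in for such a caller. Whether a given Python build actually picks up KAEZlib depends on how it is linked, which is an assumption outside this example, and the sample buffer is hypothetical.

```python
import zlib

# Stand-in for repetitive VM memory content sent during live migration.
page = b"guest memory page " * 256

# Level 1 favors speed over compression ratio.
compressed = zlib.compress(page, 1)
restored = zlib.decompress(compressed)

# The round trip is lossless regardless of which zlib implementation is loaded.
assert restored == page
print(len(page), "->", len(compressed), "bytes")
```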
Hardware-assisted virtualization acceleration
This feature is suited for network- and I/O-intensive services. When hardware-assisted virtualization acceleration is enabled on new Kunpeng 920 processor models, the direct interrupt injection of GICv4.1 (including vSGI passthrough) cuts interrupt response times and enhances throughput for demanding network and I/O workloads.
Hot swap
- vCPU hot swap: An ACPI Generic Event Device (GED) is emulated for the VM in accordance with the ACPI specification. During vCPU adjustment, CPU power-on and power-off are simulated dynamically through interrupts and handler functions.
- QEMU VM memory hotplug: The VM's XML configuration file defines a NUMA node with 0 initial memory, and memory is added to that node dynamically using memory hotplug commands.
MPAM plugin
MPAM helps restrict the memory bandwidth and L3 cache capacity occupied by offline services, preventing them from affecting the performance of real-time services.
- The MPAM plugin can be deployed on each compute node to configure resource groups in the YAML file. The memory bandwidth and L3 cache capacity are specified for each resource group.
- When an offline service is deployed, the resource group to which the service belongs can be specified in the YAML file.
- After the MPAM plugin detects a deployment task, the plugin allocates the process ID of the container service to the corresponding resource group. (The restriction information is configured on the hardware chip through the OS.)
  
  The MPAM plugin manages these shared resources: L2 cache, L3 cache, and DMC bandwidth.
Kubernetes NUMA affinity scheduling plugin
This feature is suited for container overcommitment scenarios. It supports Kubernetes 1.28.4 and containerd 1.7.14. It captures container requests through the Node Resource Interface (NRI) of the container runtime and sets the container cgroup parameters based on the scheduling policy, thereby implementing NUMA affinity management. This feature improves container performance by 5% to 10% in container overcommitment scenarios.
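The idea of NUMA affinity can be sketched as choosing one NUMA node for a container and confining both its CPUs and its memory to that node. The example below is a simplified illustration, not the plugin's implementation: the `NumaNode` type, the "most free CPUs" policy, and the choice of the `cpuset.cpus` and `cpuset.mems` cgroup files are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class NumaNode:
    node_id: int
    free_cpus: list  # CPU ids not yet pinned to a container


def pick_node(nodes, cpus_needed):
    """Return the node with the most free CPUs that can satisfy the request."""
    candidates = [n for n in nodes if len(n.free_cpus) >= cpus_needed]
    if not candidates:
        return None
    return max(candidates, key=lambda n: len(n.free_cpus))


def cpuset_for(node, cpus_needed):
    """Build cgroup cpuset values that keep CPUs and memory on one node."""
    cpus = sorted(node.free_cpus)[:cpus_needed]
    return {
        "cpuset.cpus": ",".join(str(c) for c in cpus),
        "cpuset.mems": str(node.node_id),
    }


nodes = [NumaNode(0, [0, 1]), NumaNode(1, [32, 33, 34, 35])]
node = pick_node(nodes, cpus_needed=2)
print(cpuset_for(node, 2))  # {'cpuset.cpus': '32,33', 'cpuset.mems': '1'}
```

Keeping a container's threads and memory on the same node avoids cross-node memory access, which is the effect that affinity management aims for.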
Kubernetes SR-IOV device plugin
This feature can improve network and encryption/decryption performance for Kubernetes. It uses the Kubernetes device plugin framework to manage SR-IOV devices, simplifying SR-IOV device passthrough for containers. The feature supports passthrough NICs and KAE devices, accelerating network and encryption/decryption performance in container scenarios.
Virtualized KAE (vKAE)
This feature is suited for services that frequently use encryption, decryption, compression, and decompression in Kunpeng VMs. The KAE is a hardware-based acceleration solution powered by Kunpeng processors and includes KAE encryption and decryption and KAEzip, which accelerate SSL/TLS applications and data compression, respectively. vKAE devices make these KAE capabilities available in VMs and containers. They can significantly reduce processor consumption and improve processor efficiency. In addition, KAE shields the internal implementation details from the application layer, so you can quickly migrate services by using the standard OpenSSL and zlib interfaces.

SRA

Kunpeng BoostKit for Search, Recommendation, and Advertisement (SRA) provides a full-stack acceleration solution for Internet services based on the Kunpeng platform. It covers the core search algorithms in recall scenarios, and the full-stack software and core AI operator library of TensorFlow for model inference in ranking scenarios.

SRA_Recall
SRA_Recall is a recall algorithm library provided by Huawei and optimized based on the Kunpeng platform. It includes KBest and KScaNN.
- Kunpeng Blazing-fast embedding similarity search thruster (KBest) is an efficient, Huawei-developed graph-based search algorithm. In multi-dimensional vector approximate nearest neighbor searches, KBest employs methods such as quantization and vector instructions to optimize search performance and precision. Its search capability is benchmarked against Faiss HNSW, and it is suited for network search, multi-modal search, recommendation systems, and retrieval-augmented generation (RAG).
- Kunpeng Scalable Nearest Neighbors (KScaNN) is a vector retrieval algorithm that is based on inverted indexes. It uses the Kunpeng architecture to deeply optimize the index layout, algorithm process, and computing process, fully unleashing the chip potential. Its search capability is benchmarked against ScaNN, and it is suited for network search, multi-modal search, recommendation systems, and retrieval-augmented generation (RAG).
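As background for retrieval based on inverted indexes, the following is a minimal, generic sketch of an inverted-file search: vectors are grouped under their nearest centroid, and a query scans only the lists of the closest centroids. It is a textbook illustration, not the KScaNN implementation; the centroids are fixed by hand, and all names and the `nprobe` parameter are assumptions made for the example.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, candidates):
    """Index of the candidate closest to point."""
    return min(range(len(candidates)), key=lambda i: squared_l2(point, candidates[i]))


def build_inverted_index(vectors, centroids):
    """Map each centroid id to the ids of the vectors assigned to it."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        lists[nearest(vec, centroids)].append(vec_id)
    return lists


def search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: squared_l2(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in lists[c]]
    candidates.sort(key=lambda vid: squared_l2(query, vectors[vid]))
    return candidates[:k]


vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.8, 0.1), (10.0, 0.3)]
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]  # fixed here; normally learned by clustering
lists = build_inverted_index(vectors, centroids)
print(search((4.9, 5.0), vectors, centroids, lists, k=2, nprobe=1))  # [2, 3]
```

Raising `nprobe` scans more lists, which trades query speed for recall.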
SRA_Inference
SRA_Inference is an inference acceleration kit provided by Huawei based on the Kunpeng platform. It includes the Kunpeng TensorFlow Operator (KTFOP) library.
- KTFOP is an efficient, Huawei-developed TensorFlow operator library. It uses single instruction multiple data (SIMD) instructions and multi-core scheduling to accelerate operator processing in CPUs and reduce the usage of CPU computing resources, thereby increasing the overall end-to-end throughput of online inference. This approach is well suited for inference scenarios such as search, recommendation, and advertising.
KAIL
The Kunpeng Artificial Intelligence Library (KAIL) is a high-performance AI operator library optimized by Huawei for the Kunpeng platform. It consists of a deep neural network library and an extension library that contains the softmax and random_choice operators.
- KAIL_DNN: Based on the microarchitecture features of the Kunpeng processor, KAIL_DNN improves the performance of core DNN operators through vectorization, assembly, and algorithm optimization, and can be integrated into open source oneDNN as a plugin to provide complete capabilities. KAIL_DNN is suited for AI and HPC applications.
- KAIL_DNN_EXT: It serves as the extension library of KAIL_DNN. KAIL_DNN_EXT optimizes operators such as softmax and random_choice, and encapsulates them into a Python interface library for specific AI scenarios. KAIL_DNN_EXT is suited for AI applications.
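For readers unfamiliar with the softmax operator mentioned above, the following is a plain reference definition in Python. It is not the KAIL_DNN_EXT interface, whose function names and signatures are not described here; it only shows the computation that such an operator performs.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

Subtracting the maximum leaves the result unchanged mathematically but prevents overflow for large inputs.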

HPC

Kunpeng BoostKit for HPC focuses on key challenges such as improving resource scheduling efficiency and optimizing application performance. It builds a full-stack high-performance computing platform through architecture innovations, Huawei-developed software and hardware, optimized base software, and industry application performance tuning. This unleashes the platform's computing power, shortens product time to market (TTM), and improves the competitiveness of enterprise products.

The Donau Portal is the HPC cluster management platform. It allows GUI-based data management and software and hardware resource management. It streamlines the workflow to achieve efficient scheduling of jobs and appropriate allocation of resources, improving the compute resource utilization of the cluster.
- Desktop-style layout: Provides a WebUI in a desktop-style layout with multi-window and multi-task interaction.
- Integrated computing and design: Supports remote 2D/3D visualization based on Linux to streamline the design and computing processes.
- Resource analysis and monitoring: Analyzes cluster running history from multiple dimensions and monitors cluster resource usage in real time.
- Heterogeneous cluster management: Manages the Donau Scheduler cluster and third-party scheduler clusters at the same time, and manages data and resources in a unified manner.
The Donau Scheduler provides job scheduling with high resource utilization and throughput for large clusters.
- Ultra-large scale scheduling: Up to 3,000 nodes containing 380,000 cores can be scheduled in an ultra-large cluster.
- High throughput: High end-to-end throughput allows over 4 million jobs to run per hour.
- Efficient resource allocation: The efficient and flexible scheduling framework achieves a 90%+ resource allocation rate.
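To put the scale figures above in more familiar units, the short calculation below derives the average core count per node and the per-second job rate. It uses only the numbers quoted in the list and treats the "over 4 million" figure as exactly 4 million, so the job rate is a lower bound.

```python
nodes = 3_000
cores = 380_000
jobs_per_hour = 4_000_000

cores_per_node = cores / nodes          # average cores per node
jobs_per_second = jobs_per_hour / 3600  # sustained job rate

print(f"{cores_per_node:.1f} cores per node on average")  # 126.7
print(f"{jobs_per_second:.0f} jobs per second")           # 1111
```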
The HPCKit integrates basic HPC software that is deeply optimized for the Kunpeng platform, such as Hyper MPI, the Kunpeng Math Library (KML), and the Kunpeng BiSheng/GCC Compiler. It supports one-click deployment and optimal collaboration among these components, enabling HPC applications to reach their ultimate performance.
- Based on Open MPI 4.1.1 and Open UCX 1.10.1, Hyper MPI supports the parallel computing APIs of the MPI 3.1 standard and optimizes the collective communication framework. In addition, Hyper MPI accelerates the network for data-intensive and high-performance computing, enables a high-speed communication network and shared memory mechanism between nodes, and provides optimized collective communication algorithms. The maximum data packet length supported by the UCX COLL communication framework of Hyper MPI is 2³² bytes.
- Kunpeng Math Library (KML) is an acceleration library optimized on Huawei Kunpeng processors. It is designed to provide high-performance mathematical computing and comprises 14 sublibraries (KML_BLAS, KML_SPBLAS, KML_VML, KML_MATH, KML_FFT, KML_LAPACK, KML_SVML, KML_SOLVER, KML_JAVA, KML_SCALAPACK, KML_VSL, KML_NUMPY, KML_EIGENSOLVER, and KML_IPL). KML supports common mathematical functions such as fast Fourier transform (FFT), matrix calculation, vectorization, and trigonometric and logarithmic functions.
- The BiSheng Compiler is a high-performance compiler developed based on the open source LLVM and optimized for the Kunpeng platform. It supports the Fortran language. In addition to the general functions and optimizations of LLVM, the BiSheng Compiler optimizes key middle-end and back-end technologies and integrates the AutoTuner feature to support automatic compiler tuning.
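As background on the collective communication mentioned for Hyper MPI, the following single-process simulation shows recursive doubling, a textbook allreduce algorithm in which every rank ends up with the global sum after log2(N) exchange rounds. It is a generic illustration under the assumption of a power-of-two rank count, not the algorithm Hyper MPI implements, and it uses no MPI library.

```python
def allreduce_recursive_doubling(values):
    """Simulate a sum-allreduce across len(values) ranks (must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("rank count must be a power of two")
    current = list(values)
    step = 1
    while step < n:
        # In each round, rank r combines its partial sum with that of rank r XOR step.
        current = [current[r] + current[r ^ step] for r in range(n)]
        step *= 2
    return current


print(allreduce_recursive_doubling([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

With four ranks, two rounds are enough: partners at distance 1 exchange first, then partners at distance 2.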
