Kunpeng Storage Acceleration Library

Overview

The Kunpeng Storage Acceleration Library (KSAL), developed by Huawei, provides erasure coding (EC), CRC16, and CRC32 algorithms.

The EC algorithm uses an isomorphism mapping to replace the high-order finite field GF(2^w) multiplication required by EC with binary matrix multiplication, so that exclusive OR (XOR) operations take the place of complex finite field multiplication implemented with lookup tables. It also uses an orchestration algorithm to reuse intermediate results during parity block calculation, which reduces the number of XOR operations, and it accelerates coding further with Kunpeng vectorized instructions.
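The mapping from finite field multiplication to binary matrix multiplication can be sketched as follows. This is an illustration, not KSAL code: it assumes w = 8 and the field polynomial 0x11d, and the names gf256_mul, gf256_bitmatrix, bitmatrix_apply, and bitmatrix_selfcheck are hypothetical. Multiplying by a constant c is linear over GF(2), so it can be written as an 8x8 binary matrix whose column j is c * x^j; applying the matrix then needs only XOR operations.

```c
#include <assert.h>
#include <stdint.h>

/* Reference multiplication in GF(2^8), reduced by x^8 + x^4 + x^3 + x^2 + 1
 * (0x11d, an assumption for this sketch), one bit of b at a time. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        b >>= 1;
        /* a = a * x, folding the overflow bit back in with the polynomial */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
    }
    return r;
}

/* Build the 8x8 binary matrix for "multiply by c": column j holds c * x^j. */
static void gf256_bitmatrix(uint8_t c, uint8_t cols[8])
{
    for (int j = 0; j < 8; j++)
        cols[j] = gf256_mul(c, (uint8_t)(1u << j));
}

/* Matrix-vector product over GF(2): XOR the columns selected by the bits of d. */
static uint8_t bitmatrix_apply(const uint8_t cols[8], uint8_t d)
{
    uint8_t r = 0;
    for (int j = 0; j < 8; j++)
        if ((d >> j) & 1)
            r ^= cols[j];
    return r;
}

/* Check that the XOR-only form agrees with field multiplication for all inputs. */
static void bitmatrix_selfcheck(void)
{
    for (int c = 0; c < 256; c++) {
        uint8_t cols[8];
        gf256_bitmatrix((uint8_t)c, cols);
        for (int d = 0; d < 256; d++)
            assert(bitmatrix_apply(cols, (uint8_t)d) == gf256_mul((uint8_t)c, (uint8_t)d));
    }
}
```

In a bit-matrix coder the same XOR pattern is applied to whole data regions rather than single bytes, which is what makes the approach suitable for vector instructions.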

The CRC16 and CRC32 algorithms are optimized based on the principles of a large-number modulo algorithm and replace the standard CRC16 and CRC32 algorithms, respectively. They have better affinity with Kunpeng processors, which improves system performance.

Technical Principles

EC

Algorithm principles:

Vectorized coding replaces the high-order finite field multiplication of traditional scalar coding (Jerasure, ISA-L), which requires lookup tables (LUTs) and more than 10 instructions per operation, with low-order binary XOR operations that can be accelerated through vectorization. Coding orchestration then reuses intermediate results to reduce the number of operations, which greatly improves EC coding performance.

Figure 1 Multi-level acceleration via vectorized coding
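The effect of reusing intermediate results can be sketched with a toy example. The parity equations p0 = d0 ^ d1 ^ d2 and p1 = d0 ^ d1 ^ d3 and the names xor_region, encode_direct, and encode_reuse are made up for illustration and are not KSAL's schedule or API. Computing the shared term d0 ^ d1 once reduces four region XORs to three.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* dst[i] = a[i] ^ b[i]; a plain word loop that compilers can typically vectorize. */
static void xor_region(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n)
{
    assert(dst && a && b);
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] ^ b[i];
}

/* Direct evaluation of both parity equations: four region XORs. */
static void encode_direct(const uint64_t *d[4], uint64_t *p0, uint64_t *p1, size_t n)
{
    xor_region(p0, d[0], d[1], n);
    xor_region(p0, p0, d[2], n);
    xor_region(p1, d[0], d[1], n);
    xor_region(p1, p1, d[3], n);
}

/* Orchestrated evaluation: d0 ^ d1 is computed once into tmp and reused,
 * so only three region XORs are needed. */
static void encode_reuse(const uint64_t *d[4], uint64_t *p0, uint64_t *p1,
                         uint64_t *tmp, size_t n)
{
    xor_region(tmp, d[0], d[1], n);
    xor_region(p0, tmp, d[2], n);
    xor_region(p1, tmp, d[3], n);
}
```

Both functions produce identical parity blocks; the saving grows with the number of parity equations that share terms.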

Running context:

Plog client (index side)

Interfaces:

EC encoding, decoding, reconstruction, and degraded read interfaces

CRC16 verification

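Let M(x) be the polynomial of the data information sequence and P(x) the primitive polynomial. In the standard formulation, the CRC is the remainder obtained when M(x), multiplied by x^n where n is the degree of P(x), is divided by P(x) using modulo-2 arithmetic.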

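The large-number modulo principle can be illustrated with integers. Assume that the dividend is 10 and the divisor is 3. Because 6 is a multiple of 3, you can divide 10 by 6 first and then divide the remainder by 3: the remainder of 10 divided by 6 is 4, and the remainder of 4 divided by 3 is 1, which equals the remainder of 10 divided by 3. In the same way, M(x) can first be reduced modulo a polynomial Q(x) that is a multiple of P(x), and the result can then be reduced modulo P(x).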
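Written as formulas, the integer example carries over to polynomials as shown below. This is a sketch that assumes the conventional CRC definition with all arithmetic over GF(2); the symbols n, G(x), A(x), and R(x) are introduced here for illustration only, and Q(x) is taken to be a multiple of P(x).

```latex
\begin{align*}
\mathrm{CRC}\bigl(M(x)\bigr) &= x^{n} M(x) \bmod P(x), \qquad n = \deg P(x) \\
Q(x) &= G(x)\,P(x) \\
M(x) &= A(x)\,Q(x) + R(x), \qquad R(x) = M(x) \bmod Q(x) \\
M(x) \bmod P(x) &= \bigl(A(x)\,G(x)\,P(x) + R(x)\bigr) \bmod P(x) = R(x) \bmod P(x)
\end{align*}
```

With M = 10, Q = 6, and P = 3 in the integer analogy, this gives 10 mod 3 = (10 mod 6) mod 3 = 1.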

Because there are multiple ways to perform this calculation, the choice can be narrowed down based on the following criteria:

The number of non-zero terms of Q(x) corresponds to the algorithm complexity.
The order of Q(x) corresponds to the number of registers required by the algorithm.
The difference between the order of the highest-order term and the order of the second highest-order term of Q(x) corresponds to the degree of parallelism required by the algorithm.

Based on the application scenario, you can select a proper method according to the preceding criteria.
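The two-step reduction and the first two criteria can be tried out with a small sketch. It assumes that Q(x) is a multiple of P(x) and stores each polynomial over GF(2) as a bit mask in a uint64_t (bit i is the coefficient of x^i). All function names are hypothetical, and the CRC-16/CCITT polynomial x^16 + x^12 + x^5 + 1 (0x11021) used in the usage note is only an example, not necessarily the polynomial KSAL uses.

```c
#include <assert.h>
#include <stdint.h>

/* Degree (order) of a polynomial; -1 for the zero polynomial. */
static int poly_degree(uint64_t p)
{
    int d = -1;
    while (p) { d++; p >>= 1; }
    return d;
}

/* Number of non-zero terms of a polynomial. */
static int poly_terms(uint64_t p)
{
    int n = 0;
    while (p) { n += (int)(p & 1u); p >>= 1; }
    return n;
}

/* Carry-less product a(x) * b(x); the caller keeps deg(a) + deg(b) below 64. */
static uint64_t poly_mul(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (; b; b >>= 1, a <<= 1)
        if (b & 1u)
            r ^= a;
    return r;
}

/* Remainder m(x) mod p(x) by schoolbook long division over GF(2). */
static uint64_t poly_mod(uint64_t m, uint64_t p)
{
    assert(p != 0);
    int dp = poly_degree(p);
    for (int dm = poly_degree(m); dm >= dp; dm = poly_degree(m))
        m ^= p << (dm - dp);   /* cancel the current leading term */
    return m;
}

/* Two-step reduction: first modulo Q(x), then modulo P(x). */
static uint64_t poly_mod_two_step(uint64_t m, uint64_t q, uint64_t p)
{
    return poly_mod(poly_mod(m, q), p);
}
```

For example, with P = 0x11021 and Q = poly_mul(P, 0x9), poly_mod_two_step(m, Q, P) equals poly_mod(m, P) for any m, while poly_terms(Q) and poly_degree(Q) give the quantities that the first two criteria compare across candidate choices of Q(x).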

CRC32 verification

Algorithm principles:

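#ifdef __aarch64__
/* Each macro folds `value` into the running 32-bit `crc` with one ARMv8 CRC32 instruction. */
#define CRC32D(crc, value) __asm__("crc32x %w[c], %w[c], %x[v]":[c]"+r"(crc):[v]"r"(value)) /* 64-bit doubleword */
#define CRC32W(crc, value) __asm__("crc32w %w[c], %w[c], %w[v]":[c]"+r"(crc):[v]"r"(value)) /* 32-bit word */
#define CRC32H(crc, value) __asm__("crc32h %w[c], %w[c], %w[v]":[c]"+r"(crc):[v]"r"(value)) /* 16-bit halfword */
#define CRC32B(crc, value) __asm__("crc32b %w[c], %w[c], %w[v]":[c]"+r"(crc):[v]"r"(value)) /* 8-bit byte */
#endif /* __aarch64__ */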
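A minimal sketch of how macros like these might be driven over a buffer is shown below. The function crc32_buf, the software fallback crc32_soft_byte, the __ARM_FEATURE_CRC32 guard, and the initial and final bit inversion (the common CRC-32 convention) are assumptions for illustration and not KSAL's API. The fallback lets the sketch build on machines without the ARMv8 CRC32 instructions and assumes a little-endian host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__ARM_FEATURE_CRC32)
/* Hardware path: same instruction forms as the CRC32D and CRC32B macros. */
#define CRC32D(crc, value) __asm__("crc32x %w[c], %w[c], %x[v]" : [c] "+r"(crc) : [v] "r"(value))
#define CRC32B(crc, value) __asm__("crc32b %w[c], %w[c], %w[v]" : [c] "+r"(crc) : [v] "r"(value))
#else
/* Portable stand-in (hypothetical): bitwise update with the reflected
 * CRC-32 polynomial 0xEDB88320, one byte at a time. */
static uint32_t crc32_soft_byte(uint32_t crc, uint8_t b)
{
    crc ^= b;
    for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    return crc;
}
#define CRC32B(crc, value) ((crc) = crc32_soft_byte((crc), (uint8_t)(value)))
/* Feeds the 8 bytes of value lowest byte first (little-endian host assumed). */
#define CRC32D(crc, value)                      \
    do {                                        \
        uint64_t v_ = (value);                  \
        for (int i_ = 0; i_ < 8; i_++) {        \
            CRC32B(crc, v_ & 0xff);             \
            v_ >>= 8;                           \
        }                                       \
    } while (0)
#endif

/* Update crc over buf: 8 bytes per step while possible, then the tail bytes. */
static uint32_t crc32_buf(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;
    assert(buf != NULL || len == 0);
    crc = ~crc;                       /* conventional pre-inversion */
    while (len >= 8) {
        uint64_t v;
        memcpy(&v, p, 8);             /* safe unaligned load */
        CRC32D(crc, v);
        p += 8;
        len -= 8;
    }
    while (len--) {
        CRC32B(crc, *p++);
    }
    return ~crc;                      /* conventional post-inversion */
}
```

Under these assumptions, crc32_buf(0, "123456789", 9) returns 0xCBF43926, the standard CRC-32 check value, and a buffer can be processed in pieces by passing the previous result as the next initial crc.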

Running context:

Message verification and data consistency verification

Expected Results

EC

Compared with mainstream open source EC algorithms, the average coding throughput is doubled.

CRC16 verification

Compared with mainstream open source CRC16 algorithms, this algorithm doubles the verification performance for 4 KB blocks.

CRC32 verification

The CPU computing power consumed by a single I/O operation is reduced by more than 50%, and the overall gain is estimated to be 3%. For block sizes of 4 KB, 8 KB, 64 KB, 256 KB, and 1 MB, the performance is twice that of ceph_crc32c_sctp and 1.2 times that of ceph_crc32_sctp.
