Feature Description

This section describes the hnswlib features optimized for the Kunpeng platform and explains how these optimizations differ from the original hnswlib implementation. It covers vectorization, support for the FP16 data type, prefetch optimization, and instruction-reordering optimization. Except for the API differences noted in this section, all other APIs remain compatible with the open source hnswlib APIs.

Vectorization

Distance computation has been vectorized by using SVE and NEON instructions in place of the original SSE and AVX implementations.

For example, L2 distance is computed in parallel over 16 float values. The Arm-based NEON implementation uses vld1 to load four FP32 values at a time, and the loop is unrolled four times to process 16 FP32 values per iteration.

FP16 Support

Distance computation has been extended to support the FP16 data type.

On the Kunpeng platform, this is implemented with NEON instructions. vld1 loads four FP16 values at a time, each is converted to FP32 (float) for computation, and the loop is unrolled four times to process 16 FP16 values per iteration.

Prefetch Optimization

The hnswlib multi-layered graph search algorithm uses prefetching to reduce memory access latency. It loads node visitation states and the mapping of internal IDs to data pointers into the cache ahead of use, which improves overall search throughput.

Instruction Reordering Optimization

Node IDs in an HNSW graph are reordered to maximize local continuity of neighbor node IDs while preserving the graph topology. This reordering optimization improves cache locality and access speed.

API Differences

**Table 1** APIs for constructing an HNSW graph
Data Type	API	API Alias
FP32	HierarchicalNSWTemplat<float>	HierarchicalNSWFloat
FP16	HierarchicalNSWTemplate<float16_t>	HierarchicalNSWHalf

**Table 2** APIs for calculating distance space
Distance type	Data Type	API	API Alias
L2	FP32	L2SpaceTemplate<float>	L2SpaceF
L2	FP16	L2SpaceTemplate<float16_t>	L2SpaceHalf
IP	FP32	InnerProductSpaceTemplate<float>	InnerProductSpaceF
IP	FP16	InnerProductSpaceTemplate<float16_t>	InnerProductSpaceHalf

Except for the API differences noted in this section, all other APIs remain compatible with the open source hnswlib APIs.

Difference in Distance Computation Implementations

The difference between FP32 and FP16 distance computations is data type used for the computation. In FP32 computation, vectors are loaded as FP32 and used directly in subsequent calculations. In FP16 computation, vectors are loaded as FP16, converted to FP32, and then used for the calculations. In both cases, the final results are returned in FP32 format. For example, the NEON implementation of data loading for the two cases is as follows:

if constexpr (std::is_same_v<T, float>) {
    v1 = vld1q_f32(pVect1);
    v2 = vld1q_f32(pVect2);
} else if constexpr (std::is_same_v<T, float16_t>) {
    v1 = vcvt_f32_f16(vld1_f16(pVect1));
    v2 = vcvt_f32_f16(vld1_f16(pVect2)); 
}

Parent topic: hnswlib Description