
Feature Description

This section describes the hnswlib features optimized for the Kunpeng platform and explains how these optimizations differ from the original hnswlib implementation. It covers vectorization, support for the FP16 data type, prefetch optimization, and instruction-reordering optimization. Except for the API differences noted in this section, all other APIs remain compatible with the open source hnswlib APIs.

Vectorization

Distance computation has been vectorized by using SVE and NEON instructions in place of the original SSE and AVX implementations.

For example, L2 distance is computed in parallel over 16 float values. The Arm-based NEON implementation uses vld1 to load four FP32 values at a time, and the loop is unrolled four times to process 16 FP32 values per iteration.
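As an illustration, the unrolled loop structure can be sketched in scalar C++. In the actual NEON kernel, each inner group of four floats below corresponds to a single vld1q_f32 load followed by a vector multiply-accumulate; this sketch only mirrors the structure and is not the shipped implementation.

```cpp
#include <cstddef>

// Scalar sketch of the 4x-unrolled L2 distance loop. In the NEON version,
// each group of 4 floats is one vld1q_f32 load plus one vector
// multiply-accumulate; four such groups process 16 floats per iteration.
// Assumes dim is a multiple of 16 for brevity (a real kernel handles the tail).
float L2SqrUnrolled(const float* v1, const float* v2, std::size_t dim) {
    float acc[4] = {0.f, 0.f, 0.f, 0.f};  // one accumulator per unrolled group
    for (std::size_t i = 0; i < dim; i += 16) {
        for (std::size_t g = 0; g < 4; ++g) {      // 4x unrolled: 16 floats/iter
            for (std::size_t l = 0; l < 4; ++l) {  // one 4-lane "vector" op
                float d = v1[i + 4 * g + l] - v2[i + 4 * g + l];
                acc[g] += d * d;
            }
        }
    }
    return acc[0] + acc[1] + acc[2] + acc[3];  // horizontal reduction
}
```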

FP16 Support

Distance computation has been extended to support the FP16 data type.

On the Kunpeng platform, this is implemented with NEON instructions. vld1 loads four FP16 values at a time, each is converted to FP32 (float) for computation, and the loop is unrolled four times to process 16 FP16 values per iteration.
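To illustrate the conversion step, the following portable scalar sketch widens IEEE 754 half-precision bits to FP32 before accumulating, which is what the NEON kernel does with vcvt_f32_f16 on four lanes at once. The HalfToFloat helper is an illustrative stand-in that handles only zero and normal values (no subnormals, infinities, or NaNs).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative scalar widening of an IEEE 754 binary16 value to float.
// Handles zero and normal values only. The NEON kernel instead converts
// four lanes per instruction with vcvt_f32_f16.
float HalfToFloat(std::uint16_t h) {
    std::uint32_t sign = static_cast<std::uint32_t>(h >> 15) << 31;
    std::uint32_t exp  = (h >> 10) & 0x1F;  // 5-bit exponent
    std::uint32_t frac = h & 0x3FF;         // 10-bit fraction
    std::uint32_t bits;
    if (exp == 0 && frac == 0) {
        bits = sign;  // +/-0
    } else {
        bits = sign | ((exp - 15 + 127) << 23) | (frac << 13);  // rebias, widen
    }
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// FP16 L2 distance sketch: load as FP16 bits, convert to FP32, then
// accumulate in FP32, matching the computation flow described above.
float L2SqrHalf(const std::uint16_t* v1, const std::uint16_t* v2,
                std::size_t dim) {
    float sum = 0.f;
    for (std::size_t i = 0; i < dim; ++i) {
        float d = HalfToFloat(v1[i]) - HalfToFloat(v2[i]);
        sum += d * d;
    }
    return sum;
}
```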

Prefetch Optimization

The hnswlib multi-layered graph search algorithm uses prefetching to reduce memory access latency. It loads node visitation states and the mapping of internal IDs to data pointers into the cache ahead of use, which improves overall search throughput.
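The pattern can be sketched as follows. The identifiers (visited, data_ptrs) are illustrative placeholders, not the actual hnswlib member names, and the GCC/Clang builtin __builtin_prefetch stands in for the platform's prefetch instruction: while neighbor i is processed, the visited flag and data pointer of neighbor i+1 are hinted into the cache.

```cpp
#include <cstddef>
#include <vector>

// Sketch of prefetching in a neighbor-expansion loop. While neighbor i is
// processed, the visitation state and the ID->data pointer for neighbor i+1
// are hinted into cache. All names are illustrative, not hnswlib's own.
float ExpandNeighbors(const std::vector<unsigned>& neighbors,
                      std::vector<unsigned char>& visited,
                      const std::vector<const float*>& data_ptrs) {
    float checksum = 0.f;  // stand-in for real distance/priority-queue work
    for (std::size_t i = 0; i < neighbors.size(); ++i) {
        if (i + 1 < neighbors.size()) {
            unsigned next = neighbors[i + 1];
#if defined(__GNUC__) || defined(__clang__)
            __builtin_prefetch(&visited[next]);   // node visitation state
            __builtin_prefetch(data_ptrs[next]);  // ID -> data pointer target
#endif
        }
        unsigned id = neighbors[i];
        if (visited[id]) continue;
        visited[id] = 1;
        checksum += data_ptrs[id][0];  // placeholder for distance computation
    }
    return checksum;
}
```

Because prefetching is only a hint, the loop produces identical results with or without it; the benefit is reduced memory stall time on large graphs.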

Instruction Reordering Optimization

Node IDs in an HNSW graph are reordered to maximize local continuity of neighbor node IDs while preserving the graph topology. This reordering optimization improves cache locality and access speed.
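The document does not specify the exact reordering algorithm. One common approach, sketched below under that assumption, is a breadth-first relabeling in which nodes discovered together receive consecutive new IDs, so the IDs in each adjacency list tend to be numerically close and fall on fewer cache lines.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Illustrative BFS relabeling: nodes discovered together receive consecutive
// new IDs, so neighbor IDs become numerically close after reordering while
// the graph topology is unchanged. This is a generic sketch, not the
// specific algorithm used by the optimized library.
std::vector<std::size_t>
BfsRelabel(const std::vector<std::vector<std::size_t>>& adj) {
    const std::size_t n = adj.size();
    std::vector<std::size_t> new_id(n, n);  // value n means "unassigned"
    std::size_t next = 0;
    for (std::size_t start = 0; start < n; ++start) {
        if (new_id[start] != n) continue;   // already relabeled
        std::queue<std::size_t> q;
        q.push(start);
        new_id[start] = next++;
        while (!q.empty()) {
            std::size_t u = q.front();
            q.pop();
            for (std::size_t v : adj[u]) {
                if (new_id[v] == n) {
                    new_id[v] = next++;     // neighbors get nearby IDs
                    q.push(v);
                }
            }
        }
    }
    return new_id;  // new_id[old] = new label
}
```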

API Differences

Table 1 APIs for constructing an HNSW graph

Data Type | API                                | API Alias
FP32      | HierarchicalNSWTemplate<float>     | HierarchicalNSWFloat
FP16      | HierarchicalNSWTemplate<float16_t> | HierarchicalNSWHalf

Table 2 APIs for distance computation spaces

Distance Type | Data Type | API                                  | API Alias
L2            | FP32      | L2SpaceTemplate<float>               | L2SpaceF
L2            | FP16      | L2SpaceTemplate<float16_t>           | L2SpaceHalf
IP            | FP32      | InnerProductSpaceTemplate<float>     | InnerProductSpaceF
IP            | FP16      | InnerProductSpaceTemplate<float16_t> | InnerProductSpaceHalf

Difference in Distance Computation Implementations

The only difference between the FP32 and FP16 distance computations is the data type used to load the vectors. In the FP32 path, vectors are loaded as FP32 and used directly in subsequent calculations. In the FP16 path, vectors are loaded as FP16, converted to FP32, and then used for the calculations. In both cases, the final result is returned in FP32 format. For example, the NEON implementation of data loading for the two cases is as follows:
if constexpr (std::is_same_v<T, float>) {
    // FP32 path: load four FP32 values directly.
    v1 = vld1q_f32(pVect1);
    v2 = vld1q_f32(pVect2);
} else if constexpr (std::is_same_v<T, float16_t>) {
    // FP16 path: load four FP16 values, then widen them to FP32.
    v1 = vcvt_f32_f16(vld1_f16(pVect1));
    v2 = vcvt_f32_f16(vld1_f16(pVect2));
}