Feature Description
This section describes the hnswlib features optimized for the Kunpeng platform and explains how these optimizations differ from the original hnswlib implementation. It covers vectorization, support for the FP16 data type, prefetch optimization, and instruction-reordering optimization. Except for the API differences noted in this section, all other APIs remain compatible with the open source hnswlib APIs.
Vectorization
Distance computation has been vectorized by using SVE and NEON instructions in place of the original SSE and AVX implementations.
For example, L2 distance is computed in parallel over 16 float values. The Arm-based NEON implementation uses vld1 to load four FP32 values at a time, and the loop is unrolled four times to process 16 FP32 values per iteration.
FP16 Support
Distance computation has been extended to support the FP16 data type.
On the Kunpeng platform, this is implemented with NEON instructions. vld1 loads four FP16 values at a time, each is converted to FP32 (float) for computation, and the loop is unrolled four times to process 16 FP16 values per iteration.
Prefetch Optimization
The hnswlib multi-layered graph search algorithm uses prefetching to reduce memory access latency. It loads node visitation states and the mapping of internal IDs to data pointers into the cache ahead of use, which improves overall search throughput.
Instruction Reordering Optimization
Node IDs in an HNSW graph are reordered to maximize local continuity of neighbor node IDs while preserving the graph topology. This reordering optimization improves cache locality and access speed.
API Differences
Data Type |
API |
API Alias |
|---|---|---|
FP32 |
HierarchicalNSWTemplat<float> |
HierarchicalNSWFloat |
FP16 |
HierarchicalNSWTemplate<float16_t> |
HierarchicalNSWHalf |
Distance type |
Data Type |
API |
API Alias |
|---|---|---|---|
L2 |
FP32 |
L2SpaceTemplate<float> |
L2SpaceF |
FP16 |
L2SpaceTemplate<float16_t> |
L2SpaceHalf |
|
IP |
FP32 |
InnerProductSpaceTemplate<float> |
InnerProductSpaceF |
FP16 |
InnerProductSpaceTemplate<float16_t> |
InnerProductSpaceHalf |
Except for the API differences noted in this section, all other APIs remain compatible with the open source hnswlib APIs.
Difference in Distance Computation Implementations
if constexpr (std::is_same_v<T, float>) {
v1 = vld1q_f32(pVect1);
v2 = vld1q_f32(pVect2);
} else if constexpr (std::is_same_v<T, float16_t>) {
v1 = vcvt_f32_f16(vld1_f16(pVect1));
v2 = vcvt_f32_f16(vld1_f16(pVect2));
}