
Integrating KDNN

Kunpeng Deep Neural Network Library (KDNN) is integrated to reduce the latency of neural network (NN) operators and significantly improve model inference performance. This section describes how to integrate KDNN into a benchmark framework.

KDNN is a high-performance AI operator library optimized for the Kunpeng platform. These optimizations are delivered by integrating operators such as MatMul, FusedMatMul, and SparseMatMul into TensorFlow.

  1. Obtain the GCC build of the KDNN software package and decompress the ZIP file to obtain the RPM installation package.
  2. Install the KDNN RPM package.
    rpm -ivh boostcore-kdnn-xxxx.aarch64.rpm
    

    The header file installation directory is /usr/local/kdnn/include, and the library file installation directories are /usr/local/kdnn/lib/threadpool and /usr/local/kdnn/lib/omp.

    In the preceding command, xxxx indicates the version.
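    As an optional sanity check (not part of the official procedure), you can confirm that the default install locations listed above exist before proceeding:

    ```shell
    # Optional check: confirm the default KDNN install locations exist
    for d in /usr/local/kdnn/include /usr/local/kdnn/lib/threadpool /usr/local/kdnn/lib/omp; do
      if [ -d "$d" ]; then echo "found:   $d"; else echo "missing: $d"; fi
    done
    ```

    A "missing" line usually means the RPM was installed to a non-default prefix or the installation failed.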

  3. Copy the KDNN header files and the static library to the /path/to/tensorflow/third_party/KDNN directory.
    export TF_PATH=/path/to/tensorflow
    mkdir -p $TF_PATH/third_party/KDNN/src
    cp -r /usr/local/kdnn/include $TF_PATH/third_party/KDNN
    cp -r /usr/local/kdnn/lib/threadpool/libkdnn.a $TF_PATH/third_party/KDNN/src
  4. Go to the KDNN directory and apply the header file patch to fix TensorFlow's exception handling limitation.
    cd $TF_PATH/third_party/KDNN
    patch -p0 < $TF_PATH/third_party/KDNN/tensorflow_kdnn_include_adapter.patch
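    If you want to preview what a patch will change before applying it, patch supports a --dry-run mode. The snippet below demonstrates this on a throwaway file; demo.h and demo.patch are illustrative names, not files shipped with KDNN.

    ```shell
    # Illustrative only: preview a patch with --dry-run, then apply it for real
    cd "$(mktemp -d)"
    printf 'old line\n' > demo.h
    printf -- '--- demo.h\n+++ demo.h\n@@ -1 +1 @@\n-old line\n+new line\n' > demo.patch
    patch -p0 --dry-run < demo.patch   # reports what would change, touches nothing
    patch -p0 < demo.patch             # applies the change
    cat demo.h
    ```

    The same --dry-run flag can be added to the command in this step to verify the KDNN patch applies cleanly before modifying any files.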
  5. Run the build script to compile the code.
    cd /path/to/serving
    sh compile_serving.sh --tensorflow_dir /path/to/tensorflow --features gcc12,kdnn
    
  6. Verify the integration.
    1. Start the server.
      numactl -N 0 /path/to/serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=8889 --model_name=deepfm --model_base_path=/path/to/model_zoo/models/deepfm --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=-1 --xla_cpu_compilation_enabled=true
      

      numactl -N 0: runs the program on the CPUs of NUMA node 0 (-N binds CPU execution to the node; use -m to also bind memory allocation).

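      Before choosing a node for numactl -N, you can list the NUMA nodes the host exposes. A minimal Linux-specific sketch that reads sysfs directly:

      ```shell
      # List the NUMA nodes exposed by the kernel, then the CPUs on node 0
      ls -d /sys/devices/system/node/node* | xargs -n1 basename
      cat /sys/devices/system/node/node0/cpulist
      ```

      The cpulist output (for example, 0-31) is the same value the client command below reads to pin the container to node 0's cores.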
    2. Start the performance test on the client.
      docker run -it --rm --cpuset-cpus="$(cat /sys/devices/system/node/node0/cpulist)" --cpuset-mems="0" --net host nvcr.io/nvidia/tritonserver:24.05-py3-sdk perf_analyzer --concurrency-range 28:28:1 -p 8000 -f perf.csv -m deepfm --service-kind tfserving -i grpc --request-distribution poisson -b 128 -u localhost:8889 --percentile 99 --input-data=random
      

      --cpuset-cpus: limits the container's processes to execute on the specified CPU cores.

      --cpuset-mems: specifies the memory node bound to the container.

      After the stress test is started, the server prints "KDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_KDNN_OPTS=0`." If this message appears, KDNN has been enabled successfully.

      KDNN is enabled by default. You can disable KDNN by setting the environment variable TF_ENABLE_KDNN_OPTS to 0.
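      As a quick sketch, the variable can be exported in the shell that launches the server; unset it to restore the default (enabled) behavior.

      ```shell
      # Disable KDNN custom operations for processes launched from this shell
      export TF_ENABLE_KDNN_OPTS=0
      echo "TF_ENABLE_KDNN_OPTS=$TF_ENABLE_KDNN_OPTS"
      ```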