Introduction
This document describes the basic concepts and implementation principles of the TensorFlow Serving (TF Serving) Accelerated Neural Network Compiler (ANNC) feature, and explains how to install and use the feature on openEuler 22.03 LTS SP3 running on the Kunpeng 920 7282C processor.
Kunpeng BoostKit provides the ANNC optimization solution to enhance TF Serving inference performance. ANNC is a compiler dedicated to accelerating neural network computation. It focuses on technologies such as computational graph optimization, generation and integration of high-performance fused operators, and efficient code generation, which together significantly improve inference performance in recommendation scenarios. ANNC is an extended acceleration suite built on the open source Open Accelerated Linear Algebra (OpenXLA) project and hosted in the ANNC repository maintained by the openEuler community. The suite includes optimizations tailored for the Kunpeng platform, such as TensorFlow graph fusion, Accelerated Linear Algebra (XLA) graph fusion, and operator optimization.
The ANNC optimization feature integrates with the TensorFlow inference framework and XLA through compilation options and code patches. The following new features are introduced for TF Serving/TensorFlow 2.15:
- TensorFlow graph fusion: fusion and rewriting of graphs at the TensorFlow model level.
- XLA graph fusion: XLA graph fusion enhanced by ANNC.
- Operator optimization: ANNC-driven operator optimization.
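As a sketch of how these compiler paths are typically switched on at serve time: recent tensorflow_model_server builds expose an `--xla_cpu_compilation_enabled` flag that routes model execution through XLA on the CPU (the model name and path below are hypothetical placeholders, and the exact flag set of an ANNC-patched build may differ):

```shell
# Launch TF Serving with XLA CPU compilation enabled, so the loaded model
# is lowered through XLA (and, in an ANNC-patched build, through the
# ANNC-enhanced graph-fusion and code-generation passes).
# "ranking" and /models/ranking are hypothetical placeholders.
tensorflow_model_server \
  --model_name=ranking \
  --model_base_path=/models/ranking \
  --rest_api_port=8501 \
  --xla_cpu_compilation_enabled=true
```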
OpenXLA is an open ecosystem consisting of high-performance, portable, and scalable machine learning infrastructure components.
XLA is an open source compiler for machine learning. It optimizes TensorFlow models so that they execute efficiently across various hardware platforms, including CPUs, GPUs, and machine learning accelerators.
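For illustration, XLA compilation can be requested directly in TensorFlow by marking a function for just-in-time compilation with the standard `jit_compile` argument (a minimal sketch; the function and tensor shapes are arbitrary examples, not part of the ANNC feature itself):

```python
import tensorflow as tf

# Ask TensorFlow to compile this function with XLA rather than running it
# op-by-op; XLA can then fuse the matmul, add, and relu into one kernel.
@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.ones((2, 3))
w = tf.ones((3, 4))
b = tf.zeros((4,))
y = dense_relu(x, w, b)  # compiled on first call, cached afterwards
print(y.shape)  # (2, 4); each element is relu(1*1*3 + 0) = 3.0
```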
Software Architecture
For details about the architecture, see Figure 1. For details about the module functions, see Table 1.
Table 1 Module description

| Module | Description |
|---|---|
| TF Serving | Dedicated, high-performance inference server optimized for TensorFlow model deployment |
| SavedModel | TensorFlow's standardized model format enabling seamless model import, inference, and retraining across diverse TensorFlow implementations |
| Graph Fusion | ANNC graph fusion module |
| TensorFlow | Open source machine learning framework specializing in deep learning model training and inference |
| ANNC | AI compiler optimized for machine learning models, which compiles models into high-performance executable code |
| XLA Extension | ANNC XLA extension |
| XLA | Open source compiler for machine learning |
| Kernels | TensorFlow operator implementations |
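The SavedModel format mentioned above is what TF Serving loads from the model base path. A minimal sketch of exporting one with the standard `tf.saved_model.save` API (the `Adder` module and the export path are hypothetical examples):

```python
import os
import tempfile

import tensorflow as tf

# A trivial module with a traced signature, standing in for a real model.
class Adder(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def __call__(self, x):
        return x + 1.0

# TF Serving expects <model_base_path>/<version>/ on disk; "1" is the version.
export_dir = os.path.join(tempfile.mkdtemp(), "adder", "1")
tf.saved_model.save(Adder(), export_dir)
# export_dir now contains saved_model.pb plus the variables/ directory.
```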
Application Scenarios
The TensorFlow Serving ANNC feature is mainly used in recommendation systems and advertising delivery. It can greatly improve inference performance for coarse-ranking models in high-concurrency scenarios, boosting throughput while significantly reducing latency.
