Introduction
This document describes the basic concepts and implementation principles of the TensorFlow Serving (TF Serving) Accelerated Neural Network Compiler (ANNC) feature, and explains how to install and use the feature on openEuler 22.03 LTS SP3 running on the Kunpeng 920 7282C processor.
Kunpeng BoostKit provides the ANNC optimization solution to enhance TF Serving inference performance. ANNC is a compiler dedicated to accelerating neural network computation. It focuses on technologies such as computational graph optimization, generation and integration of high-performance fused operators, and efficient code generation, which together significantly improve inference performance in recommendation scenarios. ANNC is an extended acceleration suite built on the open source Open Accelerated Linear Algebra (OpenXLA) project and hosted in the ANNC repository maintained by the openEuler community. The suite includes optimizations tailored for the Kunpeng platform, such as TensorFlow graph fusion, Accelerated Linear Algebra (XLA) graph fusion, and operator optimization.
The ANNC optimization feature integrates with the TensorFlow inference framework and XLA through compilation options and code patches. The following new features are introduced for TF Serving/TensorFlow 2.15:
- TensorFlow graph fusion: fusion and rewriting of graphs at the TensorFlow model level.
- XLA graph fusion: XLA graph fusion enhanced by ANNC.
- Operator optimization: ANNC-driven operator optimization.
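As a sketch of how these compiler paths are typically switched on at serve time: recent tensorflow_model_server builds expose an `--xla_cpu_compilation_enabled` flag that routes model execution through XLA on the CPU (the model name and path below are hypothetical placeholders, and the exact flag set of an ANNC-patched build may differ):

```shell
# Launch TF Serving with XLA CPU compilation enabled, so the loaded model
# is lowered through XLA (and, in an ANNC-patched build, through the
# ANNC-enhanced graph-fusion and code-generation passes).
# "ranking" and /models/ranking are hypothetical placeholders.
tensorflow_model_server \
  --model_name=ranking \
  --model_base_path=/models/ranking \
  --rest_api_port=8501 \
  --xla_cpu_compilation_enabled=true
```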
OpenXLA is an open ecosystem consisting of high-performance, portable, and scalable machine learning infrastructure components.
XLA is an open source compiler for machine learning. It optimizes TensorFlow models so that they execute efficiently across various hardware platforms, including CPUs, GPUs, and machine learning accelerators.
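For illustration, XLA compilation can be requested directly in TensorFlow by marking a function for just-in-time compilation with the standard `jit_compile` argument (a minimal sketch; the function and tensor shapes are arbitrary examples, not part of the ANNC feature itself):

```python
import tensorflow as tf

# Ask TensorFlow to compile this function with XLA rather than running it
# op-by-op; XLA can then fuse the matmul, add, and relu into one kernel.
@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.ones((2, 3))
w = tf.ones((3, 4))
b = tf.zeros((4,))
y = dense_relu(x, w, b)  # compiled on first call, cached afterwards
print(y.shape)  # (2, 4); each element is relu(1*1*3 + 0) = 3.0
```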
Software Architecture
For details about the architecture, see Figure 1. For details about the module functions, see Table 1.
Table 1 Module description

| Module | Description |
|---|---|
| TF Serving | Dedicated, high-performance inference server optimized for TensorFlow model deployment |
| SavedModel | TensorFlow's standardized model format enabling seamless model import, inference, and retraining across diverse TensorFlow implementations |
| Graph Fusion | ANNC graph fusion module |
| TensorFlow | Open source machine learning framework specializing in deep learning model training and inference |
| ANNC | AI compiler optimized for machine learning models, which compiles models into high-performance executable code |
| XLA Extension | ANNC XLA extension |
| XLA | Open source compiler for machine learning |
| Kernels | TensorFlow operator implementations |
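The SavedModel format mentioned above is what TF Serving loads from the model base path. A minimal sketch of exporting one with the standard `tf.saved_model.save` API (the `Adder` module and the export path are hypothetical examples):

```python
import os
import tempfile

import tensorflow as tf

# A trivial module with a traced signature, standing in for a real model.
class Adder(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def __call__(self, x):
        return x + 1.0

# TF Serving expects <model_base_path>/<version>/ on disk; "1" is the version.
export_dir = os.path.join(tempfile.mkdtemp(), "adder", "1")
tf.saved_model.save(Adder(), export_dir)
# export_dir now contains saved_model.pb plus the variables/ directory.
```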
Application Scenarios
The TensorFlow Serving ANNC feature is mainly used in recommendation systems and advertising delivery. It can greatly improve inference performance for coarse-ranking models in high-concurrency scenarios, boosting throughput while significantly reducing latency.
