Introduction

This document describes the basic concepts and implementation principles of the TensorFlow Serving (TF Serving) Accelerated Neural Network Compiler (ANNC) feature. It also explains how to install and use the feature on openEuler 22.03 LTS SP3 running on the Kunpeng 920 7282C processor.

Kunpeng BoostKit provides this ANNC optimization solution to enhance TF Serving inference performance. ANNC is a compiler dedicated to accelerating neural network computing. It focuses on computational graph optimization, generation and integration of high-performance fused operators, and efficient code generation. These capabilities significantly improve inference performance in recommendation scenarios. ANNC is an extended acceleration suite built on the open source Open Accelerated Linear Algebra (OpenXLA) project and hosted in the ANNC repository maintained by the openEuler community. The suite includes optimizations tailored for the Kunpeng platform, such as TensorFlow graph fusion, Accelerated Linear Algebra (XLA) graph fusion, and operator optimization.

The ANNC optimization feature integrates with the TensorFlow inference framework and XLA through compilation options and code patches. The following new features are introduced for TF Serving/TensorFlow 2.15:

  • TensorFlow graph fusion: fusion and rewriting of graphs at the TensorFlow model level.
  • XLA graph fusion: XLA graph fusion enhanced by ANNC.
  • Operator optimization: ANNC-driven operator optimization.
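To make the idea of graph fusion concrete, the following is a minimal sketch of a fusion pass over a toy graph. It is not ANNC's implementation: the Node class, the fused operator name FusedMatMulBiasRelu, and the MatMul -> BiasAdd -> Relu pattern are all assumptions chosen for illustration. The pass replaces a chain of three operators with a single fused node, but only when the intermediate results have no other consumers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical graph node: a name, an operator type, and input node names."""
    name: str
    op: str
    inputs: List[str]


def fuse_matmul_bias_relu(nodes: List[Node]) -> List[Node]:
    """Rewrite each MatMul -> BiasAdd -> Relu chain into one fused node.

    Assumes `nodes` is in topological order. The fused operator name is
    illustrative only and does not correspond to a real ANNC kernel.
    """
    by_name: Dict[str, Node] = {n.name: n for n in nodes}

    # Count how many nodes consume each output, so that an intermediate
    # result needed elsewhere is never fused away.
    consumers: Dict[str, int] = {}
    for n in nodes:
        for src in n.inputs:
            consumers[src] = consumers.get(src, 0) + 1

    dropped = set()
    rewritten: Dict[str, Node] = {}
    for relu in nodes:
        if relu.op != "Relu" or not relu.inputs:
            continue
        bias = by_name.get(relu.inputs[0])
        if bias is None or bias.op != "BiasAdd" or not bias.inputs:
            continue
        if consumers.get(bias.name) != 1:
            continue
        matmul = by_name.get(bias.inputs[0])
        if matmul is None or matmul.op != "MatMul":
            continue
        if consumers.get(matmul.name) != 1:
            continue
        # Keep the Relu's name so downstream nodes still resolve their input.
        rewritten[relu.name] = Node(
            relu.name, "FusedMatMulBiasRelu", matmul.inputs + bias.inputs[1:]
        )
        dropped.update({bias.name, matmul.name})

    return [rewritten.get(n.name, n) for n in nodes if n.name not in dropped]


graph = [
    Node("x", "Placeholder", []),
    Node("w", "Const", []),
    Node("b", "Const", []),
    Node("mm", "MatMul", ["x", "w"]),
    Node("ba", "BiasAdd", ["mm", "b"]),
    Node("act", "Relu", ["ba"]),
    Node("out", "Identity", ["act"]),
]
fused = fuse_matmul_bias_relu(graph)
print([n.op for n in fused])
# ['Placeholder', 'Const', 'Const', 'FusedMatMulBiasRelu', 'Identity']
```

In this toy graph, seven nodes shrink to five. Executing one fused node instead of three typically avoids materializing the intermediate tensors, which is the general motivation behind fusion passes of this kind.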

OpenXLA is an open ecosystem consisting of high-performance, portable, and scalable machine learning infrastructure components.

XLA is an open source compiler for machine learning. It optimizes models from frameworks such as TensorFlow to enable efficient execution across various hardware platforms, including GPUs, CPUs, and machine learning accelerators.

Software Architecture

For details about the architecture, see Figure 1. For details about the module functions, see Table 1.

Figure 1 TF Serving software architecture
Table 1 TF Serving module functions

| Module | Description |
| --- | --- |
| TF Serving | Dedicated, high-performance inference server optimized for TensorFlow model deployment |
| SavedModel | TensorFlow's standardized model format enabling seamless model import, inference, and retraining across diverse TensorFlow implementations |
| Graph Fusion | ANNC graph fusion module |
| TensorFlow | Open source machine learning framework specializing in deep learning model training and inference |
| ANNC | AI compiler optimized for machine learning models, which can compile models into high-performance executable code |
| XLA Extension | ANNC XLA extension |
| XLA | Open source compiler for machine learning |
| Kernels | TensorFlow operator implementation |
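
As a rough illustration of how a client might query the TF Serving module listed in Table 1, the sketch below builds a request for the TF Serving REST predict endpoint (/v1/models/<model>:predict, row format with an "instances" list). It assumes the deployment exposes the REST API at all. The host, the port 8501 (the value commonly passed to --rest_api_port in TF Serving examples), the model name ranking_model, and the feature names are hypothetical. The request is only constructed here, not sent.

```python
import json
from urllib.request import Request


def build_predict_request(host: str, port: int, model_name: str, instances: list) -> Request:
    """Build (but do not send) a TF Serving REST predict request.

    Uses the row ("instances") request format. All argument values passed
    below are placeholders for illustration.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical model name and input features.
req = build_predict_request(
    "localhost", 8501, "ranking_model", [{"user_id": 1, "item_id": 42}]
)
print(req.get_method(), req.full_url)
```

Against a running server, the request could be sent with urllib.request.urlopen(req). Because ANNC is integrated through compilation options and code patches, a client request of this shape would not be expected to change when the feature is enabled.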

Application Scenarios

The TensorFlow Serving ANNC feature is mainly used in recommendation systems and advertising delivery. It can greatly improve inference performance for coarse-ranking models in high-concurrency scenarios, boosting throughput while significantly reducing latency.