Description

The TensorFlow Serving ANNC feature provides three optimization functions: TensorFlow graph fusion, XLA graph fusion, and operator optimization. This section describes how to enable each function.

TensorFlow Graph Fusion

Table 1 shows how to use the TensorFlow graph fusion interface.

Table 1 TensorFlow graph fusion interface

  Command line interface: annc-opt
  Function: Enables the graph fusion feature.
  Parameters:
    • -I /path/to/save_model.pb: model before graph fusion
    • -O /path/to/new_save_model.pb: model after graph fusion
    • pass: graph fusion policy (currently, only lookup_embedding_hash is supported)
  Example:
    annc-opt -I /base_model/wide_and_deep/1/ -O /optimized_model/wide_and_deep/1/ lookup_embedding_hash
    cp -r /base_model/wide_and_deep/1/variables /optimized_model/wide_and_deep/1/
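
The two commands above (graph rewrite, then copying the variables directory) can be wrapped in a small helper. This is only a sketch: the fuse_model name and the paths are illustrative, and annc-opt is assumed to be on PATH.

```shell
# Illustrative wrapper around annc-opt (assumed to be on PATH).
# annc-opt rewrites the graph only, so the variables directory must
# still be copied from the source model afterwards.
fuse_model() {
  src="$1"
  dst="$2"
  mkdir -p "$dst" || return 1
  annc-opt -I "$src" -O "$dst" lookup_embedding_hash || return 1
  cp -r "$src/variables" "$dst/"
}

# Example (illustrative paths):
# fuse_model /base_model/wide_and_deep/1 /optimized_model/wide_and_deep/1
```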

XLA Graph Fusion

Table 2 shows the XLA graph fusion interface.

Table 2 XLA graph fusion interface

  Environment variable: ANNC_FLAGS
  Function: Enables XLA graph fusion optimization during ANNC compilation.
  Example:
    export ANNC_FLAGS="--graph-opt"
  Value range: The feature is enabled when the environment variable is set to --graph-opt.
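
As the combined setup in the serving walkthrough later in this section shows, ANNC_FLAGS accepts multiple space-separated switches, so graph fusion can be enabled together with other ANNC options. A minimal sketch:

```shell
# ANNC_FLAGS takes space-separated switches; this enables both the GEMM
# optimization (described below) and XLA graph fusion at once.
export ANNC_FLAGS="--gemm-opt --graph-opt"
printf '%s\n' "$ANNC_FLAGS"   # prints: --gemm-opt --graph-opt
```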

Operator Optimization

For details about the operator optimization interfaces, see Table 3, Table 4, and Table 5.

Table 3 Interface for redundant operator optimization

  Environment variable: ENABLE_BISHENG_GRAPH_OPT
  Function: Enables redundant operator optimization.
  Example:
    export ENABLE_BISHENG_GRAPH_OPT=""
  Value range: The feature is enabled when the environment variable is set (to any value, including an empty string).

Table 4 Interface for matrix operator optimization

  Environment variable: ANNC_FLAGS
  Function: Enables matrix (GEMM) operator optimization.
  Example:
    export ANNC_FLAGS="--gemm-opt"
  Value range: The feature is enabled when the environment variable is set to --gemm-opt.

Table 5 Interface for Softmax operator optimization

  Environment variable: XLA_FLAGS
  Function: Enables Softmax operator optimization.
  Example:
    export XLA_FLAGS="--xla_cpu_enable_xnnpack=true"
  Value range: The feature is enabled when the environment variable is set to --xla_cpu_enable_xnnpack=true.
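
The three switches in Tables 3 to 5 are set independently and can be enabled together, as the serving walkthrough later in this section does. The sketch below simply combines the example values from the tables above:

```shell
# Enable all three operator optimizations from Tables 3-5.
export ENABLE_BISHENG_GRAPH_OPT=""                  # redundant operator optimization (set, even if empty)
export ANNC_FLAGS="--gemm-opt"                      # matrix (GEMM) operator optimization
export XLA_FLAGS="--xla_cpu_enable_xnnpack=true"    # Softmax operator optimization

# Sanity check: confirm the variables are exported as expected.
printf 'ANNC_FLAGS=%s\nXLA_FLAGS=%s\n' "$ANNC_FLAGS" "$XLA_FLAGS"
```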

For details about how to use TF Serving to start the inference pressure test, see Starting the Service and Performing a Pressure Test in the TensorFlow Serving Porting Guide.

To help you better understand and use the ANNC features introduced in this document, the following walks through starting the TF Serving service.

  1. Perform TensorFlow graph fusion.
    annc-opt -I /base_model/deepfm/1/ -O /optimized_model/deepfm/1/ lookup_embedding_hash
    cp -r /base_model/deepfm/1/variables /optimized_model/deepfm/1/

  2. Set the environment variables.
    export ENABLE_BISHENG_GRAPH_OPT=""
    export OMP_NUM_THREADS=1
    export TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit --tf_xla_min_cluster_size=16"
    export XLA_FLAGS="--xla_cpu_enable_xnnpack=true"
    export ANNC_FLAGS="--gemm-opt --graph-opt"

  3. Start the TF Serving service.
    /path/to/tensorflow-serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=8889 --model_name=deepfm --model_base_path=/optimized_model/deepfm --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=-1 --xla_cpu_compilation_enabled=true

    The model specified by --model_base_path is not limited to the one in this example; you can download and use other models.

  4. Start the pressure test on the client.
    docker run -it --rm --net host nvcr.io/nvidia/tritonserver:24.05-py3-sdk perf_analyzer --concurrency-range 28:28:1 -p 8561 -f perf.csv -m deepfm --service-kind tfserving -i grpc --request-distribution poisson -b 128 -u localhost:8889 --percentile 99 --input-data=random
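
To measure several concurrency levels rather than a single one, the perf_analyzer command above can be wrapped in a loop. This is an illustrative sketch: the run_perf name and the perf_<level>.csv naming scheme are not part of the original tooling.

```shell
# Illustrative sweep over concurrency levels for the perf_analyzer run
# above. Each level writes its own CSV (perf_<level>.csv is an invented
# naming scheme); server address and options match the example command.
run_perf() {
  c="$1"
  docker run -it --rm --net host nvcr.io/nvidia/tritonserver:24.05-py3-sdk \
    perf_analyzer --concurrency-range "$c:$c:1" -p 8561 -f "perf_${c}.csv" \
    -m deepfm --service-kind tfserving -i grpc --request-distribution poisson \
    -b 128 -u localhost:8889 --percentile 99 --input-data=random
}

# Example sweep (uncomment to run against a live server):
# for c in 7 14 28 56; do run_perf "$c"; done
```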