Description
The TensorFlow Serving ANNC feature provides three optimization functions: TensorFlow graph fusion, XLA graph fusion, and operator optimization. This section describes how to enable each function.
TensorFlow Graph Fusion
Table 1 shows how to use the TensorFlow graph fusion interface.
| Item | Description |
|---|---|
| Command line interface | `annc-opt` |
| Function | Enables the graph fusion feature. |
| Parameter | |
| Example | `annc-opt -I /base_model/deepfm/1/ -O /optimized_model/deepfm/1/ lookup_embedding_hash` |
XLA Graph Fusion
Table 2 shows the XLA graph fusion interface.
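In the end-to-end example later in this section, XLA graph fusion is driven through TensorFlow's standard XLA environment variables. The settings below are copied from that example; the flag glosses in the comments are the usual TensorFlow/XLA interpretations, not ANNC-specific documentation:

```shell
# XLA auto-clustering settings used in the startup example in this section.
#   --tf_xla_auto_jit=2            force-enable XLA JIT compilation of clusters
#   --tf_xla_cpu_global_jit        allow global JIT clustering on CPU
#   --tf_xla_min_cluster_size=16   only compile clusters of at least 16 ops
export TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit --tf_xla_min_cluster_size=16"
# Let the XLA CPU backend use XNNPACK kernels.
export XLA_FLAGS="--xla_cpu_enable_xnnpack=true"
```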
Operator Optimization
For details about the operator optimization interfaces, see Table 3, Table 4, and Table 5.
| Item | Description |
|---|---|
| Environment variable | `ANNC_FLAGS` |
| Function | Enables operator optimization. |
| Example | `export ANNC_FLAGS="--gemm-opt --graph-opt"` |
| Value range | The feature is enabled when the environment variable is not null. |
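Because any non-null value of `ANNC_FLAGS` turns the feature on, the enable/disable condition can be sketched in shell as follows (the `echo` messages are illustrative, not output produced by ANNC):

```shell
# Operator optimization is active whenever ANNC_FLAGS is set and non-empty;
# the flag values themselves select which optimizations run.
export ANNC_FLAGS="--gemm-opt --graph-opt"

if [ -n "${ANNC_FLAGS}" ]; then
  echo "ANNC operator optimization: enabled (${ANNC_FLAGS})"
else
  echo "ANNC operator optimization: disabled"
fi
```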
For details about how to use TF Serving to start the inference pressure test, see Starting the Service and Performing a Pressure Test in the TensorFlow Serving Porting Guide.
To help you understand and use the ANNC feature described in this document, the following example shows how to start the TF Serving service.
- Perform TensorFlow graph fusion.

  ```shell
  annc-opt -I /base_model/deepfm/1/ -O /optimized_model/deepfm/1/ lookup_embedding_hash
  cp -r /base_model/deepfm/1/variables /optimized_model/deepfm/1/
  ```
- Set the environment variables.

  ```shell
  export ENABLE_BISHENG_GRAPH_OPT=""
  export OMP_NUM_THREADS=1
  export TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit --tf_xla_min_cluster_size=16"
  export XLA_FLAGS="--xla_cpu_enable_xnnpack=true"
  export ANNC_FLAGS="--gemm-opt --graph-opt"
  ```
- Start the TF Serving service.

  ```shell
  /path/to/tensorflow-serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server \
    --port=8889 --model_name=deepfm --model_base_path=/optimized_model/deepfm \
    --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=-1 \
    --xla_cpu_compilation_enabled=true
  ```
  The model specified by --model_base_path is only an example; you can download and use other models.
- Start the pressure test on the client.

  ```shell
  docker run -it --rm --net host nvcr.io/nvidia/tritonserver:24.05-py3-sdk perf_analyzer \
    --concurrency-range 28:28:1 -p 8561 -f perf.csv -m deepfm --service-kind tfserving \
    -i grpc --request-distribution poisson -b 128 -u localhost:8889 --percentile 99 \
    --input-data=random
  ```
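Before launching the client, it can help to confirm that the serving port is accepting connections. A minimal sketch using bash's `/dev/tcp` pseudo-device; the `port_open` helper is hypothetical, not part of TF Serving or perf_analyzer:

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) if host:port accepts a TCP connection.
# Uses bash's /dev/tcp pseudo-device, so it requires bash rather than plain sh.
port_open() {
  local host="$1" port="$2"
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Probe the gRPC endpoint started with --port=8889 before running perf_analyzer.
if port_open localhost 8889; then
  echo "tensorflow_model_server reachable on localhost:8889"
else
  echo "tensorflow_model_server not reachable on localhost:8889"
fi
```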