Introduction
This document presents the basic concepts and implementation principles of the thread scheduling optimization feature for TensorFlow Serving (TF Serving), along with comprehensive instructions for deploying these enhancements on the new Kunpeng 920 processor model within openEuler 22.03 LTS SP3.
Kunpeng BoostKit developed a thread scheduling optimization solution to enhance TF Serving inference performance. TensorFlow employs inter-operator thread pools to parallelize independent operators; in high-concurrency scenarios where multiple sessions share the same thread pool, this approach can lead to task contention, substantially degrading computational efficiency for entire graphs. Kunpeng BoostKit's solution addresses this limitation through refined operator scheduling algorithms and advanced thread management optimizations, delivering significant throughput improvements for concurrent model inference.
Implemented as patches integrated into openEuler's sra_tensorflow_adapter repository, these optimizations introduce two new configuration parameters for TF Serving/TensorFlow 2.15:
- Batch operator scheduling (--batch_op_scheduling): Activates optimized operator scheduling and Accelerated Linear Algebra (XLA) thread pool management. This option is recommended when single-core inference latency already meets requirements, as it enhances concurrent processing capability and overall throughput.
- Thread affinity isolation (--task_affinity_isolation): Offers two core binding methods. When TensorFlow native scheduling is used, sequential core binding is recommended; when batch operator scheduling is enabled together with thread affinity isolation, interleaved core binding is recommended.
  - Sequential core binding allocates TensorFlow computing threads to the first K cores and TF Serving communication threads to the remaining cores.
  - Interleaved core binding assigns TensorFlow threads to physical cores and TF Serving threads to virtual cores (recommended when hyper-threading is enabled).
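The two binding schemes can be sketched as follows. This is an illustrative sketch only, not code from the actual patches: the helper names are invented, and the interleaved variant assumes the common Linux layout in which SMT sibling cores are numbered i and i + (number of physical cores).

```python
def sequential_binding(num_cores: int, k: int):
    """Sequential scheme (illustrative): the first k cores host TensorFlow
    compute threads; the remaining cores host TF Serving communication
    threads."""
    compute = list(range(k))
    comm = list(range(k, num_cores))
    return compute, comm


def interleaved_binding(num_physical: int):
    """Interleaved scheme (illustrative): TensorFlow threads go to physical
    cores, TF Serving threads to their SMT siblings. Assumes Linux numbers
    siblings as core i and core i + num_physical."""
    compute = list(range(num_physical))
    comm = list(range(num_physical, 2 * num_physical))
    return compute, comm


# Example: 8 logical cores, with 4 reserved for compute in each scheme.
print(sequential_binding(8, 4))  # ([0, 1, 2, 3], [4, 5, 6, 7])
print(interleaved_binding(4))    # ([0, 1, 2, 3], [4, 5, 6, 7])
```

Note that both schemes partition the same 8 logical cores; the difference is which logical cores count as "compute" when hyper-threading maps two logical cores onto each physical core.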
XLA serves as TensorFlow's optimizing compiler, specifically designed to enhance the execution speed of linear algebra operations. By transforming TensorFlow computational graphs into highly efficient, hardware-specific instructions, XLA delivers significant performance improvements.
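As a small illustration of XLA outside TF Serving (this snippet is not part of the patches; the function name and values are invented), TensorFlow's standard `jit_compile=True` option asks XLA to compile a traced function into a fused, hardware-specific kernel:

```python
import tensorflow as tf

# Toy computation; with jit_compile=True, TensorFlow hands the traced
# graph to XLA, which can fuse the multiply and add into one kernel.
@tf.function(jit_compile=True)
def scaled_square(x):
    return x * x + 2.0

print(scaled_square(tf.constant(3.0)))  # 11.0
```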