
Introduction

This document describes how to deploy the vLLM, vLLM-Ascend, and MindIE Turbo inference frameworks on Atlas 800I A2 inference servers (powered by Kunpeng 920 processors), and how to run and tune the DeepSeek 70B model on them.

As AI computing capabilities advance rapidly and clients demand higher performance, the current Ascend-Kunpeng appliances require inference performance enhancements to remain competitive. The DeepSeek 70B model, created by distilling DeepSeek R1 into LLaMA 70B, can be served by any of the three inference frameworks covered here: vLLM, vLLM-Ascend, and MindIE Turbo.
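As a rough sketch of the deployment workflow, a distilled DeepSeek 70B checkpoint can typically be launched through vLLM's built-in serving command; the model identifier, parallelism degree, and port below are illustrative assumptions, not values prescribed by this document:

```shell
# Illustrative only: serve a DeepSeek-R1 distilled 70B checkpoint with vLLM.
# The model ID, tensor-parallel degree, and context length are assumptions;
# adjust them to the actual checkpoint and the accelerator count on the server.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --tensor-parallel-size 8 \
    --max-model-len 8192 \
    --port 8000
```

With the vLLM-Ascend plugin installed, the same entry point is intended to target Ascend NPUs; MindIE Turbo uses its own service configuration and is covered separately in this document.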

The tuning approaches outlined in this document combine profiling data from performance analysis tools (such as perf and profile) with comparative test results. In practice, the optimization strategies must be tailored to the specific hardware configuration and the observed performance metrics.
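As one example of the profiling step, perf can sample CPU hot spots of a running inference server on the Kunpeng host; the sampling rate and duration below are illustrative assumptions, and `<pid>` stands for the inference process ID:

```shell
# Illustrative only: sample call-graph CPU profiles of a running inference
# process with Linux perf. Sampling frequency and duration are assumptions.
perf record -F 99 -g -p <pid> -- sleep 30   # collect 30 s of call-graph samples
perf report --stdio                          # summarize the hottest functions
```

The resulting hot-function breakdown is what the comparative tuning in later sections is based on: functions that dominate the sample counts are the candidates for configuration or code-level optimization.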