Introduction

This document describes how to deploy the DeepSeek 70B model in an environment with a Kunpeng 920 5250 processor and two Atlas 300I Duo inference cards, and how to tune the deployment for performance.

Running the DeepSeek 70B model directly on Ascend and Kunpeng hardware yields suboptimal performance. With AI computing capabilities advancing rapidly and customers demanding higher throughput, the current Ascend-Kunpeng appliances require inference performance enhancements to remain competitive.

The DeepSeek 70B model is derived from the LLaMA 70B architecture through DeepSeek R1 distillation and is deployed via MindIE. Familiarity with four key concepts is therefore essential: knowledge distillation, MindIE, DeepSeek R1, and LLaMA.

Knowledge Distillation

Knowledge distillation, as shown in Figure 1, is a training technique in which a smaller "student" model is trained to mimic the behavior of a larger "teacher" model. Rather than learning only from hard ground-truth labels, the student learns from the teacher's soft probability outputs, which capture richer relationships between categories.

Figure 1 Knowledge distillation principle
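The core of the technique is a loss that pulls the student's temperature-softened output distribution toward the teacher's. The sketch below is a minimal, framework-free illustration (the function names and the toy logits are assumptions, not part of any DeepSeek or MindIE API); real training combines this term with a standard cross-entropy loss on ground-truth labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: a higher T yields softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between teacher and student soft distributions.

    The loss is scaled by T^2 so its gradient magnitude stays
    comparable across different temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return (temperature ** 2) * kl

# The closer the student tracks the teacher, the smaller the loss.
aligned = distillation_loss([4.0, 1.0, 0.5], [3.9, 1.1, 0.4])
mismatched = distillation_loss([4.0, 1.0, 0.5], [0.5, 1.0, 4.0])
```

Because the soft targets assign nonzero probability to every class, the student also learns how the teacher ranks the wrong answers, which is the "richer relationship" the paragraph above refers to.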

MindIE

Mind Inference Engine (MindIE) is an inference acceleration suite provided by Huawei Ascend for various AI scenarios. As shown in Figure 2, through layered open AI capabilities, it supports diversified AI service requirements and empowers a large number of models by leveraging the computing power of Ascend hardware devices. It is compatible with multiple mainstream AI frameworks and connects to different types of Ascend AI Processors. With multi-layer programming interfaces, it helps users quickly build inference services based on the Ascend platform.

Figure 2 MindIE architecture

DeepSeek R1

DeepSeek R1 replaces standard multi-head attention with multi-head latent attention (MLA) in every Transformer layer, as shown in Figure 3. In the feedforward sublayers, the first three layers retain conventional feedforward network (FFN) structures, while layers 4 through 61 use mixture of experts (MoE) layers in place of the FFN.

Figure 3 DeepSeek R1 model structure
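In an MoE layer, a gating network routes each token to a small subset of expert FFNs, so only a fraction of the parameters is active per token. The following is a simplified top-k routing sketch in plain Python; the function name, the toy gate matrix, and k=2 are illustrative assumptions and do not reflect DeepSeek R1's actual gating implementation (which also includes shared experts and load-balancing terms).

```python
import math

def top_k_gating(hidden, gate_weights, k=2):
    """Route a token to the k experts with the highest gate scores.

    hidden: token representation (list of floats)
    gate_weights: one gate weight vector per expert
    Returns (chosen expert indices, renormalized routing weights).
    """
    # Gate logits: dot product of the token with each expert's gate vector.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in gate_weights]
    # Keep the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over just the selected logits gives the mixing weights.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

# The token is processed only by its chosen experts, and their outputs
# are combined using the routing weights, keeping compute sparse.
experts, weights = top_k_gating([1.0, 0.0],
                                [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2)
```

This sparsity is why an MoE layer can hold many experts' worth of parameters while keeping per-token compute close to that of a single dense FFN.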

LLaMA

LLaMA is an efficient large language model series released by Meta that delivers scalable solutions for diverse natural language processing (NLP) tasks. Its open-source availability, adaptability, and optimized performance have led to broad adoption across academia, industry, and enterprise applications; compared with other LLMs, LLaMA stands out for its computational efficiency and accessibility. As shown in Figure 4, the model architecture comprises stacked Attention and MLP layers.

Figure 4 LLaMA model structure
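Each of those stacked layers is a pre-norm residual block: the input is normalized (LLaMA uses RMSNorm rather than LayerNorm), passed through the sublayer, and added back to the residual stream. The sketch below shows only this data flow; the identity sublayers stand in for the real learned attention and MLP projections and are purely illustrative.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm, the normalization LLaMA uses instead of LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def decoder_block(x, attention, mlp):
    """One pre-norm block: normalize -> sublayer -> residual add, twice."""
    h = [a + b for a, b in zip(x, attention(rms_norm(x)))]
    return [a + b for a, b in zip(h, mlp(rms_norm(h)))]

# Identity stand-ins just to exercise the data flow; real sublayers
# are learned attention and MLP transformations.
identity = lambda v: v
out = decoder_block([1.0, 2.0, 3.0], identity, identity)
```

The full model simply repeats this block, so the residual stream carries information end to end while each layer contributes an additive update.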