Model Quantization
Atlas 800I A2 inference servers support quantized inference for models such as Llama3.1-70B and Qwen2.5-70B. Quantized inference depends on the quantized weights generated by the msModelSlim tool.
- Install msModelSlim.

```shell
git clone -b br_release_MindStudio_8.0.RC1_20260324 https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh
```
- Quantize the model. (The following uses DeepSeek-R1-Distill-Llama-70B W8A8 quantization as an example. For details about other quantization methods, see LLaMA quantization cases.)
- Go to the msit/msmodelslim/example/Llama directory and run the following command:
```shell
python3 quant_llama.py \
    --model_path {Floating-point weight path} \
    --save_directory {Path of W8A8-quantized weights} \
    --calib_file ../common/boolq.jsonl \
    --device_type npu \
    --disable_level L5 \
    --anti_method m3 \
    --act_method 3
```

- Start the model performance test. For details, see Performance Test Method.
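For repeated runs, the quantization step above can be wrapped in a small shell function that validates its inputs before launching the long-running job. This is a sketch, not part of the msModelSlim tool: the function name `run_quant` and the path arguments are placeholders you would adapt to your own environment, while the `quant_llama.py` flags are taken verbatim from the command shown above.

```shell
# Hypothetical wrapper around the quant_llama.py call shown above.
# Usage: run_quant <floating-point weight dir> <output dir for W8A8 weights>
run_quant() {
    model_path="$1"
    save_dir="$2"

    # Fail early if the floating-point weights are missing, instead of
    # discovering the problem partway into a long quantization run.
    if [ ! -d "$model_path" ]; then
        echo "error: model path $model_path does not exist" >&2
        return 1
    fi
    mkdir -p "$save_dir"

    # Same flags as the W8A8 example in this section.
    python3 quant_llama.py \
        --model_path "$model_path" \
        --save_directory "$save_dir" \
        --calib_file ../common/boolq.jsonl \
        --device_type npu \
        --disable_level L5 \
        --anti_method m3 \
        --act_method 3
}
```

The early directory check matters here because quantizing a 70B model is a long job; rejecting a mistyped path up front is cheaper than failing after calibration starts.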