Model Quantization
Atlas 800I A2 inference servers support quantized inference for models such as Llama3.1-70B and Qwen2.5-70B. Quantized inference depends on the quantized weights generated by the msModelSlim tool.
- Install msModelSlim.

```shell
git clone -b br_release_MindStudio_8.0.RC1_20260324 https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh
```
- Quantize the model. (The following uses DeepSeek-R1-Distill-Llama-70B W8A8 quantization as an example. For details about other quantization methods, see LLaMA quantization cases.)
- Go to the msit/msmodelslim/example/Llama directory and run the following command:
```shell
python3 quant_llama.py \
    --model_path {Floating-point weight path} \
    --save_directory {Path of W8A8-quantized weights} \
    --calib_file ../common/boolq.jsonl \
    --device_type npu \
    --disable_level L5 \
    --anti_method m3 \
    --act_method 3
```

- Start the model performance test. For details, see Performance Test Method.
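For repeated runs, the quantization step above can be wrapped in a small shell function that validates its inputs before launching the long-running job. This is a sketch, not part of the msModelSlim tool: the function name `run_quant` and the path arguments are placeholders you would adapt to your own environment, while the `quant_llama.py` flags are taken verbatim from the command shown above.

```shell
# Hypothetical wrapper around the quant_llama.py call shown above.
# Usage: run_quant <floating-point weight dir> <output dir for W8A8 weights>
run_quant() {
    model_path="$1"
    save_dir="$2"

    # Fail early if the floating-point weights are missing, instead of
    # discovering the problem partway into a long quantization run.
    if [ ! -d "$model_path" ]; then
        echo "error: model path $model_path does not exist" >&2
        return 1
    fi
    mkdir -p "$save_dir"

    # Same flags as the W8A8 example in this section.
    python3 quant_llama.py \
        --model_path "$model_path" \
        --save_directory "$save_dir" \
        --calib_file ../common/boolq.jsonl \
        --device_type npu \
        --disable_level L5 \
        --anti_method m3 \
        --act_method 3
}
```

The early directory check matters here because quantizing a 70B model is a long job; rejecting a mistyped path up front is cheaper than failing after calibration starts.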