
Model Quantization

Atlas 800I A2 inference servers support quantized inference for models such as Llama3.1-70B and Qwen2.5-70B. Quantized inference depends on the quantized weights generated by the msModelSlim tool.

  1. Install msModelSlim.
    git clone -b br_release_MindStudio_8.0.RC1_20260324 https://gitee.com/ascend/msit.git
    cd msit/msmodelslim
    bash install.sh
  2. Quantize the model. The following uses DeepSeek-R1-Distill-Llama-70B W8A8 quantization as an example; for other quantization methods, see the LLaMA quantization cases. Go to the msit/msmodelslim/example/Llama directory and run the following command:
    python3 quant_llama.py --model_path {Floating-point weight path} --save_directory {Path of W8A8-quantized weights} --calib_file ../common/boolq.jsonl --device_type npu --disable_level L5 --anti_method m3 --act_method 3
  3. Start the model performance test. For details, see Performance Test Method.
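W8A8 means both the weights (W8) and the activations (A8) are represented as 8-bit integers. The NumPy sketch below illustrates the general idea behind such quantization, using symmetric per-tensor int8 quantization with a simple scale factor; it is only a conceptual illustration, not the algorithm msModelSlim implements, which additionally uses calibration data (the boolq.jsonl file above) and anti-outlier processing.

```python
import numpy as np

np.random.seed(0)

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: returns int8 values and a scale."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy weight matrix and activation vector (float32, as in the original checkpoint).
w = np.random.randn(4, 4).astype(np.float32)
a = np.random.randn(4).astype(np.float32)

qw, sw = quantize_int8(w)   # W8: weights stored offline as int8
qa, sa = quantize_int8(a)   # A8: activations quantized at run time

# Integer matmul accumulated in int32, then dequantized with the two scales.
y_int32 = qw.astype(np.int32) @ qa.astype(np.int32)
y = y_int32.astype(np.float32) * sw * sa

# The dequantized result closely approximates the float32 matmul.
print(np.max(np.abs(y - w @ a)))
```

Storing weights as int8 halves memory relative to FP16 and lets the hardware use integer matrix units; the cost is the small rounding error printed above.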