
Matrix Compute Usage Analysis and Performance Optimization

This section describes how to analyze the use of matrix computation in your workload, improve the program's utilization of matrix compute instructions, and thereby raise overall performance.

  1. Use the HPC application analysis feature of the DevKit tuner tool to capture the target program's matrix compute instruction usage. Run the following command:
    mpirun -np 1 devkit tuner hpc-perf -o /tmp/hpc-perf-detail.tar ./a.out

    Here, a.out is the application binary. During collection, output similar to the following is printed:

    ================================================================================
    Version     : DevKit 26.0.RC1
    CPU Model   : Kunpeng 920 72F8
    Command     : devkit tuner hpc-perf -o /tmp/hpc-perf-detail.tar ./a.out 
    ================================================================================
    =======================================PARAM CHECK=======================================
    PARAM CHECK success.
    ===================================PREPARE FOR COLLECT===================================
    preparation of collection success.
    [WARNING]Operation Tips:
    1. During the collection, the /tmp directory is used to store temporary files. Ensure that this directory has sufficient space. 
    2. If the collection is suspended, data may be lost. Ensure that the collection process is complete.
    ==================================COLLECT AND ANALYSIS ==================================
    ...
    ...
    The report /tmp/hpc-perf-detail.tar is generated successfully 
    To view the summary report,you can run: devkit report -i /tmp/hpc-perf-detail.tar
    To view the detail  report,you can import the report to WebUI or IDE
    =====================================RESTORE CLUSTER=====================================
    restore cluster success.
    =========================================FINISH =========================================
  2. After the program finishes, the analysis results are saved to /tmp/hpc-perf-detail.tar. View them with the devkit report tool by running the following command:
    devkit report -i /tmp/hpc-perf-detail.tar
    The returned information is as follows:
    ================================================================================
    Version     : DevKit 26.0.RC1
    CPU Model   : Kunpeng 920 72F8
    Command     : devkit tuner report -i /tmp/hpc-perf-detail.tar 
    ================================================================================
    Elapsed Time                          :               2.2428 s (rank 000, jitter = 0.0000s)
    CPU Utilization                       :                 0.08 % (0.48 out of 608 CPUs)
      Cycles per Instruction (CPI)        :               0.7606  
      Instructions Retired                :           2824211908  
    MPI
    ===
      Top Hotspots on MPI Critical Path
      No data. Re-collect with -L detail --critical-path
      Top MPI Critical Path Segments
      No data. Re-collect with --critical-path
    Instruction Mix
    ===============
      Memory                              :                58.57 %
        Load                              :                56.30 %
        Store                             :                 7.30 %
      Scalar                              :                23.85 %
        Integer                           :                23.19 %
        Floating Point                    :                 0.66 %
      Vector                              :                 3.76 %
        Advanced SIMD                     :                 0.00 %
        SVE (+ loads/stores)              :                 3.46 %
        SME (retired)                     :                 0.30 %
          Integer                         :                 0.00 %
          Floating Point                  :                 0.30 %
      Crypto                              :                 0.00 %
      Branches                            :                13.29 %
        Immediate                         :                13.28 %
        Indirect                          :                 0.01 %
        Return                            :                 0.01 %
      Barriers                            :                 0.00 %
        Instruction Synchronization       :                 0.00 %
        Data Synchronization              :                 0.00 %
        Data Memory                       :                 0.00 %
      Not Retired                         :                 3.43 %
    
    Top-down
    ========
      Retiring                            :                32.87 %
      Backend Bound                       :                59.30 %
        Memory Bound                      :                23.92 %
          L1 Bound                        :                 4.14 %
          L2 Bound                        :                 0.92 %
          L3 or DRAM Bound                :                18.86 %
          Store Bound                     :                 0.00 %
        Core Bound                        :                35.39 %
      Frontend Bound                      :                 6.66 %
      Bad Speculation                     :                 1.17 %
    
    Memory subsystem
    ================
      Average DRAM Bandwidth              :               3.0807 GB/s
        Read                              :               2.4468 GB/s
        Write                             :               0.6339 GB/s
      Max Bandwidth                       :              15.0965 GB/s
      Bandwidth bound                     :                 0.00 %
        Top Hotspots within High Bandwidth
        No data. Re-collect with -L detail
      No data to show

    Use the SME (retired) value in the analysis report to judge the application's use of matrix compute instructions. For more DevKit application performance analysis features, see the DevKit website: "System Performance Analysis Tuner - HPC Application Analysis".