
Matrix Compute Usage Analysis and Performance Optimization

This section describes how to analyze the use of matrix computation in your workload, improve the program's utilization of matrix compute instructions, and thereby raise overall performance.

  1. Use the HPC application analysis feature of the DevKit tuner tool to capture the target program's matrix compute instruction usage. Run the following command:
    mpirun -np 1 devkit tuner hpc-perf -o /tmp/hpc-perf-detail.tar ./a.out

    Here, a.out is the application binary. During collection, output similar to the following is printed:

    ================================================================================
    Version     : DevKit 26.0.RC1
    CPU Model   : Kunpeng 920 72F8
    Command     : devkit tuner hpc-perf -o /tmp/hpc-perf-detail.tar ./a.out 
    ================================================================================
    =======================================PARAM CHECK=======================================
    PARAM CHECK success.
    ===================================PREPARE FOR COLLECT===================================
    preparation of collection success.
    [WARNING]Operation Tips:
    1. During the collection, the /tmp directory is used to store temporary files. Ensure that this directory has sufficient space. 
    2. If the collection is suspended, data may be lost. Ensure that the collection process is complete.
    ==================================COLLECT AND ANALYSIS ==================================
    ...
    ...
    The report /tmp/hpc-perf-detail.tar is generated successfully 
    To view the summary report,you can run: devkit report -i /tmp/hpc-perf-detail.tar
    To view the detail  report,you can import the report to WebUI or IDE
    =====================================RESTORE CLUSTER=====================================
    restore cluster success.
    =========================================FINISH =========================================
  2. After the program finishes, the analysis results are saved to /tmp/hpc-perf-detail.tar. View them with the devkit report tool by running the following command:
    devkit report -i /tmp/hpc-perf-detail.tar
    The returned information is as follows:
    ================================================================================
    Version     : DevKit 26.0.RC1
    CPU Model   : Kunpeng 920 72F8
    Command     : devkit tuner report -i /tmp/hpc-perf-detail.tar 
    ================================================================================
    Elapsed Time                          :               2.2428 s (rank 000, jitter = 0.0000s)
    CPU Utilization                       :                 0.08 % (0.48 out of 608 CPUs)
      Cycles per Instruction (CPI)        :               0.7606  
      Instructions Retired                :           2824211908  
    MPI
    ===
      Top Hotspots on MPI Critical Path
      No data. Re-collect with -L detail --critical-path
      Top MPI Critical Path Segments
      No data. Re-collect with --critical-path
    Instruction Mix
    ===============
      Memory                              :                58.57 %
        Load                              :                56.30 %
        Store                             :                 7.30 %
      Scalar                              :                23.85 %
        Integer                           :                23.19 %
        Floating Point                    :                 0.66 %
      Vector                              :                 3.76 %
        Advanced SIMD                     :                 0.00 %
        SVE (+ loads/stores)              :                 3.46 %
        SME (retired)                     :                 0.30 %
          Integer                         :                 0.00 %
          Floating Point                  :                 0.30 %
      Crypto                              :                 0.00 %
      Branches                            :                13.29 %
        Immediate                         :                13.28 %
        Indirect                          :                 0.01 %
        Return                            :                 0.01 %
      Barriers                            :                 0.00 %
        Instruction Synchronization       :                 0.00 %
        Data Synchronization              :                 0.00 %
        Data Memory                       :                 0.00 %
      Not Retired                         :                 3.43 %
    
    Top-down
    ========
      Retiring                            :                32.87 %
      Backend Bound                       :                59.30 %
        Memory Bound                      :                23.92 %
          L1 Bound                        :                 4.14 %
          L2 Bound                        :                 0.92 %
          L3 or DRAM Bound                :                18.86 %
          Store Bound                     :                 0.00 %
        Core Bound                        :                35.39 %
      Frontend Bound                      :                 6.66 %
      Bad Speculation                     :                 1.17 %
    
    Memory subsystem
    ================
      Average DRAM Bandwidth              :               3.0807 GB/s
        Read                              :               2.4468 GB/s
        Write                             :               0.6339 GB/s
      Max Bandwidth                       :              15.0965 GB/s
      Bandwidth bound                     :                 0.00 %
        Top Hotspots within High Bandwidth
        No data. Re-collect with -L detail
      No data to show

    Use the SME (retired) value in the analysis report to judge the application's use of matrix compute instructions. For more DevKit application performance analysis features, see the DevKit website: "System Performance Analysis Tuner - HPC Application Analysis".