鲲鹏社区首页
中文
注册
我要评分
文档获取效率
文档正确性
内容完整性
文档易理解
在线提单
论坛求助

矩阵转置型示例

查看矩阵行列大小为8192的指标

  1. 并行示例并行数已经调整到128,实际计算时间很短,2048 size不足以展示其功能,为了便于分析有效性,以8192 size运行相关任务。
  2. 以最基础的并行计算(method 1)做为基准case, 在此基础上进行method 2,4,5,6的相关case进行任务分析(由于method 3与method 2类似,不额外进行采集)。
  3. 请根据实际服务器中物理核数选择合适的矩阵size开展任务,可以由小到大的调整,最终确定合适的值。
  1. 运行矩阵行列大小为8192的parallel_matmult示例。
    1
    ./matmul 8192 1
    

    返回信息如下:

    1
    2
    3
    Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 2.751910s
    Matrix multiplication time = 521.832686s
    

    矩阵行列大小为8192情况下,并行计算耗时607秒左右。

  2. 创建矩阵行列大小为8192的parallel_matmult示例的roofline任务。
    1
    devkit tuner roofline -o parallel_matmult_8192 -m region ./matmul 8192 1
    

    返回信息如下:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    Note:
        1. Roofline task is currently only supported on the 920 platform.
        2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
        3. Roofline task collection needs to ensure the application has finished running.
        4. The estimated time of roofline collection is about 3 * application estimated time.
        5. Roofline analysis is available only on physical machines.
        6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 2.793709s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 516.902512s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 8.352283s
    Matrix multiplication time = 496.935396s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/parallel_matmult_8192.json
    The roofline html report: /matrix_multiplication/parallel_matmult_8192.html
    
  3. 查看parallel_matmult_8192_html报告。
    图1 parallel_matmult_8192_html报告

    此时获取的roofs的并行度为128,获取到Elapsed Time 516.824s,GFLOP Count 1099.512,Performance 2.127 GFLOPS。

    后续示例将以parallel_matmult_8192_html报告结果作为基准进行比较。

调优思路

  1. 图1中所有的点均处于Memory bound区域。
  2. 当前内存瓶颈在DDR和L3,主要是DDR。
  3. 缓存局部效应:L1的计算密度(FLOP/BYTE)最大,即L1上命中的数据最少。
    图2 数据的不同寻址Size
    图3 数据的不同寻址方式
    • Sequential load:逐步读取数组的每个元素,即full cache line utilization为100%(good)。
    • Stride load:从缓存行读取数组的每第4个元素,即cache line utilization为25% (worse),该情况下导致L2 Bytes大于L1 Bytes,因此L2 Flop/Byte小于L1 Flop/Byte。
    • Random load:随机访问情况下,缓存访问没有局部性,cache line utilization较随机且必定很差(bad)。

从上述分析可以知道,当前的程序的瓶颈点在于内存,在于L1 cache利用率不足。

操作步骤

通过对矩阵B进行转置,保证cache line对齐寻址,实现Sequential load模式,提高cache命中率,优化并行示例中的性能瓶颈。

图4 矩阵转置示例
图5 矩阵转置代码
  1. 运行矩阵行列大小为8192的transpose_B_matmult示例。
    1
    ./matmul 8192 2
    

    返回信息如下:

    1
    2
    3
    Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
    Initialization time = 2.752044s
    Matrix multiplication time = 12.562781s
    

    矩阵行列大小为8192情况下,并行计算耗时12.5秒左右。

  2. 创建矩阵行列大小为8192的transpose_B_matmult示例的Roofline任务。
    1
    devkit tuner roofline  -m region -o transpose_B_matmult_8192 ./matmul 8192 2
    

    返回信息如下:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    Note:
      1. Roofline task is currently only supported on the 920 platform.
      2. The application must be a binary file in ELF format.
      3. Roofline task collection needs to ensure the application has finished running.
      4. The estimated time of roofline collection is about 3 * application estimated time.
      5. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
    Initialization time = 2.793919s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 10.419904s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
    Initialization time = 8.543225s
    Matrix multiplication time = 11.596018s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/transpose_B_matmult_8192.json
    The roofline html report: /matrix_multiplication/transpose_B_matmult_8192.html
    
  3. 查看transpose_B_matmult_8192_html报告。
    图6 transpose_B_matmult_8192_html报告

    此时获取的roofs的并行度为128,获取到Elapsed Time 10.017s,GFLOP Count 1099.512,Performance 109.763 GFLOPS。

优化效果

  1. L1处于Memory Bound区域,L2、L3和DDR处于Compute and Memory Bound区域。
  2. 当前内存瓶颈主要在L1和L3,较好的利用了高速cache介质。
  3. 缓存局部效应Cache line utilization相比与之前变好,计算密度(Flop/Byte):L1 < L2 < L3 < DDR。
表1 性能对比分析

case

Elapsed Time(s)

GFLOP Count

Performance

单位时间性能倍率(相比于前一case)

端到端性能倍率(相比于前一case)

单位时间性能倍率(相比于基准case)

端到端性能倍率(相比于基准case)

parallel_matmult_8192

516.824

1099.512

2.127

--

--

--

--

transpose_B_matmult_8192

10.017

1099.512

109.763

51.595

51.595

51.595

51.595

图7 对比分析
  • Going UP:Performance提升了50.595倍。
  • Going RIGHT:优化访存,增加程序性能优化的空间。

    Web模式的Roofline分析任务支持对比任务,可以使用Web模式查看对比分析结果。