
Matrix Transposition Example

Viewing the Metrics for a Matrix Size of 8192

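  1. The parallelism of the parallel example has been raised to 128, so the actual computation time is very short and a size of 2048 is too small to show the effect. To make the analysis meaningful, the tasks are run with a size of 8192.
  2. The most basic parallel computation (method 1) serves as the baseline case. Methods 2, 4, 5, and 6 are analyzed against it. Method 3 is similar to method 2, so no separate collection is performed for it.
  3. Choose a matrix size that suits the number of physical cores on your server. You can increase the size step by step from a small value until you find a suitable one.
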
  1. Run the parallel_matmult example with a matrix size of 8192.
    ./matmul 8192 1
    

    The following information is returned:

    Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 2.751910s
    Matrix multiplication time = 521.832686s
    

    With a matrix size of 8192, the parallel computation takes about 522 seconds.

  2. Create a roofline task for the parallel_matmult example with a matrix size of 8192.
    devkit tuner roofline -o parallel_matmult_8192 -m region ./matmul 8192 1
    

    The following information is returned:

    Note:
      1. Roofline task is currently only supported on the 920 platform.
      2. The application must be a binary file in ELF format.
      3. Roofline task collection needs to ensure the application has finished running.
      4. The estimated time of roofline collection is about 3 * application estimated time.
      5. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 2.793709s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 516.902512s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 8.352283s
    Matrix multiplication time = 496.935396s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/parallel_matmult_8192.json
    The roofline html report: /matrix_multiplication/parallel_matmult_8192.html
    
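  3. View the parallel_matmult_8192.html report.
    Figure 1 parallel_matmult_8192.html report

    The roofs are measured at a parallelism of 128. The report shows an Elapsed Time of 516.824 s, a GFLOP Count of 1099.512, and a Performance of 2.127 GFLOPS.

    The following examples use the results of the parallel_matmult_8192.html report as the baseline for comparison.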
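The GFLOP Count in the report appears consistent with the usual operation count for a dense N x N matrix multiplication, 2 * N^3 (one multiply and one add per inner-loop iteration). The minimal Python sketch below assumes that counting convention, which the source does not state, and uses illustrative variable names to reproduce the reported figures from the matrix size and the Elapsed Time.

```python
# Sketch: reproduce the report's GFLOP Count and Performance for N = 8192.
# Assumption (not stated in the source): the tool counts 2 * N**3
# floating-point operations for an N x N matrix multiplication.
N = 8192
ELAPSED_S = 516.824  # Elapsed Time taken from the roofline report

gflop_count = 2 * N**3 / 1e9           # total work in GFLOP
performance = gflop_count / ELAPSED_S  # achieved GFLOPS

print(f"GFLOP Count: {gflop_count:.3f}")
print(f"Performance: {performance:.3f} GFLOPS")
```

Rounded to three decimals, both values agree with the report (1099.512 and 2.127), which suggests that Performance is simply GFLOP Count divided by Elapsed Time.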

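Tuning Approach

  1. All points in Figure 1 fall in the Memory bound region.
  2. The current memory bottleneck is in DDR and L3, mainly DDR.
  3. Cache locality: L1 has the highest arithmetic intensity (FLOP/Byte), which means the least data is hit in L1.
    Figure 2 Different addressing sizes for data
    Figure 3 Different addressing modes for data
    • Sequential load: every element of the array is read in order, so full cache line utilization is 100% (good).
    • Stride load: only every fourth element of the array is read from each cache line, so cache line utilization is 25% (worse). In this case L2 Bytes exceeds L1 Bytes, so L2 FLOP/Byte is lower than L1 FLOP/Byte.
    • Random load: with random access there is no locality, so cache line utilization is unpredictable and invariably poor (bad).

The analysis above shows that the program is bottlenecked by memory, specifically by poor L1 cache utilization.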
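The utilization figures for the load patterns above can be checked with a small model. The Python sketch below is illustrative only: the 64-byte cache line and the 8-byte element size are assumptions (the source states neither), and the function name is hypothetical. It walks an array with a fixed stride and reports the fraction of the fetched cache-line bytes that is actually used.

```python
# Sketch: estimate cache line utilization for a strided array walk.
# Assumptions (not from the source): 64-byte cache lines, 8-byte elements.
def cache_line_utilization(stride, n_access=1024, elem_bytes=8, line_bytes=64):
    """Return used bytes / fetched bytes for n_access strided loads."""
    touched_lines = set()
    for i in range(n_access):
        address = i * stride * elem_bytes
        touched_lines.add(address // line_bytes)  # line holding this element
    used_bytes = n_access * elem_bytes
    fetched_bytes = len(touched_lines) * line_bytes
    return used_bytes / fetched_bytes

for stride in (1, 4, 8192):
    print(f"stride {stride}: {cache_line_utilization(stride):.1%}")
```

Under these assumptions a stride of 1 uses every fetched byte, a stride of 4 uses a quarter of them, and a stride of 8192 (walking down a column of a row-major 8192 x 8192 matrix) uses a single element per fetched line.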

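Procedure

Transposing matrix B makes the accesses follow the cache lines, so the inner loop runs in Sequential load mode. This raises the cache hit rate and relieves the performance bottleneck of the parallel example.

Figure 4 Matrix transposition example
Figure 5 Matrix transposition code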
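The idea behind the transposition can be sketched in a few lines. The Python code below is not the example's actual implementation (the code in Figure 5 is not reproduced here), and the function names are hypothetical. It only shows how the index pattern changes: the naive version reads B down a column in the inner loop, while the transposed version reads a row of the transposed copy, which is contiguous in a row-major layout.

```python
# Sketch of the access-pattern change only; function names are hypothetical
# and this is not the matmul example's own code.
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul_naive(a, b):
    """C = A x B; the inner loop walks B down a column (strided access)."""
    n, p, q = len(a), len(b), len(b[0])
    c = [[0] * q for _ in range(n)]
    for i in range(n):
        for j in range(q):
            for k in range(p):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_transposed_b(a, b):
    """C = A x B; B is transposed once so the inner loop walks two rows."""
    bt = transpose(b)
    n, p, q = len(a), len(b), len(bt)
    c = [[0] * q for _ in range(n)]
    for i in range(n):
        row_a = a[i]
        for j in range(q):
            row_bt = bt[j]  # row j of B^T is column j of B
            acc = 0
            for k in range(p):
                acc += row_a[k] * row_bt[k]  # both operands read sequentially
            c[i][j] = acc
    return c
```

Both functions return the same product. The one-off cost of the transpose is O(N^2), which is small next to the O(N^3) multiplication for large N.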
  1. Run the transpose_B_matmult example with a matrix size of 8192.
    ./matmul 8192 2
    

    The following information is returned:

    Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
    Initialization time = 2.752044s
    Matrix multiplication time = 12.562781s
    

    With a matrix size of 8192, the parallel computation takes about 12.5 seconds.

  2. Create a roofline task for the transpose_B_matmult example with a matrix size of 8192.
    devkit tuner roofline -m region -o transpose_B_matmult_8192 ./matmul 8192 2
    

    The following information is returned:

    Note:
      1. Roofline task is currently only supported on the 920 platform.
      2. The application must be a binary file in ELF format.
      3. Roofline task collection needs to ensure the application has finished running.
      4. The estimated time of roofline collection is about 3 * application estimated time.
      5. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
    Initialization time = 2.793919s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 10.419904s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
    Initialization time = 8.543225s
    Matrix multiplication time = 11.596018s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/transpose_B_matmult_8192.json
    The roofline html report: /matrix_multiplication/transpose_B_matmult_8192.html
    
  3. View the transpose_B_matmult_8192.html report.
    Figure 6 transpose_B_matmult_8192.html report

    The roofs are measured at a parallelism of 128. The report shows an Elapsed Time of 10.017 s, a GFLOP Count of 1099.512, and a Performance of 109.763 GFLOPS.

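Optimization Results

  1. L1 is in the Memory Bound region. L2, L3, and DDR are in the Compute and Memory Bound region.
  2. The memory bottleneck is now mainly in L1 and L3, which shows that the fast cache levels are being used well.
  3. Cache locality: cache line utilization has improved compared with the baseline, and arithmetic intensity (FLOP/Byte) is ordered L1 < L2 < L3 < DDR.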
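The ordering L1 < L2 < L3 < DDR follows from how arithmetic intensity is defined at each memory level: the same FLOP count divided by the bytes moved at that level. The Python sketch below illustrates this with the standard roofline bound. The per-level byte counts are hypothetical placeholders chosen only to show the ordering; they are not values from the report, and the function names are illustrative.

```python
# Sketch: per-level arithmetic intensity and the standard roofline bound.
# The byte counts below are HYPOTHETICAL placeholders, not report data.
def arithmetic_intensity(flop, bytes_moved):
    """FLOP per byte transferred at one memory level."""
    return flop / bytes_moved

def attainable_gflops(intensity, bandwidth_gbs, peak_gflops):
    """Roofline bound: min(compute peak, intensity * memory bandwidth)."""
    return min(peak_gflops, intensity * bandwidth_gbs)

FLOP = 2 * 8192**3  # assumed operation count for N = 8192

# Hypothetical traffic: each level filters requests for the next one.
traffic_bytes = {"L1": 8e12, "L2": 2e12, "L3": 5e11, "DDR": 1e11}

intensity = {
    level: arithmetic_intensity(FLOP, moved)
    for level, moved in traffic_bytes.items()
}
for level, value in intensity.items():
    print(f"{level}: {value:.3f} FLOP/Byte")
```

When most loads hit in the faster caches, each slower level moves fewer bytes for the same work, so its arithmetic intensity is higher and its point sits further to the right on the roofline chart.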
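Table 1 Performance comparison

| Case | Elapsed Time (s) | GFLOP Count | Performance (GFLOPS) | Per-unit-time performance ratio (vs. previous case) | End-to-end performance ratio (vs. previous case) | Per-unit-time performance ratio (vs. baseline case) | End-to-end performance ratio (vs. baseline case) |
|---|---|---|---|---|---|---|---|
| parallel_matmult_8192 | 516.824 | 1099.512 | 2.127 | -- | -- | -- | -- |
| transpose_B_matmult_8192 | 10.017 | 1099.512 | 109.763 | 51.595 | 51.595 | 51.595 | 51.595 |
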

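Figure 7 Comparison analysis
  • Going UP: Performance increased by 50.595x, reaching 51.595 times the baseline.
  • Going RIGHT: memory access is optimized, which raises arithmetic intensity and leaves more headroom for further performance tuning.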
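The ratios above can be reproduced from the two Elapsed Time values. The source does not say how the ratios were computed, so the Python sketch below assumes a plain ratio of elapsed times. Because GFLOP Count is identical in both cases, the same ratio applies to Performance.

```python
# Sketch: derive the speedup from the Elapsed Time values in the reports.
# Assumption: the ratio is baseline elapsed time / optimized elapsed time.
BASE_ELAPSED_S = 516.824  # parallel_matmult_8192
OPT_ELAPSED_S = 10.017    # transpose_B_matmult_8192

speedup = BASE_ELAPSED_S / OPT_ELAPSED_S  # multiple of the baseline
gain = speedup - 1                        # increase over the baseline

print(f"{speedup:.3f} times the baseline, an increase of {gain:.3f}x")
```

This is why the table lists 51.595 while the Going UP bullet cites 50.595: the first is the multiple of the baseline, and the second is the increase over it.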