矩阵转置型示例
查看矩阵行列大小为8192的指标
- 并行示例并行数已经调整到128,实际计算时间很短,2048 size不足以展示其功能,为了便于分析有效性,以8192 size运行相关任务。
- 以最基础的并行计算(method 1)做为基准case, 在此基础上进行method 2,4,5,6的相关case进行任务分析(由于method 3与method 2类似,不额外进行采集)。
- 请根据实际服务器中物理核数选择合适的矩阵size开展任务,可以由小到大的调整,最终确定合适的值。
- 运行矩阵行列大小为8192的parallel_matmult示例。
1
./matmul 8192 1
返回信息如下:
1 2 3
Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0 Initialization time = 2.751910s Matrix multiplication time = 521.832686s
矩阵行列大小为8192情况下,并行计算耗时607秒左右。
- 创建矩阵行列大小为8192的parallel_matmult示例的roofline任务。
1
devkit tuner roofline -o parallel_matmult_8192 -m region ./matmul 8192 1
返回信息如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Note: 1. Roofline task is currently only supported on the 920 platform. 2. The application must be a binary file in ELF format. 3. Roofline task collection needs to ensure the application has finished running. 4. The estimated time of roofline collection is about 3 * application estimated time. 5. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD RFCOLLECT: Start collection for ./matmul RFCOLLECT: Launch application to collect performance metrics of ./matmul Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0 Initialization time = 2.793709s ROOFLINE_EVENTS are initialized. Matrix multiplication time = 516.902512s RFCOLLECT: Launch application to do binary instrumentation of ./matmul Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0 Initialization time = 8.352283s Matrix multiplication time = 496.935396s RFCOLLECT: Launch benchmarks for measuring roofs RFCOLLECT: Processing all collected data RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report. Get roofline report ... The roofline json report: /matrix_multiplication/parallel_matmult_8192.json The roofline html report: /matrix_multiplication/parallel_matmult_8192.html
- 查看parallel_matmult_8192_html报告。
此时获取的roofs的并行度为128,获取到Elapsed Time 516.824s,GFLOP Count 1099.512,Performance 2.127 GFLOPS。
后续示例将以parallel_matmult_8192_html报告结果作为基准进行比较。
调优思路
- 图1中所有的点均处于Memory bound区域。
- 当前内存瓶颈在DDR和L3,主要是DDR。
- 缓存局部效应:L1的计算密度(FLOP/BYTE)最大,即L1上命中的数据最少。
图2 数据的不同寻址Size图3 数据的不同寻址方式
- Sequential load:逐步读取数组的每个元素,即full cache line utilization为100%(good)。
- Stride load:从缓存行读取数组的每第4个元素,即cache line utilization为25% (worse),该情况下导致L2 Bytes大于L1 Bytes,因此L2 Flop/Byte小于L1 Flop/Byte。
- Random load:随机访问情况下,缓存访问没有局部性,cache line utilization较随机且必定很差(bad)。
从上述分析可以知道,当前的程序的瓶颈点在于内存,在于L1 cache利用率不足。
操作步骤
通过对矩阵B进行转置,保证cache line对齐寻址,实现Sequential load模式,提高cache命中率,优化并行示例中的性能瓶颈。
图4 矩阵转置示例

图5 矩阵转置代码

- 运行矩阵行列大小为8192的transpose_B_matmult示例。
1
./matmul 8192 2
返回信息如下:
1 2 3
Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0 Initialization time = 2.752044s Matrix multiplication time = 12.562781s
矩阵行列大小为8192情况下,并行计算耗时12.5秒左右。
- 创建矩阵行列大小为8192的transpose_B_matmult示例的Roofline任务。
1
devkit tuner roofline -m region -o transpose_B_matmult_8192 ./matmul 8192 2
返回信息如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Note: 1. Roofline task is currently only supported on the 920 platform. 2. The application must be a binary file in ELF format. 3. Roofline task collection needs to ensure the application has finished running. 4. The estimated time of roofline collection is about 3 * application estimated time. 5. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD RFCOLLECT: Start collection for ./matmul RFCOLLECT: Launch application to collect performance metrics of ./matmul Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0 Initialization time = 2.793919s ROOFLINE_EVENTS are initialized. Matrix multiplication time = 10.419904s RFCOLLECT: Launch application to do binary instrumentation of ./matmul Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0 Initialization time = 8.543225s Matrix multiplication time = 11.596018s RFCOLLECT: Launch benchmarks for measuring roofs RFCOLLECT: Processing all collected data RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report. Get roofline report ... The roofline json report: /matrix_multiplication/transpose_B_matmult_8192.json The roofline html report: /matrix_multiplication/transpose_B_matmult_8192.html
- 查看transpose_B_matmult_8192_html报告。图6 transpose_B_matmult_8192_html报告
此时获取的roofs的并行度为128,获取到Elapsed Time 10.017s,GFLOP Count 1099.512,Performance 109.763 GFLOPS。
优化效果
- L1处于Memory Bound区域,L2、L3和DDR处于Compute and Memory Bound区域。
- 当前内存瓶颈主要在L1和L3,较好的利用了高速cache介质。
- 缓存局部效应Cache line utilization相比与之前变好,计算密度(Flop/Byte):L1 < L2 < L3 < DDR。
case |
Elapsed Time(s) |
GFLOP Count |
Performance |
单位时间性能倍率(相比于前一case) |
端到端性能倍率(相比于前一case) |
单位时间性能倍率(相比于基准case) |
端到端性能倍率(相比于基准case) |
---|---|---|---|---|---|---|---|
parallel_matmult_8192 |
516.824 |
1099.512 |
2.127 |
-- |
-- |
-- |
-- |
transpose_B_matmult_8192 |
10.017 |
1099.512 |
109.763 |
51.595 |
51.595 |
51.595 |
51.595 |
图7 对比分析

- Going UP:Performance提升了50.595倍。
- Going RIGHT:优化访存,增加程序性能优化的空间。
父主题: 使用Roofline进行性能分析