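Matrix Transpose & Block & Vector Example
This example further reduces the run time of the earlier examples, from the parallel example through the matrix transpose & block example.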
Tuning Approach



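While keeping memory accesses cache-line aligned, transpose matrix B and choose a suitable block size so that the computation can be optimized with vectorized instructions.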
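
The approach above can be sketched in plain Python. This is only an illustration of the loop structure, not the source of the `matmul` program (which is not shown here): the function name `blocked_matmul_transposed` and the `block` parameter are hypothetical, and the innermost loop is written as a scalar dot product where an intrinsics implementation would process several elements per instruction.

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def blocked_matmul_transposed(a, b, block=4):
    """Compute a @ b using a transposed copy of b and block-wise traversal.

    Hypothetical helper for illustration; `block` stands in for the
    block size that would be tuned to the cache on a real machine.
    """
    n, k, p = len(a), len(b), len(b[0])
    bt = transpose(b)  # rows of bt are columns of b, so both operands are read row-wise
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, p, block):
            for k0 in range(0, k, block):
                # Work on one block of a and one block of bt at a time.
                for i in range(i0, min(i0 + block, n)):
                    row_a = a[i]
                    for j in range(j0, min(j0 + block, p)):
                        row_b = bt[j]
                        acc = 0.0
                        # Contiguous dot product: the loop that vector
                        # instructions would replace in a real kernel.
                        for kk in range(k0, min(k0 + block, k)):
                            acc += row_a[kk] * row_b[kk]
                        c[i][j] += acc
    return c


print(blocked_matmul_transposed([[1, 2], [3, 4]], [[5, 6], [7, 8]], block=1))
# [[19.0, 22.0], [43.0, 50.0]]
```

Transposing B first means the innermost loop walks two rows with unit stride, which is the access pattern that makes both cache blocking and vector loads effective.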
Procedure
- Run the intrinsics_transpose_B_matmult example.
./matmul 8192 5
The following information is returned:
Size is 8192, Matrix multiplication method is: 5, Check correctness is: 0
Initialization time = 2.787161s
Matrix multiplication time = 2.600979s
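With a matrix dimension of 8192, the parallel computation takes about 2.6 seconds.
- Create a Roofline task for intrinsics_transpose_B_matmult.
Use the command-line tool to run the roofline analysis task.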
devkit tuner roofline -o intrinsics_transpose_B_matmult_8192 -m region ./matmul 8192 5
The following information is returned:
Note:
1. Roofline task is currently only supported on the 920 platform.
2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
3. Roofline task collection needs to ensure the application has finished running.
4. The estimated time of roofline collection is about 3 * application estimated time.
5. Roofline analysis is available only on physical machines.
6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
RFCOLLECT: Start collection for ./matmul
RFCOLLECT: Launch application to collect performance metrics of ./matmul
Size is 8192, Matrix multiplication method is: 5, Check correctness is: 0
Initialization time = 2.751606s
ROOFLINE_EVENTS are initialized.
Matrix multiplication time = 2.741322s
RFCOLLECT: Launch application to do binary instrumentation of ./matmul
Size is 8192, Matrix multiplication method is: 5, Check correctness is: 0
Initialization time = 8.353003s
Matrix multiplication time = 2.519457s
RFCOLLECT: Launch benchmarks for measuring roofs
RFCOLLECT: Processing all collected data
RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-201408.json
RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-201408.json" to get report.
Get roofline report ...
The roofline json report: /matrix_multiplication/intrinsics_transpose_B_matmult_8192.json
The roofline html report: /matrix_multiplication/intrinsics_transpose_B_matmult_8192.html
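- View the intrinsics_transpose_B_matmult_8192 report.
Figure 4 intrinsics_transpose_B_matmult_8192 report
The roofs in this report were measured with a parallelism of 128. The report shows Elapsed Time 2.652 s, GFLOP Count 1717.987, and Performance 647.781 GFLOPS.
Optimization Results
With intrinsics-based vectorized instructions, the way the computation is carried out changes substantially: the measured computation volume increases by 47.1% (from 1168.231 GFLOP to 1717.987 GFLOP). At the same time, the vectorized instructions deliver an even larger gain in performance, so end-to-end performance still improves significantly. See the following table for details.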
case | Elapsed Time (s) | GFLOP Count | Performance (GFLOPS) | Per-unit-time performance ratio (vs. previous case) | End-to-end performance ratio (vs. previous case) | Per-unit-time performance ratio (vs. baseline case) | End-to-end performance ratio (vs. baseline case)
---|---|---|---|---|---|---|---
parallel_matmult_8192 | 516.824 | 1099.512 | 2.127 | -- | -- | -- | --
transpose_B_matmult_8192 | 10.017 | 1099.512 | 109.763 | 51.595 | 51.595 | 51.595 | 51.595
block_transpose_B_matmult_8192 | 3.646 | 1168.231 | 320.399 | 2.919 | 2.747 | 150.634 | 141.751
intrinsics_transpose_B_matmult_8192 | 2.652 | 1717.987 | 647.781 | 2.013 | 1.369 | 303.181 | 194.003
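
The derived columns of the table can be recomputed from Elapsed Time and GFLOP Count. The sketch below assumes, based on the numbers themselves rather than on any stated definition, that Performance is GFLOP Count divided by Elapsed Time, that the per-unit-time ratio is a ratio of Performance values, and that the end-to-end ratio is a ratio of elapsed times. The names `cases` and `derive` are illustrative only. Recomputed values can differ from the table in the last digits, most visibly in the final row, presumably because the table was produced from slightly different measurements.

```python
# (case, elapsed time in seconds, GFLOP count), copied from the table.
cases = [
    ("parallel_matmult_8192", 516.824, 1099.512),
    ("transpose_B_matmult_8192", 10.017, 1099.512),
    ("block_transpose_B_matmult_8192", 3.646, 1168.231),
    ("intrinsics_transpose_B_matmult_8192", 2.652, 1717.987),
]


def derive(cases):
    """Recompute performance and speedup ratios (assumed definitions)."""
    _, base_t, base_g = cases[0]
    base_perf = base_g / base_t
    rows = []
    prev_t = prev_perf = None
    for name, t, g in cases:
        perf = g / t  # GFLOPS
        rows.append({
            "case": name,
            "gflops": perf,
            "perf_vs_prev": None if prev_perf is None else perf / prev_perf,
            "e2e_vs_prev": None if prev_t is None else prev_t / t,
            "perf_vs_base": perf / base_perf,
            "e2e_vs_base": base_t / t,
        })
        prev_t, prev_perf = t, perf
    return rows


rows = derive(cases)
for r in rows:
    print(r["case"], round(r["gflops"], 3), r["perf_vs_prev"], r["e2e_vs_prev"])

# Growth in computation volume between the block and intrinsics cases.
growth_pct = (cases[3][2] / cases[2][2] - 1) * 100
print(round(growth_pct, 1))  # 47.1
```

Separating the two ratios matters here: the intrinsics case performs more floating-point work, so its GFLOPS roughly doubles while its elapsed time improves by a smaller factor.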


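Roofline analysis tasks in web mode support task comparison, so you can use web mode to view the comparison results.
- Going UP: Performance roughly doubles, while the actual end-to-end speedup is about 1.37x.
- Going RIGHT: a small shift to the right. Vectorized instructions make the computation faster, while the arithmetic intensity (FLOP/Byte) does not change much, as expected.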
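
As a rough illustration of how the two directions are read, a roofline point can be treated as a pair of arithmetic intensity (FLOP/Byte) and performance (GFLOPS). In the sketch below the GFLOP counts and elapsed times come from the table, but the memory-traffic figures (`100.0` and `140.0` GB) are purely hypothetical placeholders, since the byte counts are not given here; the function names are likewise illustrative.

```python
def roofline_point(gflop, gbytes, seconds):
    """Return (arithmetic intensity in FLOP/Byte, performance in GFLOPS)."""
    return gflop / gbytes, gflop / seconds


def shift(before, after):
    """Return (right_factor, up_factor) between two roofline points."""
    return after[0] / before[0], after[1] / before[1]


# GFLOP count and elapsed time are taken from the table.
# The byte counts are HYPOTHETICAL and only chosen so that the
# intensity changes little, matching the qualitative description.
block_case = roofline_point(1168.231, 100.0, 3.646)
intrinsics_case = roofline_point(1717.987, 140.0, 2.652)

right, up = shift(block_case, intrinsics_case)
print(f"right x{right:.2f}, up x{up:.2f}")
```

An `up` factor near 2 with a `right` factor near 1 corresponds to the pattern described above: the point moves mostly vertically on the chart.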