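Serial Example
Start with the most basic serial example. Because the serial version takes a long time to run, a matrix with 2048 rows and 2048 columns is used for the analysis.
This part of the code implements matrix multiplication serially.
Figure 1 Serial example code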
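The code shown in Figure 1 is not reproduced in this text. As a rough illustration only, a serial matrix multiplication can be sketched as a naive triple loop. The sketch below is written in Python, which is not necessarily the language of the original sample, and the function name `serial_matmul` is hypothetical.

```python
def serial_matmul(a, b):
    """Multiply two matrices given as lists of rows, one element at a time."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):          # each row of a
        for j in range(m):      # each column of b
            total = 0.0
            for p in range(k):  # one multiply and one add per step
                total += a[i][p] * b[p][j]
            c[i][j] = total
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(serial_matmul(a, b))
```

Every output element is computed by a single thread in sequence, which is why the run time grows quickly with the matrix size.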

- Run the base_matmult example with a 2048 x 2048 matrix.
./matmul 2048 0
The following information is returned:
Size is 2048, Matrix multiplication method is: 0, Check correctness is: 0
Initialization time = 0.174492s
Matrix multiplication time = 62.657254s
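With a 2048 x 2048 matrix, the serial computation takes about 62.7 seconds.
- Create a Roofline task for the base_matmult example with a 2048 x 2048 matrix.
Use the command-line tool to run the Roofline analysis.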
devkit tuner roofline -o base_matmult_2048 -m region ./matmul 2048 0
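All examples here use the Kunpeng DevKit command-line mode; the Roofline task in web mode can also be used for the analysis.
The following information is returned: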
Note:
1. Roofline task is currently only supported on the 920 platform.
2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
3. Roofline task collection needs to ensure the application has finished running.
4. The estimated time of roofline collection is about 3 * application estimated time.
5. Roofline analysis is available only on physical machines.
6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
RFCOLLECT: Start collection for ./matmul
RFCOLLECT: Launch application to collect performance metrics of ./matmul
Size is 2048, Matrix multiplication method is: 0, Check correctness is: 0
Initialization time = 0.174628s
ROOFLINE_EVENTS are initialized.
Matrix multiplication time = 62.718666s
RFCOLLECT: Launch application to do binary instrumentation of ./matmul
Size is 2048, Matrix multiplication method is: 0, Check correctness is: 0
Initialization time = 0.528283s
Matrix multiplication time = 85.328236s
RFCOLLECT: Launch benchmarks for measuring roofs
RFCOLLECT: Processing all collected data
RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-151117.json
RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-151117.json" to get report.
Get roofline report ...
The roofline json report: /matrix_multiplication/base_matmult_2048.json
The roofline html report: /matrix_multiplication/base_matmult_2048.html
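The output above shows that the application is launched twice during collection: once to gather performance metrics and once under binary instrumentation. As a small sketch of how the two timings could be compared, the snippet below extracts them from an excerpt of that output. The helper name `multiplication_times` is hypothetical and not part of DevKit.

```python
import re

# Excerpt of the devkit output shown above (only the lines needed here).
LOG = """\
RFCOLLECT: Launch application to collect performance metrics of ./matmul
Matrix multiplication time = 62.718666s
RFCOLLECT: Launch application to do binary instrumentation of ./matmul
Matrix multiplication time = 85.328236s
"""


def multiplication_times(log):
    """Return every 'Matrix multiplication time' value in the log, in seconds."""
    pattern = r"Matrix multiplication time = ([0-9.]+)s"
    return [float(t) for t in re.findall(pattern, log)]


metrics_run, instrumented_run = multiplication_times(LOG)
slowdown = instrumented_run / metrics_run
print(f"metrics run: {metrics_run:.2f}s, instrumented run: {instrumented_run:.2f}s")
print(f"instrumentation slowdown: {slowdown:.2f}x")
```

In this run the instrumented launch is roughly 1.36 times slower than the metrics launch, which helps explain why the tool advises budgeting about three times the application's run time for a full collection.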
- View the base_matmult_2048_html report.
Figure 2 base_matmult_2048_html report
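At this point the roofs were measured with a parallelism of 1 (that is, serial). The report shows Elapsed Time 62.699s, GFLOP Count 17.18, and Performance 0.274 GFLOPS.
According to the Roofline analysis, the machine has 128 physical cores but the program runs only 1 thread, so performance can be tuned by increasing the number of parallel threads.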
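The GFLOP Count and Performance figures reported above can be sanity-checked by hand. The sketch below assumes the conventional operation count for a dense N x N matrix product, 2 * N^3 (one multiply and one add per inner-loop step). This is an assumption about the workload, not a description of how DevKit measures it, and the function names are hypothetical.

```python
def matmul_gflop(n):
    """Work for an n x n by n x n product in GFLOP, assuming 2 * n**3 operations."""
    return 2 * n**3 / 1e9


def gflops(gflop, seconds):
    """Achieved rate in GFLOPS for a given amount of work and elapsed time."""
    return gflop / seconds


N = 2048
ELAPSED_S = 62.699  # Elapsed Time taken from the report

work = matmul_gflop(N)
rate = gflops(work, ELAPSED_S)
print(f"GFLOP count: {work:.2f}")
print(f"Performance: {rate:.3f} GFLOPS")
```

Under that assumption the estimate works out to 17.18 GFLOP and 0.274 GFLOPS, which agrees with the values in the report.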