Serial Case
Start with the most basic serial test case. Because the serial case takes a relatively long time to run, use a matrix size of 2048 (a 2048 x 2048 matrix) for the analysis.
The following code implements matrix multiplication in serial mode.
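The source of the matmul binary is not reproduced in this section. As a rough stand-in, here is a minimal sketch of a naive triple-loop serial multiplication, written in Python for brevity. It is an illustration only, not the tutorial's implementation, and the function name `serial_matmul` is hypothetical.

```python
def serial_matmul(a, b):
    """Multiply two square matrices (lists of lists) using a single thread."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):
                # One multiplication and one addition per innermost iteration.
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(serial_matmul(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```

The three nested loops each run n times, so the work grows with the cube of the matrix size, which is why a size of 2048 already takes about a minute in serial mode.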
- Run the base_matmult case with a matrix size of 2048.
    ./matmul 2048 0
Command output:
    Size is 2048, Matrix multiplication method is: 0, Check correctness is: 0
    Initialization time = 0.174492s
    Matrix multiplication time = 62.657254s
With a matrix size of 2048, the serial computation takes approximately 62 seconds.
- Create a roofline task for the base_matmult case with a matrix size of 2048.
Create and analyze the roofline task with the command line tool. The examples use the Kunpeng DevKit in CLI mode.
    devkit tuner roofline -o base_matmult_2048 -m region ./matmul 2048 0
Command output:
    Note:
    1. Roofline task is currently only supported on the 920 platform.
    2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
    3. Roofline task collection needs to ensure the application has finished running.
    4. The estimated time of roofline collection is about 3 * application estimated time.
    5. Roofline analysis is available only on physical machines.
    6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 2048, Matrix multiplication method is: 0, Check correctness is: 0
    Initialization time = 0.174628s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 62.718666s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 2048, Matrix multiplication method is: 0, Check correctness is: 0
    Initialization time = 0.528283s
    Matrix multiplication time = 85.328236s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-151117.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-151117.json" to get report.
    Get roofline report ...
    The roofline json report: /matrix_multiplication/base_matmult_2048.json
    The roofline html report: /matrix_multiplication/base_matmult_2048.html
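The collection log shows the application being launched twice, once to collect performance metrics and once with binary instrumentation, before the roof benchmarks run. The short sketch below compares the logged matrix multiplication times against the plain run from the first step. It only covers the multiplication phase: the log does not report how long the roof benchmarks take, so that part is left out, and the variable names are illustrative.

```python
# Matrix multiplication times copied from the command outputs (seconds).
plain_run = 62.657254         # ./matmul 2048 0 run on its own
metrics_run = 62.718666       # launch that collects performance metrics
instrumented_run = 85.328236  # launch with binary instrumentation

# Slowdown of the instrumented launch relative to the plain run.
slowdown = instrumented_run / plain_run

# Combined multiplication time of the two collection launches.
two_launches = metrics_run + instrumented_run
ratio_to_plain = two_launches / plain_run

print(f"Instrumented slowdown: {slowdown:.2f}x")       # 1.36x
print(f"Two launches: {two_launches:.1f} s "
      f"({ratio_to_plain:.2f}x the plain run)")        # 148.0 s (2.36x)
```

The two launches alone take roughly 2.4 times as long as the plain run, which is in line with the tool's note that collection takes about 3 times the application's run time once the roof benchmarks are included.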
- View the base_matmult_2048.html report.
Figure 2 base_matmult_2048.html
In this case, Parallel Threads of roofs is 1 (serial), Elapsed Time is 62.699 seconds, GFLOP Count is 17.18, and Performance is 0.274 GFLOPS.
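The reported numbers can be cross-checked by hand. The sketch below assumes the conventional operation count for a naive n x n matrix multiplication, 2 * n^3 floating-point operations (n^3 multiplications plus n^3 additions). How the tool actually counts operations is not stated here, so treat the formula as an assumption that happens to agree with the report; the helper names are hypothetical.

```python
def matmul_flop_count(n):
    """Assumed operation count for a naive n x n multiplication: 2 * n^3."""
    return 2 * n ** 3


def gflops(flop_count, seconds):
    """Sustained performance in GFLOPS for a given operation count and time."""
    return flop_count / seconds / 1e9


n = 2048            # matrix size used in this case
elapsed = 62.699    # Elapsed Time from the report, in seconds

flops = matmul_flop_count(n)
print(f"GFLOP count: {flops / 1e9:.2f}")                    # 17.18
print(f"Performance: {gflops(flops, elapsed):.3f} GFLOPS")  # 0.274
```

Both values match the report: 2 * 2048^3 is about 17.18 GFLOP, and dividing by 62.699 seconds gives about 0.274 GFLOPS.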
According to the roofline analysis, there are 128 physical cores but only one parallel thread. Therefore, you can increase the number of parallel threads to tune the program performance.