KML Case
Further shorten the computation time achieved in the preceding cases, from Parallel Case through Matrix Transpose, Block, and Vector Case.
Tuning Strategy
Optimize program performance with the Kunpeng Math Library (KML).
Procedure
- Run the kml_matmult case with a matrix size of 8192.
    ./matmul 8192 6
Command output:
    Size is 8192, Matrix multiplication method is: 6, Check correctness is: 0
    Initialization time = 2.789213s
    Matrix multiplication time = 0.271790s
With a matrix size of 8192, the parallel computation takes approximately 0.27s.
- Create a roofline analysis task for the kml_matmult 8192 case.
    devkit tuner roofline -o kml_matmult_8192 -m region ./matmul 8192 6
Command output:
    Note:
        1. Roofline task is currently only supported on the 920 platform.
        2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
        3. Roofline task collection needs to ensure the application has finished running.
        4. The estimated time of roofline collection is about 3 * application estimated time.
        5. Roofline analysis is available only on physical machines.
        6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 8192, Matrix multiplication method is: 6, Check correctness is: 0
    Initialization time = 2.794584s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 0.432760s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 8192, Matrix multiplication method is: 6, Check correctness is: 0
    Initialization time = 8.353567s
    Matrix multiplication time = 0.283024s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-203926.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-203926.json" to get report.
    Get roofline report ...
    The roofline json report: /matrix_multiplication/kml_matmult_8192.json
    The roofline html report: /matrix_multiplication/kml_matmult_8192.html
- View the kml_matmult_8192.html report.
Figure 2 kml_matmult_8192.html
In this case, Parallel Threads of roofs is 128, Elapsed Time is 0.329 seconds, GFLOP Count is 1100.518, and Performance is 3345.372 GFLOPS.
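
The Performance figure in the report can be cross-checked from the other two values, since GFLOPS is the GFLOP count divided by the elapsed time. The following is a minimal sketch of that arithmetic; the function name `gflops` is hypothetical, and the inputs are the rounded values quoted above, so the result (about 3345 GFLOPS) matches the reported 3345.372 GFLOPS only to within rounding.

```python
# Values copied from the kml_matmult_8192 roofline report quoted above.
GFLOP_COUNT = 1100.518  # total floating-point work, in GFLOP
ELAPSED_S = 0.329       # elapsed time, in seconds (rounded in the report)


def gflops(gflop_count: float, elapsed_s: float) -> float:
    """Return throughput in GFLOPS (GFLOP per second)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return gflop_count / elapsed_s


estimate = gflops(GFLOP_COUNT, ELAPSED_S)
print(f"Estimated performance: {estimate:.1f} GFLOPS")
```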
Tuning Result
With KML, the computation amount returns to its original value (optimization based on mathematical derivation leaves the computation amount unchanged). The math library optimization greatly improves performance per unit time, so end-to-end performance also improves greatly. For details, see the following table.
Compared with the original parallel case, KML shortens the end-to-end program execution time from 516.824s to 0.329s, a performance improvement of about 1570 times.
| Case | Elapsed Time (s) | GFLOP Count | Performance (GFLOPS) | Performance Increase Ratio Per Unit Time (over the Previous Case) | End-to-End Performance Increase Ratio (over the Previous Case) | Performance Increase Ratio Per Unit Time (over the Benchmark Case) | End-to-End Performance Increase Ratio (over the Benchmark Case) |
|---|---|---|---|---|---|---|---|
| parallel_matmult_8192 | 516.824 | 1099.512 | 2.127 | -- | -- | -- | -- |
| transpose_B_matmult_8192 | 10.017 | 1099.512 | 109.763 | 51.595 | 51.595 | 51.595 | 51.595 |
| block_transpose_B_matmult_8192 | 3.646 | 1168.231 | 320.399 | 2.919 | 2.747 | 150.634 | 141.751 |
| intrinsics_transpose_B_matmult_8192 | 2.652 | 1717.987 | 647.781 | 2.013 | 1.369 | 303.181 | 194.003 |
| kml_matmult_8192 | 0.329 | 1100.518 | 3345.372 | 5.188 | 8.097 | 1572.812 | 1570.894 |
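
The ratio columns in the table can be reproduced from the Elapsed Time and Performance columns: the per-unit-time ratio divides two Performance values, and the end-to-end ratio divides two elapsed times. The following is a minimal sketch of that calculation; `CASES` and `increase_ratios` are hypothetical names introduced for illustration. Because the published inputs are rounded, some recomputed ratios may differ from the table in the last digits.

```python
# (case name, elapsed time in seconds, performance in GFLOPS), copied from the table.
CASES = [
    ("parallel_matmult_8192", 516.824, 2.127),
    ("transpose_B_matmult_8192", 10.017, 109.763),
    ("block_transpose_B_matmult_8192", 3.646, 320.399),
    ("intrinsics_transpose_B_matmult_8192", 2.652, 647.781),
    ("kml_matmult_8192", 0.329, 3345.372),
]


def increase_ratios(cases):
    """Return ratios over the previous case and over the benchmark (first) case."""
    _, base_time, base_perf = cases[0]
    ratios = {}
    # Pair each case with the one before it; the benchmark itself has no ratios.
    for (_, prev_time, prev_perf), (name, time_s, perf) in zip(cases, cases[1:]):
        ratios[name] = {
            "perf_vs_previous": perf / prev_perf,
            "e2e_vs_previous": prev_time / time_s,
            "perf_vs_benchmark": perf / base_perf,
            "e2e_vs_benchmark": base_time / time_s,
        }
    return ratios


ratios = increase_ratios(CASES)
for name, row in ratios.items():
    print(name, {key: round(value, 3) for key, value in row.items()})
```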
- Going UP: The performance increases by over 5 times compared with the previous case (from 647.781 to 3345.372 GFLOPS).
- Going RIGHT: Less right shift. Vectorized instructions accelerate the computation process. The computation density (FLOP/BYTE) changes slightly (as expected).
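
The "UP" and "RIGHT" directions refer to the two axes of a roofline chart: performance (GFLOPS) on the vertical axis and computation density (FLOP/BYTE) on the horizontal axis. In the standard roofline model, attainable performance is the smaller of the compute roof and the memory bandwidth multiplied by the computation density. The sketch below illustrates that relationship only; the peak and bandwidth figures are hypothetical placeholders, not values measured by the DevKit roofline task, and `attainable_gflops` is an illustrative name.

```python
def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: min(compute roof, memory bandwidth * FLOP/BYTE)."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Hypothetical roofs, chosen only to make the example readable.
PEAK_GFLOPS = 4000.0   # assumed compute roof
BANDWIDTH_GBS = 200.0  # assumed memory bandwidth, in GB/s

# Below the ridge point a kernel is memory-bound; above it, compute-bound.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GBS

for intensity in (1.0, 5.0, 20.0, 50.0):
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    regime = "memory-bound" if intensity < ridge_point else "compute-bound"
    print(f"{intensity:5.1f} FLOP/BYTE -> at most {bound:7.1f} GFLOPS ({regime})")
```

Under this model, moving right raises the ceiling only while the kernel is memory-bound, whereas moving up means the kernel runs closer to the ceiling at a similar computation density.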