Matrix Transpose Case
Viewing Metrics of a Matrix of Size 8192
- As described in Parallel Case, the compute time with 128 parallel threads is too short at a matrix size of 2048 to show the effect of tuning. To make the analysis meaningful, run the related tasks with a matrix size of 8192.
- Consider the most basic parallel computing (method 1) as the benchmark case. Based on the benchmark case, analyze the cases related to methods 2, 4, 5, and 6. (Method 3 is similar to method 2, and no additional collection is required.)
- Select a proper matrix size based on the number of physical cores on the server. You can adjust the matrix size in ascending order to determine a proper value.
- Run the parallel_matmult case with a matrix size of 8192.
./matmul 8192 1
Command output:
Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
Initialization time = 2.751910s
Matrix multiplication time = 521.832686s
With a matrix size of 8192, the parallel computation takes approximately 521.8 s.
- Create a roofline task for the parallel_matmult case with a matrix size of 8192.
devkit tuner roofline -o parallel_matmult_8192 -m region ./matmul 8192 1
Command output:
Note:
1. Roofline task is currently only supported on the 920 platform.
2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
3. Roofline task collection needs to ensure the application has finished running.
4. The estimated time of roofline collection is about 3 * application estimated time.
5. Roofline analysis is available only on physical machines.
6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
RFCOLLECT: Start collection for ./matmul
RFCOLLECT: Launch application to collect performance metrics of ./matmul
Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
Initialization time = 2.793709s
ROOFLINE_EVENTS are initialized.
Matrix multiplication time = 516.902512s
RFCOLLECT: Launch application to do binary instrumentation of ./matmul
Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
Initialization time = 8.352283s
Matrix multiplication time = 496.935396s
RFCOLLECT: Launch benchmarks for measuring roofs
RFCOLLECT: Processing all collected data
RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json
RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report.
Get roofline report ...
The roofline json report: /matrix_multiplication/parallel_matmult_8192.json
The roofline html report: /matrix_multiplication/parallel_matmult_8192.html
- View the parallel_matmult_8192.html report.
In this case, Parallel Threads of roofs is 128, Elapsed Time is 516.824 seconds, GFLOP Count is 1099.512, and Performance is 2.127 GFLOPS.
The subsequent cases will use the parallel_matmult_8192.html report result as the benchmark for comparison.
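The reported GFLOP Count and Performance can be cross-checked by hand. The sketch below assumes the usual operation count for a dense N x N matrix multiplication, 2 * N^3 floating-point operations (one multiply and one add per inner-loop step); this count is an assumption, not something stated in the report, although it reproduces the reported 1099.512 GFLOP. The function names are illustrative only.

```python
def gflop_count(n: int) -> float:
    """GFLOP for an n x n dense matmul, assuming 2 * n**3 operations."""
    return 2 * n**3 / 1e9


def gflops(n: int, elapsed_s: float) -> float:
    """Achieved performance in GFLOPS for a given elapsed time in seconds."""
    return gflop_count(n) / elapsed_s


if __name__ == "__main__":
    n = 8192
    print(f"GFLOP count: {gflop_count(n):.3f}")
    # Elapsed Time taken from the parallel_matmult_8192.html report.
    print(f"Performance: {gflops(n, 516.824):.3f} GFLOPS")
```

With the Elapsed Time of 516.824 seconds this gives about 2.127 GFLOPS, in line with the report.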
Tuning Strategy
- All points in Figure 1 are in the Memory Bound area.
- The memory bottlenecks lie in DDR and L3, mainly in DDR.
- The computation density (FLOP/BYTE) of L1 is the largest of all memory levels, that is, L1 serves the fewest bytes per floating-point operation.
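The Memory Bound classification above follows the standard roofline model, in which attainable performance is the smaller of the compute peak and the product of computation density and memory bandwidth. The following is a minimal sketch of that rule; the peak and bandwidth values are hypothetical placeholders, not roofs measured by the tool, and the function names are illustrative.

```python
def attainable_gflops(density: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: min(compute roof, density * memory roof)."""
    return min(peak_gflops, density * bandwidth_gbs)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Computation density (FLOP/BYTE) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


def is_memory_bound(density: float, peak_gflops: float, bandwidth_gbs: float) -> bool:
    """A point left of the ridge point is limited by memory bandwidth."""
    return density < ridge_point(peak_gflops, bandwidth_gbs)


if __name__ == "__main__":
    # Hypothetical roofs chosen only for illustration.
    peak, bandwidth = 1000.0, 100.0
    for density in (0.5, 5.0, 50.0):
        bound = "memory" if is_memory_bound(density, peak, bandwidth) else "compute"
        print(density, attainable_gflops(density, peak, bandwidth), bound)
```

Each memory level (L1, L2, L3, DDR) has its own bandwidth roof, so the same kernel yields one point per level.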
Figure 2 Different data addressing sizes
Figure 3 Different data addressing methods
- Sequential load: Each element of the array is read in order, so every byte of a fetched cache line is used. Cache line utilization is 100% (good).
- Stride load: Only every fourth element of the array is read from each cache line, so cache line utilization is 25% (worse). L2 Bytes is then greater than L1 Bytes, and therefore L2 Flop/Byte is less than L1 Flop/Byte.
- Random load: Random access has no locality, so cache line utilization is unpredictable and generally poor.
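The utilization figures in the list above can be reproduced with a simple idealized model. The sketch below assumes a 64-byte cache line and aligned accesses, and it ignores any reuse of a line on a later pass; the line size, the element sizes, and the function name are assumptions made for illustration.

```python
def cache_line_utilization(stride_bytes: int, elem_bytes: int, line_bytes: int = 64) -> float:
    """Fraction of each fetched cache line that a strided loop actually reads."""
    # Bytes of cache line consumed per element read: never less than the
    # element itself, never more than one whole line.
    span = min(max(stride_bytes, elem_bytes), line_bytes)
    return elem_bytes / span


if __name__ == "__main__":
    # Sequential load of 4-byte elements.
    print(cache_line_utilization(stride_bytes=4, elem_bytes=4))
    # Every fourth 4-byte element.
    print(cache_line_utilization(stride_bytes=16, elem_bytes=4))
    # Walking down a column of a row-major 8192 x 8192 matrix of
    # 8-byte values (element type assumed).
    print(cache_line_utilization(stride_bytes=8192 * 8, elem_bytes=8))
```

Under these assumptions, a column walk through a row-major 8192 x 8192 matrix touches a new cache line on every access, which is the access pattern a non-transposed matrix B produces.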
According to the preceding analysis, the performance bottleneck lies in the memory, that is, the L1 cache is underutilized.
Procedure
Transpose matrix B so that its elements are read from contiguous, cache-line-aligned addresses. This replaces strided loads with sequential loads, increases the cache hit ratio, and alleviates the performance bottleneck seen in Parallel Case.
- Run the transpose_B_matmult case with a matrix size of 8192.
./matmul 8192 2
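The idea can be sketched in a few lines. This is not the source of ./matmul, which is not shown here; it is a minimal illustration of the access pattern, and all function names are hypothetical. Python lists do not expose cache behavior, so the sketch only shows that the transposed variant walks both operands row by row and produces the same result.

```python
def matmul_naive(a, b):
    """C = A * B with B read down its columns (stride of one full row)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            total = 0
            for p in range(k):
                total += a[i][p] * b[p][j]  # b[p][j]: strided access
            c[i][j] = total
    return c


def transpose(b):
    """Return B^T as a new list of rows."""
    return [list(col) for col in zip(*b)]


def matmul_transposed_b(a, b):
    """C = A * B computed from B^T, so both operands are read along rows."""
    bt = transpose(b)
    c = []
    for row_a in a:
        # Row of A times row of B^T: both are sequential reads.
        c.append([sum(x * y for x, y in zip(row_a, row_bt)) for row_bt in bt])
    return c


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul_naive(a, b) == matmul_transposed_b(a, b))
```

The transpose itself costs one extra pass over B, which is small compared with the N^3 work of the multiplication.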
Command output:
Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
Initialization time = 2.752044s
Matrix multiplication time = 12.562781s
With a matrix size of 8192, the parallel computation takes approximately 12.6 s.
- Create a roofline task for the transpose_B_matmult case with a matrix size of 8192.
devkit tuner roofline -m region -o transpose_B_matmult_8192 ./matmul 8192 2
Command output:
Note:
1. Roofline task is currently only supported on the 920 platform.
2. The application must be a binary file in ELF format.
3. Roofline task collection needs to ensure the application has finished running.
4. The estimated time of roofline collection is about 3 * application estimated time.
5. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
RFCOLLECT: Start collection for ./matmul
RFCOLLECT: Launch application to collect performance metrics of ./matmul
Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
Initialization time = 2.793919s
ROOFLINE_EVENTS are initialized.
Matrix multiplication time = 10.419904s
RFCOLLECT: Launch application to do binary instrumentation of ./matmul
Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
Initialization time = 8.543225s
Matrix multiplication time = 11.596018s
RFCOLLECT: Launch benchmarks for measuring roofs
RFCOLLECT: Processing all collected data
RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json
RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report.
Get roofline report ...
The roofline json report: /matrix_multiplication/transpose_B_matmult_8192.json
The roofline html report: /matrix_multiplication/transpose_B_matmult_8192.html
- View the transpose_B_matmult_8192.html report.
Figure 6 transpose_B_matmult_8192.html
In this case, Parallel Threads of roofs is 128, Elapsed Time is 10.017 seconds, GFLOP Count is 1099.512, and Performance is 109.763 GFLOPS.
Tuning Result
- L1 is in the Memory Bound area, whereas L2, L3, and DDR are in the Compute and Memory Bound area.
- The memory bottleneck lies in L1 and L3, whereas the high-speed cache is fully utilized.
- The cache line utilization is improved. The computation density (FLOP/BYTE) sequence is: L1 < L2 < L3 < DDR.
| Case | Elapsed Time(s) | GFLOP Count | Performance | Performance Increase Ratio Per Unit Time (over the Previous Case) | End-to-End Performance Increase Ratio (over the Previous Case) | Performance Increase Ratio Per Unit Time (over the Benchmark Case) | End-to-End Performance Increase Ratio (over the Benchmark Case) |
|---|---|---|---|---|---|---|---|
| parallel_matmult_8192 | 516.824 | 1099.512 | 2.127 | -- | -- | -- | -- |
| transpose_B_matmult_8192 | 10.017 | 1099.512 | 109.763 | 51.595 | 51.595 | 51.595 | 51.595 |
- Going UP: Performance rises to 51.595 times that of the benchmark case, that is, an increase of 50.595 times.
- Going RIGHT: Memory access is optimized to better improve the program performance.
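The ratio in the table can be reproduced from the two Elapsed Time values. The sketch below assumes the ratio is defined as benchmark elapsed time divided by tuned elapsed time, which is valid here because both cases have the same GFLOP Count; the function name is illustrative.

```python
def speedup(baseline_s: float, tuned_s: float) -> float:
    """How many times faster the tuned run is than the baseline run."""
    return baseline_s / tuned_s


if __name__ == "__main__":
    # Elapsed Time values from the two roofline reports.
    ratio = speedup(516.824, 10.017)
    print(f"ratio over benchmark: {ratio:.3f}")
    print(f"increase over benchmark: {ratio - 1:.3f}")
```

The first value corresponds to the 51.595 in the table, and the second to the 50.595 increase described under Going UP.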
