Matrix Transpose Case

Viewing Metrics for a Matrix Size of 8192

As described in Parallel Case, with 128 parallel threads the actual computing time is too short, and a matrix size of 2048 is too small to show the effect of tuning. To make the analysis meaningful, run the related tasks with a matrix size of 8192.
Use the most basic parallel computing case (method 1) as the benchmark case, and analyze the cases for methods 2, 4, 5, and 6 against it. (Method 3 is similar to method 2, so no separate collection is required.)
Select a matrix size that suits the number of physical cores on the server. To find a suitable value, increase the matrix size step by step.

Run the parallel_matmult case with a matrix size of 8192.

        
             ./matmul 8192 1

Command output:

        
             Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
Initialization time = 2.751910s
Matrix multiplication time = 521.832686s

With a matrix size of 8192, the parallel computation takes approximately 521.8 s.

Create a roofline task for the parallel_matmult case with a matrix size of 8192.

        
             devkit tuner roofline -o parallel_matmult_8192 -m region ./matmul 8192 1

Command output:

             Note:
    1. Roofline task is currently only supported on the 920 platform.
    2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
    3. Roofline task collection needs to ensure the application has finished running.
    4. The estimated time of roofline collection is about 3 * application estimated time.
    5. Roofline analysis is available only on physical machines.
    6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
RFCOLLECT: Start collection for ./matmul
RFCOLLECT: Launch application to collect performance metrics of ./matmul
Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
Initialization time = 2.793709s
ROOFLINE_EVENTS are initialized.
Matrix multiplication time = 516.902512s
RFCOLLECT: Launch application to do binary instrumentation of ./matmul
Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
Initialization time = 8.352283s
Matrix multiplication time = 496.935396s
RFCOLLECT: Launch benchmarks for measuring roofs
RFCOLLECT: Processing all collected data
RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json
RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report.

Get roofline report ...
The roofline json report: /matrix_multiplication/parallel_matmult_8192.json
The roofline html report: /matrix_multiplication/parallel_matmult_8192.html

View the parallel_matmult_8192.html report.
Figure 1 parallel_matmult_8192.html

In this case, Parallel Threads of roofs is 128, Elapsed Time is 516.824 seconds, GFLOP Count is 1099.512, and Performance is 2.127 GFLOPS.
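The reported GFLOP Count and Performance can be cross-checked by hand. The sketch below assumes the usual operation count for a dense n x n matrix multiplication (n^3 multiplications plus n^3 additions); the function names are illustrative and are not part of the matmul program or of DevKit.

```c
#include <assert.h>
#include <stdint.h>

/* Floating-point operations of a dense n x n matrix multiplication:
 * n^3 multiplications plus n^3 additions, expressed in GFLOP (1e9 FLOP).
 * Assumption: this is how the report counts operations. */
double matmul_gflop(uint64_t n)
{
    assert(n > 0);
    return 2.0 * (double)n * (double)n * (double)n / 1e9;
}

/* Average performance in GFLOPS for an operation count and an elapsed time. */
double gflops(double gflop, double elapsed_s)
{
    assert(elapsed_s > 0.0);
    return gflop / elapsed_s;
}
```

For n = 8192 the first function gives about 1099.512 GFLOP, and dividing that by the elapsed time of 516.824 s gives about 2.127 GFLOPS, which agrees with the report.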

Subsequent cases use the results in the parallel_matmult_8192.html report as the benchmark for comparison.

Tuning Strategy

All points in Figure 1 are in the Memory Bound area.
The memory bottlenecks lie in DDR and L3, mainly in DDR.
L1 has the largest computation density (FLOP/Byte) of all memory levels. In other words, L1 serves the fewest bytes: more data is moved at L2, L3, and DDR than is actually consumed from L1.
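As background for reading the chart, the textbook roofline model can be sketched as follows. This is a simplified two-region view, not the exact classification used by the DevKit report (which also shows a Compute and Memory Bound area). The function names are illustrative, and any peak or bandwidth values passed in are placeholders, not measurements from this server.

```c
#include <assert.h>

/* Roofline model: attainable performance is capped either by the compute
 * peak or by memory bandwidth multiplied by computation density. */
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double flop_per_byte)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && flop_per_byte > 0.0);
    double memory_roof = bandwidth_gbs * flop_per_byte;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Returns 1 when the point lies left of the ridge point, that is, when the
 * memory roof is lower than the compute roof (memory bound). */
int is_memory_bound(double peak_gflops, double bandwidth_gbs,
                    double flop_per_byte)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0);
    return flop_per_byte < peak_gflops / bandwidth_gbs;
}
```

With a hypothetical peak of 1000 GFLOPS and a bandwidth of 100 GB/s, a kernel at 0.5 FLOP/Byte is limited to 50 GFLOPS by the memory roof, whereas a kernel at 20 FLOP/Byte reaches the compute roof.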
Figure 2 Different data addressing sizes

Figure 3 Different data addressing methods
- Sequential load: Each element of the array is read in order, so cache line utilization is 100% (good).
- Stride load: Only every fourth element of the array is read from each cache line, so cache line utilization is 25% (worse). Because whole cache lines are still fetched from L2, L2 Bytes is greater than L1 Bytes, and therefore L2 FLOP/Byte is less than L1 FLOP/Byte.
- Random load: Random access has no locality, so cache line utilization is unpredictable and poor.

According to the preceding analysis, the performance bottleneck is memory access: the L1 cache is underutilized.
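The utilization figures quoted for sequential and stride loads can be reproduced with a small calculation. The sketch below is illustrative only: it assumes that the array starts on a cache line boundary and that the line size is a multiple of the element size, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of each fetched cache line that a strided load actually uses.
 * elem_bytes:   size of one array element in bytes
 * stride_elems: distance between consecutive accesses, in elements
 * line_bytes:   cache line size in bytes (assumed, platform dependent) */
double cache_line_utilization(size_t elem_bytes, size_t stride_elems,
                              size_t line_bytes)
{
    assert(elem_bytes > 0 && stride_elems > 0);
    assert(line_bytes >= elem_bytes && line_bytes % elem_bytes == 0);

    size_t elems_per_line = line_bytes / elem_bytes;
    /* A stride of at least one full line touches one element per line. */
    size_t used = stride_elems >= elems_per_line
                      ? 1
                      : (elems_per_line + stride_elems - 1) / stride_elems;
    return (double)(used * elem_bytes) / (double)line_bytes;
}
```

Assuming 4-byte elements and a 64-byte line, a stride of 1 gives 1.0 and a stride of 4 gives 0.25. Walking down a column of a row-major 8192 x 8192 matrix of 8-byte elements (a stride of 8192) gives 0.125 under the same 64-byte assumption, which illustrates why column-wise access to matrix B is costly.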

Procedure

Transpose matrix B so that the inner loop reads both operands from contiguous addresses within each cache line. This turns the stride load into a sequential load, increases the cache hit ratio, and alleviates the performance bottleneck observed in Parallel Case.

Figure 4 Matrix transpose example

Figure 5 Matrix transpose code
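As a rough illustration of the idea, the transposed multiplication might look like the sketch below. This is not the code shown in Figure 5 or the actual source of the matmul program: the function names, the use of double elements, row-major storage, and the OpenMP pragmas are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Write the transpose of the row-major n x n matrix b into bt. */
void transpose(size_t n, const double *b, double *bt)
{
    assert(n > 0 && b != bt);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            bt[j * n + i] = b[i * n + j];
}

/* Compute c = a * b, where bt holds b transposed. The inner loop walks
 * both a and bt with unit stride, so every fetched cache line is fully used. */
void matmul_transposed_b(size_t n, const double *a, const double *bt,
                         double *c)
{
    assert(n > 0);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    }
}
```

The one-off transpose costs on the order of n^2 memory operations, which is small next to the n^3 work of the multiplication itself. With GCC, the pragmas take effect only when the code is compiled with -fopenmp; otherwise the loops run serially.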

Run the transpose_B_matmult case with a matrix size of 8192.

        
             ./matmul 8192 2

Command output:

        
             Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
Initialization time = 2.752044s
Matrix multiplication time = 12.562781s

With a matrix size of 8192, the parallel computation takes approximately 12.6 s.

Create a roofline task for the transpose_B_matmult case with a matrix size of 8192.

        
             devkit tuner roofline -m region -o transpose_B_matmult_8192 ./matmul 8192 2

Command output:

             Note:
  1. Roofline task is currently only supported on the 920 platform.
  2. The application must be a binary file in ELF format.
  3. Roofline task collection needs to ensure the application has finished running.
  4. The estimated time of roofline collection is about 3 * application estimated time.
  5. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
RFCOLLECT: Start collection for ./matmul
RFCOLLECT: Launch application to collect performance metrics of ./matmul
Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
Initialization time = 2.793919s
ROOFLINE_EVENTS are initialized.
Matrix multiplication time = 10.419904s
RFCOLLECT: Launch application to do binary instrumentation of ./matmul
Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
Initialization time = 8.543225s
Matrix multiplication time = 11.596018s
RFCOLLECT: Launch benchmarks for measuring roofs
RFCOLLECT: Processing all collected data
RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json
RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report.

Get roofline report ...
The roofline json report: /matrix_multiplication/transpose_B_matmult_8192.json
The roofline html report: /matrix_multiplication/transpose_B_matmult_8192.html

View the transpose_B_matmult_8192.html report.
Figure 6 transpose_B_matmult_8192.html

In this case, Parallel Threads of roofs is 128, Elapsed Time is 10.017 seconds, GFLOP Count is 1099.512, and Performance is 109.763 GFLOPS.

Tuning Result

L1 is in the Memory Bound area, whereas L2, L3, and DDR are in the Compute and Memory Bound area.
The remaining memory bottlenecks lie in L1 and L3, and the cache is now fully utilized.
Cache line utilization is improved. The computation density (FLOP/Byte) now increases in the order L1 < L2 < L3 < DDR.

**Table 1** Performance comparison

| Case | Elapsed Time (s) | GFLOP Count | Performance (GFLOPS) | Performance Increase Ratio per Unit Time (over the Previous Case) | End-to-End Performance Increase Ratio (over the Previous Case) | Performance Increase Ratio per Unit Time (over the Benchmark Case) | End-to-End Performance Increase Ratio (over the Benchmark Case) |
|---|---|---|---|---|---|---|---|
| parallel_matmult_8192 | 516.824 | 1099.512 | 2.127 | -- | -- | -- | -- |
| transpose_B_matmult_8192 | 10.017 | 1099.512 | 109.763 | 51.595 | 51.595 | 51.595 | 51.595 |

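Going UP: The performance is 51.595 times that of the benchmark case, that is, an increase of 50.595 times.
Going RIGHT: Memory access is optimized, which raises the computation density (FLOP/Byte) and further improves program performance.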
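The 51.595 ratio in Table 1 appears to correspond to the ratio of the two elapsed times. A minimal sketch of that calculation, with an illustrative function name, is:

```c
#include <assert.h>

/* End-to-end ratio: how many times faster the tuned run completes
 * compared with the baseline run. */
double elapsed_time_ratio(double baseline_s, double tuned_s)
{
    assert(baseline_s > 0.0 && tuned_s > 0.0);
    return baseline_s / tuned_s;
}
```

Calling elapsed_time_ratio(516.824, 10.017) returns approximately 51.595. Dividing the rounded Performance values instead (109.763 / 2.127) gives a slightly different figure of about 51.60, because of rounding in the reported GFLOPS.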
