Matrix Transpose and Block Case

This case further shortens the computation times obtained in the Parallel Case and the Matrix Transpose Case.

Tuning Strategy

  1. Transpose matrix B so that memory accesses are cache-line aligned, and iterate over small blocks to reduce cache misses.
  2. Determine the block size based on the actual cache size and environment configuration.
Figure 1 Matrix transpose
Figure 2 Matrix transpose and block code
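The tuned code itself appears only in the figures above. As a rough illustration, the following C sketch combines both strategy steps: transposing B so the inner loop reads both operands contiguously, and tiling the loops into blocks. `BLOCK` is a hypothetical size; per step 2 of the strategy, it should be chosen so the working set fits the actual cache of the target machine.

```c
#include <stdlib.h>

#define BLOCK 64  /* hypothetical; pick so the blocks of A, Bt, C fit in cache */

/* C += A * B for n x n row-major double matrices.
 * B is first transposed so the innermost loop walks rows of both A and Bt
 * sequentially (cache-line friendly); the loops are then tiled into
 * BLOCK x BLOCK areas to keep the working set resident in cache. */
static void block_transpose_matmul(int n, const double *A,
                                   const double *B, double *C)
{
    double *Bt = malloc((size_t)n * n * sizeof *Bt);
    if (!Bt)
        return;

    /* transpose B into Bt */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Bt[(size_t)j * n + i] = B[(size_t)i * n + j];

    /* blocked multiplication */
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int i = ii; i < ii + BLOCK && i < n; i++)
                    for (int j = jj; j < jj + BLOCK && j < n; j++) {
                        double sum = 0.0;
                        for (int k = kk; k < kk + BLOCK && k < n; k++)
                            sum += A[(size_t)i * n + k] * Bt[(size_t)j * n + k];
                        C[(size_t)i * n + j] += sum;
                    }
    free(Bt);
}
```

With row-major storage, both A and Bt are then traversed sequentially in the innermost loop, so each fetched cache line is fully consumed before eviction; the tiling in turn lets the same block of data be reused across iterations instead of being reloaded from memory.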

Procedure

  1. Run the block_transpose_B_matmult case with a matrix dimension of 8192.
    ./matmul 8192 4
    

    Command output:

    Size is 8192, Matrix multiplication method is: 4, Check correctness is: 0
    Initialization time = 2.787273s
    Matrix multiplication time = 3.711554s
    

    When the matrix dimension is 8192, the blocked and transposed parallel computation takes approximately 3.7s.

  2. Create a roofline task for the block_transpose_B_matmult case with a matrix dimension of 8192.
    devkit tuner roofline -o block_transpose_B_matmult_8192 -m region ./matmul 8192 4
    

    Command output:

    Note:
        1. Roofline task is currently only supported on the 920 platform.
        2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
        3. Roofline task collection needs to ensure the application has finished running.
        4. The estimated time of roofline collection is about 3 * application estimated time.
        5. Roofline analysis is available only on physical machines.
        6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 8192, Matrix multiplication method is: 4, Check correctness is: 0
    Initialization time = 2.794598s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 3.743286s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 8192, Matrix multiplication method is: 4, Check correctness is: 0
    Initialization time = 8.353251s
    Matrix multiplication time = 3.849523s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-195201.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-195201.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/block_transpose_B_matmult_8192.json
    The roofline html report: /matrix_multiplication/block_transpose_B_matmult_8192.html
    
  3. View the block_transpose_B_matmult_8192.html report.
    Figure 3 block_transpose_B_matmult_8192.html

    In this case, the Parallel Threads value of the roofs is 128, the Elapsed Time is 3.646 seconds, the GFLOP Count is 1168.231, and the Performance is 320.399 GFLOPS.

Tuning Result

The inner block loop adds about 6.25% extra computation, increasing the GFLOP count from 1099.512 to 1168.231. However, the performance per unit time increases by 191.9%, and the end-to-end performance is greatly improved. See the following table.

  1. The L1, L2, L3, and DDR points are located in the Compute and Memory Bound area in Figure 3.
  2. The memory bottlenecks lie in L1 and L2, mainly in L2.
  3. The cache line utilization is improved compared with the Matrix Transpose Case. The computation density (FLOP/byte) ordering is: L1 ≈ L2 < L3 < DDR.
Table 1 Performance comparison

| Case | Elapsed Time (s) | GFLOP Count | Performance (GFLOPS) | Per-Unit-Time Increase Ratio (vs. Previous Case) | End-to-End Increase Ratio (vs. Previous Case) | Per-Unit-Time Increase Ratio (vs. Benchmark Case) | End-to-End Increase Ratio (vs. Benchmark Case) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| parallel_matmult_8192 | 516.824 | 1099.512 | 2.127 | -- | -- | -- | -- |
| transpose_B_matmult_8192 | 10.017 | 1099.512 | 109.763 | 51.595 | 51.595 | 51.595 | 51.595 |
| block_transpose_B_matmult_8192 | 3.646 | 1168.231 | 320.399 | 2.919 | 2.747 | 150.634 | 141.751 |

Compare transpose_B_matmult_8192 with block_transpose_B_matmult_8192 on the roofline chart.
  • Going UP: The performance increases by 191.9% (to 2.919 times the previous value).
  • Going RIGHT:
    1. The computation densities at all points have increased (good).
    2. The DDR points move farther away from the cache points, further alleviating the memory access bottleneck.
    3. Cache misses and DDR load operations are reduced, and cached data is reused more, indicating clear performance optimization.
    4. The DDR points are close to the Compute Bound area, so further computation optimization is possible.