
Matrix Transpose, Block, and Vector Case

This case further shortens the calculation times achieved in the Parallel Case and the Matrix Transpose and Block Case.

Tuning Strategy

For details about the SIMD instruction set, see SIMD Instruction Set.

Figure 1 SIMD instruction set
Figure 2 Matrix transpose, block, and vector code 1
Figure 3 Matrix transpose, block, and vector code 2

After cache-line-aligned addresses are ensured, transpose matrix B and select an appropriate block size, then apply vectorized instructions for further optimization.

Procedure

  1. Run the intrinsics_transpose_B_matmult case.
    ./matmul 8192 5
    

    Command output:

    Size is 8192, Matrix multiplication method is: 5, Check correctness is: 0
    Initialization time = 2.787161s
    Matrix multiplication time = 2.600979s
    

    When the matrix dimension is 8192, the computation takes approximately 2.6s.

  2. Create a roofline analysis task for intrinsics_transpose_B_matmult.

    Analyze the roofline task using the command line tool.

    devkit tuner roofline -o intrinsics_transpose_B_matmult_8192 -m region ./matmul 8192 5
    

    Command output:

    Note:
        1. Roofline task is currently only supported on the 920 platform.
        2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
        3. Roofline task collection needs to ensure the application has finished running.
        4. The estimated time of roofline collection is about 3 * application estimated time.
        5. Roofline analysis is available only on physical machines.
        6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 8192, Matrix multiplication method is: 5, Check correctness is: 0
    Initialization time = 2.751606s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 2.741322s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 8192, Matrix multiplication method is: 5, Check correctness is: 0
    Initialization time = 8.353003s
    Matrix multiplication time = 2.519457s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-201408.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-201408.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/intrinsics_transpose_B_matmult_8192.json
    The roofline html report: /matrix_multiplication/intrinsics_transpose_B_matmult_8192.html
    
  3. View the intrinsics_transpose_B_matmult_8192.html report.
    Figure 4 intrinsics_transpose_B_matmult_8192.html

    In this case, the roofs are measured with 128 parallel threads, the Elapsed Time is 2.652 seconds, the GFLOP Count is 1717.987, and the Performance is 647.781 GFLOPS.

Tuning Result

After the intrinsics vectorized instructions are used, the computation process changes significantly: the computation amount increases by 47.1% (from 1168.231 GFLOP to 1717.987 GFLOP), yet the vectorized instructions greatly improve the end-to-end performance. For details, see the following table.

Table 1 Performance comparison

| Case | Elapsed Time (s) | GFLOP Count | Performance (GFLOPS) | Performance Increase Ratio Per Unit Time (over the Previous Case) | End-to-End Performance Increase Ratio (over the Previous Case) | Performance Increase Ratio Per Unit Time (over the Benchmark Case) | End-to-End Performance Increase Ratio (over the Benchmark Case) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| parallel_matmult_8192 | 516.824 | 1099.512 | 2.127 | -- | -- | -- | -- |
| transpose_B_matmult_8192 | 10.017 | 1099.512 | 109.763 | 51.595 | 51.595 | 51.595 | 51.595 |
| block_transpose_B_matmult_8192 | 3.646 | 1168.231 | 320.399 | 2.919 | 2.747 | 150.634 | 141.751 |
| intrinsics_transpose_B_matmult_8192 | 2.652 | 1717.987 | 647.781 | 2.013 | 1.369 | 303.181 | 194.003 |

Compare block_transpose_B_matmult_8192 with intrinsics_transpose_B_matmult_8192 on the roofline chart:
  • Going UP: the per-unit-time performance roughly doubles, and the actual end-to-end performance improves by a factor of 1.37.
  • Going RIGHT: the point shifts right only slightly. The vectorized instructions accelerate the computation, while the computation density (FLOP/byte) changes little, as expected.
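The headline figures in Table 1 can be sanity-checked with a little arithmetic: the 1099.512 GFLOP rows match the textbook 2·N³ operation count for an N×N matrix multiply with N = 8192, and Performance is simply GFLOP Count divided by Elapsed Time (small discrepancies come from the rounded values printed in the report). A quick check:

```python
# Sanity-check the figures reported in Table 1.
N = 8192

# Textbook FLOP count for an N x N matrix multiply: 2 * N^3
# (one multiply and one add per inner-loop step).
gflop = 2 * N**3 / 1e9
print(f"GFLOP count: {gflop:.3f}")  # matches the 1099.512 GFLOP rows

# Performance = GFLOP Count / Elapsed Time (intrinsics case)
perf = 1717.987 / 2.652
print(f"Intrinsics performance: {perf:.1f} GFLOPS")

# End-to-end gain of the intrinsics case over the block case
print(f"End-to-end speedup: {3.646 / 2.652:.3f}x")
```

Note that the intrinsics case reports 1717.987 GFLOP rather than the textbook count, because the vectorized computation performs additional operations, as discussed above.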