Matrix Transpose Case

Viewing Metrics for a Matrix Size of 8192

  1. As described in Parallel Case, with 128 parallel threads the computing time for a matrix size of 2048 is too short to show the effect of tuning. To make the analysis meaningful, run the tasks with a matrix size of 8192.
  2. Use the most basic parallel computing case (method 1) as the benchmark case, and analyze the cases for methods 2, 4, 5, and 6 against it. (Method 3 is similar to method 2, so it is not collected separately.)
  3. Select a matrix size that suits the number of physical cores on the server. You can increase the matrix size step by step to find a suitable value.
  1. Run the parallel_matmult case with a matrix size of 8192.
    ./matmul 8192 1
    

    Command output:

    Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 2.751910s
    Matrix multiplication time = 521.832686s
    

    When the matrix size is 8192, the parallel computation takes approximately 521.8s.

  2. Create a roofline task for the parallel_matmult case with a matrix size of 8192.
    devkit tuner roofline -o parallel_matmult_8192 -m region ./matmul 8192 1
    

    Command output:

    Note:
        1. Roofline task is currently only supported on the 920 platform.
        2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
        3. Roofline task collection needs to ensure the application has finished running.
        4. The estimated time of roofline collection is about 3 * application estimated time.
        5. Roofline analysis is available only on physical machines.
        6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 2.793709s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 516.902512s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 8192, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 8.352283s
    Matrix multiplication time = 496.935396s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/parallel_matmult_8192.json
    The roofline html report: /matrix_multiplication/parallel_matmult_8192.html
    
  3. View the parallel_matmult_8192.html report.
    Figure 1 parallel_matmult_8192.html

    In this case, Parallel Threads of roofs is 128, Elapsed Time is 516.824 seconds, GFLOP Count is 1099.512, and Performance is 2.127 GFLOPS.

    The subsequent cases will use the parallel_matmult_8192.html report result as the benchmark for comparison.
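
The GFLOP Count and Performance values in the report can be cross-checked by hand. The sketch below assumes that the count is the usual 2 x N^3 floating-point operations for a dense N x N matrix multiplication (one multiply and one add per inner-loop iteration). That formula is an assumption, not something the report states, and the function names are illustrative only.

```python
def matmul_gflop(n: int) -> float:
    """GFLOP for an n x n matrix multiply, ASSUMING 2 * n^3 operations."""
    return 2 * n**3 / 1e9


def gflops(gflop_count: float, elapsed_s: float) -> float:
    """Performance in GFLOPS: operations divided by elapsed time."""
    return gflop_count / elapsed_s


count = matmul_gflop(8192)      # matrix size used in this case
perf = gflops(count, 516.824)   # Elapsed Time taken from the report

print(f"GFLOP Count: {count:.3f}")        # 1099.512
print(f"Performance: {perf:.3f} GFLOPS")  # 2.127 GFLOPS
```

Under this assumption, both values match the report, which suggests that the reported Performance is simply GFLOP Count divided by Elapsed Time.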

Tuning Strategy

  1. All points in Figure 1 are in the Memory Bound area.
  2. The memory bottlenecks lie in DDR and L3, mainly in DDR.
  3. The computation density (FLOP/Byte) of L1 is the largest, that is, the least data is transferred through L1.
    Figure 2 Different data addressing sizes
    Figure 3 Different data addressing methods
    • Sequential load: Each element of the array is read in order, so every byte of a fetched cache line is used. Cache line utilization is 100% (good).
    • Stride load: Only every fourth element of the array is read, so cache line utilization is 25% (worse). Whole cache lines are still fetched from L2, so L2 Bytes is greater than L1 Bytes. Therefore, L2 FLOP/Byte is less than L1 FLOP/Byte.
    • Random load: Random access has no locality, so cache line utilization is unpredictable and generally poor.
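
The utilization figures for sequential and stride loads can be reproduced with a small simulation. This is a minimal sketch that assumes 8-byte elements and a 64-byte cache line. Both values are illustrative assumptions and are not taken from the report. The function counts how many bytes the program actually uses against how many bytes are fetched as whole cache lines.

```python
def cache_line_utilization(stride, n_elems=4096, elem_bytes=8, line_bytes=64):
    """Fraction of fetched cache-line bytes that a strided scan actually uses.

    elem_bytes and line_bytes are ASSUMED values for illustration.
    """
    # Byte addresses touched when reading every `stride`-th element.
    addresses = range(0, n_elems * elem_bytes, stride * elem_bytes)
    # Each touched address pulls in its whole cache line.
    lines_touched = {addr // line_bytes for addr in addresses}
    bytes_used = len(addresses) * elem_bytes
    bytes_fetched = len(lines_touched) * line_bytes
    return bytes_used / bytes_fetched


print(cache_line_utilization(stride=1))  # sequential load: 1.0
print(cache_line_utilization(stride=4))  # every fourth element: 0.25
# Walking down a column of a row-major 8192 x 8192 matrix is a stride of 8192.
print(cache_line_utilization(stride=8192, n_elems=8192 * 16))  # 0.125
```

With these assumed sizes, a column walk over a row-major matrix uses only one element per fetched cache line, which is the access pattern that the transpose in the next section is meant to avoid.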

According to the preceding analysis, the performance bottleneck lies in the memory, that is, the L1 cache is underutilized.
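
For readers new to roofline charts, the basic model behind the Memory Bound area can be written in a few lines: attainable performance is the smaller of the compute peak and the product of computation density and memory bandwidth. The peak and bandwidth numbers below are hypothetical placeholders, not measurements from this server. The real roofs are measured by the tool during collection.

```python
def attainable_gflops(flop_per_byte, peak_gflops, bandwidth_gbs):
    """Basic roofline bound: min(compute roof, memory roof)."""
    return min(peak_gflops, flop_per_byte * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Computation density at which the memory roof meets the compute roof."""
    return peak_gflops / bandwidth_gbs


# HYPOTHETICAL roofs, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 1000.0
DDR_BANDWIDTH_GBS = 100.0

ridge = ridge_point(PEAK_GFLOPS, DDR_BANDWIDTH_GBS)
for density in (0.5, 5.0, 20.0):
    bound = attainable_gflops(density, PEAK_GFLOPS, DDR_BANDWIDTH_GBS)
    region = "memory bound" if density < ridge else "compute bound"
    print(f"{density:5.1f} FLOP/Byte -> at most {bound:7.1f} GFLOPS ({region})")
```

A point to the left of the ridge point cannot exceed the memory roof regardless of how fast the cores are, which is why the tuning below targets memory access rather than computation.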

Procedure

Transpose matrix B so that the inner loop reads both operands from contiguous addresses (sequential load). This increases the cache hit ratio and alleviates the performance bottleneck observed in Parallel Case.

Figure 4 Matrix transpose example
Figure 5 Matrix transpose code
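
The index pattern behind the transpose can be sketched as follows. This is an illustrative sketch, not the source code of the matmul program: the function names are hypothetical, and matrices are stored as flat row-major lists so that the stride of each access is visible in the index expression. Python does not reproduce the cache behavior of the compiled program, so the sketch only demonstrates that the two loops compute the same result with different access patterns.

```python
def matmul_naive(a, b, n):
    """C = A * B; all matrices are flat row-major lists of length n * n."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                # The index into b advances by n per iteration: a stride load.
                acc += a[i * n + k] * b[k * n + j]
            c[i * n + j] = acc
    return c


def transpose(m, n):
    """Return the transpose of a flat row-major n x n matrix."""
    t = [0.0] * (n * n)
    for r in range(n):
        for col in range(n):
            t[col * n + r] = m[r * n + col]
    return t


def matmul_transposed_b(a, b, n):
    """Same product, but B is transposed first so both reads are sequential."""
    bt = transpose(b, n)
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                # Both indices advance by 1 per iteration: sequential loads.
                acc += a[i * n + k] * bt[j * n + k]
            c[i * n + j] = acc
    return c


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
print(matmul_naive(a, b, 2))         # [19.0, 22.0, 43.0, 50.0]
print(matmul_transposed_b(a, b, 2))  # [19.0, 22.0, 43.0, 50.0]
```

The one-time cost of the transpose is on the order of N^2 operations, which is small compared with the N^3 work of the multiplication itself.
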
  1. Run the transpose_B_matmult case with a matrix size of 8192.
    ./matmul 8192 2
    

    Command output:

    Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
    Initialization time = 2.752044s
    Matrix multiplication time = 12.562781s
    

    When the matrix size is 8192, the computation takes approximately 12.6s.

  2. Create a roofline task for the transpose_B_matmult case with a matrix size of 8192.
    devkit tuner roofline  -m region -o transpose_B_matmult_8192 ./matmul 8192 2
    

    Command output:

    Note:
      1. Roofline task is currently only supported on the 920 platform.
      2. The application must be a binary file in ELF format.
      3. Roofline task collection needs to ensure the application has finished running.
      4. The estimated time of roofline collection is about 3 * application estimated time.
      5. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
    Initialization time = 2.793919s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 10.419904s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 8192, Matrix multiplication method is: 2, Check correctness is: 0
    Initialization time = 8.543225s
    Matrix multiplication time = 11.596018s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-160802.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-160802.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/transpose_B_matmult_8192.json
    The roofline html report: /matrix_multiplication/transpose_B_matmult_8192.html
    
  3. View the transpose_B_matmult_8192.html report.
    Figure 6 transpose_B_matmult_8192.html

    In this case, Parallel Threads of roofs is 128, Elapsed Time is 10.017 seconds, GFLOP Count is 1099.512, and Performance is 109.763 GFLOPS.

Tuning Result

  1. L1 is in the Memory Bound area, whereas L2, L3, and DDR are in the Compute and Memory Bound area.
  2. The memory bottleneck lies in L1 and L3, and the cache is now fully utilized.
  3. The cache line utilization is improved. The computation density (FLOP/Byte) increases in the order L1 < L2 < L3 < DDR.
Table 1 Performance comparison

| Case | Elapsed Time (s) | GFLOP Count | Performance (GFLOPS) | Performance Increase Ratio Per Unit Time (over the Previous Case) | End-to-End Performance Increase Ratio (over the Previous Case) | Performance Increase Ratio Per Unit Time (over the Benchmark Case) | End-to-End Performance Increase Ratio (over the Benchmark Case) |
|---|---|---|---|---|---|---|---|
| parallel_matmult_8192 | 516.824 | 1099.512 | 2.127 | -- | -- | -- | -- |
| transpose_B_matmult_8192 | 10.017 | 1099.512 | 109.763 | 51.595 | 51.595 | 51.595 | 51.595 |

  • Going UP: Performance is 51.595 times that of the benchmark case.
  • Going RIGHT: Memory access is optimized, which raises the computation density (FLOP/Byte) and improves program performance.
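
The ratios in Table 1 can be recomputed from the two report rows. The sketch below is a plain arithmetic check. The dictionary keys are illustrative names, and the values are copied from the table.

```python
benchmark = {"elapsed_s": 516.824, "gflops": 2.127}    # parallel_matmult_8192
transposed = {"elapsed_s": 10.017, "gflops": 109.763}  # transpose_B_matmult_8192

# End-to-end ratio: how many times shorter the elapsed time is.
time_ratio = benchmark["elapsed_s"] / transposed["elapsed_s"]
# Per-unit-time ratio: how many times higher the GFLOPS value is.
perf_ratio = transposed["gflops"] / benchmark["gflops"]

print(f"Elapsed time ratio: {time_ratio:.3f}")
print(f"Performance ratio:  {perf_ratio:.3f}")
```

Because GFLOP Count is identical in both cases, the two ratios are the same quantity. Any small difference in the last digits comes from the GFLOPS values being rounded to three decimal places in the report.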