Parallel Case

Symptom

The machine has 128 physical cores, but the program runs on only one thread, so it takes a long time to complete.

Tuning Strategy

Use basic OpenMP programming to parallelize the computation. In a multi-core environment, increasing the degree of parallelism is a direct and effective tuning measure.
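The case's own source (Figure 1) is not reproduced here, so the following is only a minimal sketch of what basic OpenMP parallelization of a matrix multiplication might look like. The function name `matmul_parallel` and the row-major layout are assumptions made for illustration, not the case's actual code.

```c
#include <stddef.h>

/*
 * Hypothetical sketch: multiply two n x n row-major matrices, c = a * b.
 * The outer loop is divided among OpenMP threads. Each thread writes a
 * disjoint set of rows of c, so no synchronization is needed.
 * Build with an OpenMP flag (for example, gcc -O2 -fopenmp). Without it
 * the pragma is ignored and the loop runs on a single thread.
 */
void matmul_parallel(size_t n, const double *a, const double *b, double *c)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The number of threads used at run time can be controlled with the standard OMP_NUM_THREADS environment variable.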

Procedure

Figure 1 Parallel case code
  1. Run the parallel_matmult case with a matrix size of 2048.
    ./matmul 2048 1
    

    Command output:

    Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 0.175117s
    Matrix multiplication time = 2.971563s
    

    With a matrix size of 2048, the parallel computation takes approximately 3 seconds.

  2. Create a roofline task for the parallel_matmult case with a matrix size of 2048.
    devkit tuner roofline -o parallel_matmult_2048 -m region ./matmul 2048 1
    

    Command output:

    Note:
        1. Roofline task is currently only supported on the 920 platform.
        2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
        3. Roofline task collection needs to ensure the application has finished running.
        4. The estimated time of roofline collection is about 3 * application estimated time.
        5. Roofline analysis is available only on physical machines.
        6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 0.174164s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 2.996051s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 0.523171s
    Matrix multiplication time = 3.427321s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-154009.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-154009.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/parallel_matmult_2048.json
    The roofline html report: /matrix_multiplication/parallel_matmult_2048.html
    
  3. View the parallel_matmult_2048.html report.
    Figure 2 parallel_matmult_2048.html

    In this case, Parallel Threads of roofs is 128, Elapsed Time is 2.953 seconds, GFLOP Count is 17.18, and Performance is 5.818 GFLOPS.
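The reported figures are consistent with the usual estimate of 2 * n^3 floating-point operations for an n x n matrix product. The sketch below reproduces that arithmetic. The helper names are hypothetical, and the 2 * n^3 count is an assumption that happens to match the report, not a description of how the roofline tool measures it.

```c
/*
 * Estimated floating-point work for an n x n matrix product, in GFLOP:
 * n^3 multiply-add pairs, counted as 2 operations each (assumption).
 */
double matmul_gflop(double n)
{
    return 2.0 * n * n * n / 1e9;
}

/* Achieved performance in GFLOPS for a given elapsed time in seconds. */
double gflops(double gflop, double seconds)
{
    return gflop / seconds;
}
```

For n = 2048 the estimate is about 17.18 GFLOP, and dividing by the elapsed time of 2.953 seconds gives about 5.818 GFLOPS, in line with the report.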

Tuning Result

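Parallelizing the computation with OpenMP reduces the elapsed time from 62.699 seconds to 2.953 seconds, about 21 times faster than the base case. In a multi-core environment, improving the degree of parallelism is a direct and effective tuning measure.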

Table 1 Performance comparison

| Case | Elapsed Time (s) | GFLOP Count | Performance (GFLOPS) | Performance Increase Ratio Per Unit Time (over the Previous Case) | End-to-End Performance Increase Ratio (over the Previous Case) |
| --- | --- | --- | --- | --- | --- |
| base_matmult_2048 | 62.699 | 17.18 | 0.274 | -- | -- |
| parallel_matmult_2048 | 2.953 | 17.18 | 5.818 | 21.232 | 21.232 |
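As a rough check, the end-to-end ratio in the table can be recomputed from the two elapsed times. The helper name below is hypothetical, and the sketch assumes the ratio is simply the previous elapsed time divided by the current one.

```c
/*
 * End-to-end speedup of the current case over the previous case,
 * assuming it is defined as previous elapsed time / current elapsed time.
 */
double end_to_end_ratio(double previous_seconds, double current_seconds)
{
    return previous_seconds / current_seconds;
}
```

With 62.699 seconds for base_matmult_2048 and 2.953 seconds for parallel_matmult_2048, the result is about 21.23. Because the GFLOP Count is the same in both cases, the per-unit-time performance ratio comes out the same.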