
Serial Example

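Start with the most basic serial example. Because the serial version runs slowly, a matrix with 2048 rows and 2048 columns is used for the analysis task.

This code implements matrix multiplication serially.

Figure 1 Serial example code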
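For readers without access to the figure, the idea of a serial matrix multiplication can be sketched as a plain triple loop running on one thread. This is a minimal Python illustration only: it is not the source of the `matmul` binary used in the steps below, and the function name `serial_matmul` is hypothetical.

```python
def serial_matmul(a, b):
    """Multiply two square matrices (lists of lists) on a single thread."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):              # row of a
        for j in range(n):          # column of b
            acc = 0.0
            for k in range(n):      # inner product of row i and column j
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(serial_matmul(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```

Each output element needs n multiplications and n additions, so the work grows as n cubed, which is why the 2048 case takes so long on a single thread.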
  1. Run the base_matmult example with a 2048 x 2048 matrix.
    ./matmul 2048 0 
    

    The following information is returned:

    Size is 2048, Matrix multiplication method is: 0, Check correctness is: 0 
    Initialization time = 0.174492s 
    Matrix multiplication time = 62.657254s
    

    With a 2048 x 2048 matrix, the serial computation takes about 62.7 seconds.

  2. Create a Roofline task for base_matmult with a 2048 x 2048 matrix.

    Use the command-line tool to run the Roofline analysis task.

    devkit tuner roofline -o base_matmult_2048 -m region ./matmul 2048 0
    

    All examples here use the Kunpeng DevKit command-line mode. The Roofline task in web mode can also be used for the analysis.

    The following information is returned:

    Note:
        1. Roofline task is currently only supported on the 920 platform.
        2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
        3. Roofline task collection needs to ensure the application has finished running.
        4. The estimated time of roofline collection is about 3 * application estimated time.
        5. Roofline analysis is available only on physical machines.
        6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 2048, Matrix multiplication method is: 0, Check correctness is: 0
    Initialization time = 0.174628s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 62.718666s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 2048, Matrix multiplication method is: 0, Check correctness is: 0
    Initialization time = 0.528283s
    Matrix multiplication time = 85.328236s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-151117.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-151117.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/base_matmult_2048.json
    The roofline html report: /matrix_multiplication/base_matmult_2048.html
    
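  3. View the base_matmult_2048.html report.
    Figure 2 base_matmult_2048.html report

    The roofs in this report were measured at a parallelism of 1 (that is, serial). The report shows an Elapsed Time of 62.699 s, a GFLOP Count of 17.18, and a Performance of 0.274 GFLOPS.

    According to the Roofline analysis, the machine has 128 physical cores but the program runs only one thread, so performance can be tuned by increasing the number of parallel threads.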
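The figures reported in step 3 can be cross-checked with a little arithmetic. The sketch below assumes the usual operation count for dense matrix multiplication, 2·N³ floating-point operations (N³ multiplications plus N³ additions). That count is an assumption here, not something stated by the tool, although it does reproduce the reported values. The function names are hypothetical.

```python
def matmul_gflop(n):
    """Estimated GFLOP count for an n x n dense multiply, assuming 2 * n^3 operations."""
    return 2 * n**3 / 1e9


def gflops(n, elapsed_seconds):
    """Achieved performance in GFLOPS for the given elapsed time."""
    return matmul_gflop(n) / elapsed_seconds


if __name__ == "__main__":
    n = 2048
    elapsed = 62.699  # Elapsed Time from the report, in seconds
    print(f"GFLOP count: {matmul_gflop(n):.2f}")          # GFLOP count: 17.18
    print(f"Performance: {gflops(n, elapsed):.3f} GFLOPS")  # Performance: 0.274 GFLOPS
```

Under that assumption the estimate agrees with the report: about 17.18 GFLOP of work completed in 62.699 s gives roughly 0.274 GFLOPS on a single thread.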