EN
注册
我要评分
文档获取效率
文档正确性
内容完整性
文档易理解
在线提单
论坛求助

KML示例

对于之前并行示例矩阵转置&block&vector示例运行时间进行优化。

调优思路

通过鲲鹏数学库KML(Kunpeng Math Library)进行优化。

图1 KML代码

接口描述详见鲲鹏数学库文档

操作步骤

  1. 运行矩阵行列大小为8192的kml_matmult示例。
    ./matmul 8192 6

    返回信息如下:

    Size is 8192, Matrix multiplication method is: 6, Check correctness is: 0
    Initialization time = 2.789213s
    Matrix multiplication time = 0.271790s

    矩阵行列大小为8192,并行计算耗时0.27秒左右。

  2. 创建kml_matmult 8192示例的Roofline任务。
    devkit tuner roofline -o kml_matmult_8192 -m region ./matmul 8192 6

    返回信息如下:

    Note:
      1. Roofline task is currently only supported on the 920 platform.
      2. The application must be a binary file in ELF format.
      3. Roofline task collection needs to ensure the application has finished running.
      4. The estimated time of roofline collection is about 3 * application estimated time.
      5. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 8192, Matrix multiplication method is: 6, Check correctness is: 0
    Initialization time = 2.794584s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 0.432760s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 8192, Matrix multiplication method is: 6, Check correctness is: 0
    Initialization time = 8.353567s
    Matrix multiplication time = 0.283024s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-203926.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-203926.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/kml_matmult_8192.json
    The roofline html report: /matrix_multiplication/kml_matmult_8192.html
  3. 查看kml_matmult_8192报告。
    图2 kml_matmult_8192报告

    获取的roofs的并行度为128,获取到Elapsed Time 0.329s,GFLOP Count 1100.518,Performance 3345.372 GFLOPS。

优化效果

使用KML数学库加速后,计算量恢复到了原有的值(基于数学推导优化,计算量基本不变),数据库优化带来了极大的性能提升,因此端到端性能有了极大的提升,详见下表。

从最基础的并行计算,到最终使用数学库优化, 可以看到程序的端到端时间从516.824秒优化到了0.329秒,整体性能提升1570倍。

表1

case

Elapsed Time(s)

GFLOP Count

Performance

单位时间性能倍率(相比于前一case)

端到端性能倍率(相比于前一case)

单位时间性能倍率(相比于基准case)

端到端性能倍率(相比于基准case)

parallel_matmult_8192

516.824

1099.512

2.127

--

--

--

--

transpose_B_matmult_8192

10.017

1099.512

109.763

51.595

51.595

51.595

51.595

block_transpose_B_matmult_8192

3.646

1168.231

320.399

2.919

2.747

150.634

141.751

intrinsics_transpose_B_matmult_8192

2.664

1717.987

647.781

2.013

1.369

303.181

194.003

kml_matmult_8192

0.329

1100.518

3345.372

5.188

8.097

1572.812

1570.894

图3 对比分析
  • Going UP:Performance提升了3倍多。
  • Going RIGHT:较少的右移, 使用向量化指令让计算变得更快,计算密度(FLOP/Byte)没有发生太大的变化 (预期中)。