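Top-Down Model Analysis and Tuning

Principles

Use the perf command or the Tuning Kit tool to collect Top-Down information while the application is running.

Example Top-Down result: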

++++++++++++++++++Top-down Microarchitecture Analysis Summary+++++++++++++++++
Total Execution Instruction:                    11705799905
Total Execution Cycles:                          9265629085
Instructions Per Cycle:                               1.263

Front-End Bound:                             61.990%
    Front-End Latency:                       31.796%
        iTLB Miss:                            0.018%
            L1 iTLB Miss:                     0.018%
            L2 iTLB Miss:                     0.000%
        iCache Miss:                          0.495%
            L1 iCache Miss:                   0.367%
            L2 iCache Miss:                   0.128%
        BP_Misp_Flush:                        0.075%
        OoO Rob Flush:                        0.000%
        Static Predictor Flush:               0.041%
    Front End Bound Bandwidth:               30.193%

Bad Speculation:                              1.687%
    Branch Mispredicts:                       0.292%
        Indirect Branch:                      0.087%
        Push Branch:                          0.004%
        Pop Branch:                           0.081%
        Other Branch:                         0.205%
    Machine Clears:                           1.395%
        Nuke Flush:                           0.220%
        Other Flush:                          1.175%

Retiring:                                    31.584%

Back-End Bound:                               4.739%
    Resource Bound:                           3.691%
        Sync_stall:                           0.000%
        Rob_stall:                            0.089%
        Ptag_stall:                           0.011%
        SaveOpQ_stall:                        0.000%
        PC_buffer_stall:                      0.433%
    Core Bound:                              59.466%
        Divider:                              0.001%
        FSU_stall:                            0.000%
        Exe Ports Util:                      59.465%
            ALU BRU IssueQ Full:             19.347%
            LS IssueQ Full:                   1.037%
            FSU IssueQ Full:                  0.000%
    Memory Bound:                            35.199%
        L1 Bound:                            18.958%
        L2 Bount:                             0.045%
        Intra Cluster Remote L2 Bound:        0.571%
        Local LLC Bound:                      0.726%
        Inter Cluster Local LLC Bound:        7.886%
        Intra Chip Remote LLC Bound:         -3.371%
        Inter Chip Remote LLC Bound:          8.203%
        Intra Chip Local DDR Bound:           1.267%
        Intra Chip Remote DDR Bound:          0.705%
        Inter Chip Remote DDR Bound:          0.210%
        Store Bound:                          0.000%
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
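
As a rough illustration of how to read the summary, the following Python sketch parses a few lines of the report, recomputes Instructions Per Cycle from the two counters, and checks that the four top-level categories add up to 100%. The function name parse_summary and the regular expression are assumptions made for this example; they are not part of perf or Tuning Kit.

```python
import re

# Top-level lines copied from the sample summary above.
SUMMARY = """\
Total Execution Instruction:                    11705799905
Total Execution Cycles:                          9265629085
Front-End Bound:                                   61.990%
Bad Speculation:                                        1.687%
Retiring:                                             31.584%
Back-End Bound:                                      4.739%
"""

def parse_summary(text):
    """Return {label: float} for every 'Label: value' line in the report."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(-?\d+(?:\.\d+)?)%?\s*$", line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics

metrics = parse_summary(SUMMARY)

# IPC is simply retired instructions divided by elapsed cycles.
ipc = metrics["Total Execution Instruction"] / metrics["Total Execution Cycles"]

top_names = ("Front-End Bound", "Bad Speculation", "Retiring", "Back-End Bound")
top_level = {name: metrics[name] for name in top_names}
dominant = max(top_level, key=top_level.get)

print(f"IPC = {ipc:.3f}")
print(f"Top-level sum = {sum(top_level.values()):.3f}%")
print(f"Largest category: {dominant}")
```

For the sample data this reproduces the reported IPC of 1.263 and identifies Front-End Bound as the largest top-level category, which is where the analysis would normally start.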

Usage

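Front-End Bound: the share of pipeline slots in which the front end (instruction fetch, branch prediction, and decode) could not supply enough instructions to the back end.
Back-End Bound: the share of slots in which the back end could not accept new instructions because its execution or memory resources were busy. Because this part is affected most directly by the program's instructions, two of its subcategories are discussed here. Core Bound means the stall comes from the core's execution resources, that is, its capacity to process micro-operations. Memory Bound means the stall comes from the memory subsystem, where "memory" covers both the CPU's L1 to L3 caches and main memory.
Bad Speculation: the overhead of work that is discarded because the CPU speculated incorrectly, mainly branch mispredictions and the pipeline flushes that follow.
Retiring: the share of slots used by instructions that complete (retire) normally. This is useful work rather than overhead, so a higher value generally indicates better pipeline utilization.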

Within Front-End Bound, if the iTLB miss rate is high, consider using hugepages. See Adjusting the Memory Page Size.
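
Before enabling hugepages, it can help to check what the host currently has configured. The sketch below reads the hugepage fields that Linux exposes in /proc/meminfo. The helper name hugepage_status and the SAMPLE text are illustrative assumptions; real values must come from the target machine.

```python
from pathlib import Path

def hugepage_status(meminfo_text):
    """Extract the hugepage-related fields from /proc/meminfo-style text."""
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    status = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if key in wanted and fields:
            status[key] = int(fields[0])  # Hugepagesize is reported in kB
    return status

# Illustrative input only, not measured data.
SAMPLE = """\
MemTotal:       263921664 kB
HugePages_Total:    1024
HugePages_Free:      512
Hugepagesize:       2048 kB
"""
print(hugepage_status(SAMPLE))

meminfo = Path("/proc/meminfo")
if meminfo.exists():  # present on Linux only
    print(hugepage_status(meminfo.read_text()))
```

If HugePages_Total is 0, no explicit hugepages are reserved yet, and the page-size tuning topic referenced above describes how to configure them.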

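If the branch misprediction overhead is high, try the compiler option -funroll-loops (see Compiler Tuning). You can also rearrange the instructions so that jump instructions are spread across different memory regions where possible, or build with PGO/FDO to improve branch prediction accuracy.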

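The share of Core Bound overhead can be reduced by adding compute cores. For Memory Bound overhead, if the L1 and L2 cache miss rates are high, update the compiler and use compiler optimization options to reduce the number of instructions. If the bottleneck is main memory, performance can be improved by increasing memory bandwidth or by using memory with a higher frequency and lower latency.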
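
The guidance in this section can be summarized as a simple lookup from metric to suggestion. The following sketch is one possible way to encode it, not a tool feature. The 5% cut-off HIGH is a hypothetical value, since the text does not define what counts as "high", and the dictionary keys are assumed to match the report labels (the L2 entry is written here as "L2 Bound").

```python
# Hypothetical cut-off in percent; the text does not define what counts as "high".
HIGH = 5.0

def suggest_tuning(metrics, high=HIGH):
    """Return the tuning hints from this section that apply to the given metrics."""
    hints = []
    if metrics.get("iTLB Miss", 0.0) > high:
        hints.append("iTLB: use hugepages")
    if metrics.get("Branch Mispredicts", 0.0) > high:
        hints.append("branch: try -funroll-loops, code layout changes, or PGO/FDO")
    if metrics.get("Core Bound", 0.0) > high:
        hints.append("core: add compute cores")
    cache = metrics.get("L1 Bound", 0.0) + metrics.get("L2 Bound", 0.0)
    if cache > high:
        hints.append("cache: newer compiler and optimization options")
    ddr = sum(v for k, v in metrics.items() if k.endswith("DDR Bound"))
    if ddr > high:
        hints.append("memory: higher bandwidth or lower latency DRAM")
    return hints

# Values copied from the sample summary in this section.
sample = {
    "iTLB Miss": 0.018,
    "Branch Mispredicts": 0.292,
    "Core Bound": 59.466,
    "L1 Bound": 18.958,
    "L2 Bound": 0.045,
    "Intra Chip Local DDR Bound": 1.267,
    "Intra Chip Remote DDR Bound": 0.705,
    "Inter Chip Remote DDR Bound": 0.210,
}
for hint in suggest_tuning(sample):
    print(hint)
```

With the sample values and the assumed 5% cut-off, only the core and cache hints are printed. A real workflow would pick thresholds based on the workload and on how the percentages change between runs.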
