Analyzing and Tuning the Top-Down Model

Principle

Run the Perf command or use the Tuning Kit tool to collect top-down information while the application is running.

Top-down result example:

++++++++++++++++++Top-down Microarchitecture Analysis Summary+++++++++++++++++
Total Execution Instruction:                   11705799905
Total Execution Cycles:                         9265629085
Instructions Per Cycle:                              1.263

Front-End Bound:                                   61.990%
  Front-End Latency:                               31.796%
    iTLB Miss:                                      0.018%
      L1 iTLB Miss:                                 0.018%
      L2 iTLB Miss:                                 0.000%
    iCache Miss:                                    0.495%
      L1 iCache Miss:                               0.367%
      L2 iCache Miss:                               0.128%
    BP_Misp_Flush:                                  0.075%
    OoO Rob Flush:                                  0.000%
    Static Predictor Flush:                         0.041%
  Front End Bound Bandwidth:                       30.193%

Bad Speculation:                                    1.687%
  Branch Mispredicts:                               0.292%
    Indirect Branch:                                0.087%
    Push Branch:                                    0.004%
    Pop Branch:                                     0.081%
    Other Branch:                                   0.205%
  Machine Clears:                                   1.395%
    Nuke Flush:                                     0.220%
    Other Flush:                                    1.175%

Retiring:                                          31.584%

Back-End Bound:                                     4.739%
  Resource Bound:                                   3.691%
    Sync_stall:                                     0.000%
    Rob_stall:                                      0.089%
    Ptag_stall:                                     0.011%
    SaveOpQ_stall:                                  0.000%
    PC_buffer_stall:                                0.433%
  Core Bound:                                      59.466%
    Divider:                                        0.001%
    FSU_stall:                                      0.000%
    Exe Ports Util:                                59.465%
      ALU BRU IssueQ Full:                         19.347%
      LS IssueQ Full:                               1.037%
      FSU IssueQ Full:                              0.000%
  Memory Bound:                                    35.199%
    L1 Bound:                                      18.958%
    L2 Bound:                                       0.045%
    Intra Cluster Remote L2 Bound:                  0.571%
    Local LLC Bound:                                0.726%
    Inter Cluster Local LLC Bound:                  7.886%
    Intra Chip Remote LLC Bound:                   -3.371%
    Inter Chip Remote LLC Bound:                    8.203%
    Intra Chip Local DDR Bound:                     1.267%
    Intra Chip Remote DDR Bound:                    0.705%
    Inter Chip Remote DDR Bound:                    0.210%
    Store Bound:                                    0.000%
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
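
The summary can be post-processed with a short script. The following is a minimal Python sketch, not part of Perf or the Tuning Kit tool; the function name parse_topdown and the shortened excerpt are assumptions made for illustration. It extracts the "name: value" pairs, recomputes the instructions per cycle (IPC) from the two totals, and identifies the largest of the four top-level categories.

```python
import re

# Excerpt of the summary above; only the lines this sketch needs.
SUMMARY = """\
Total Execution Instruction:                    11705799905
Total Execution Cycles:                          9265629085
Front-End Bound:                                   61.990%
Bad Speculation:                                        1.687%
Retiring:                                             31.584%
Back-End Bound:                                      4.739%
"""

# Matches "<name>: <number>" with an optional trailing percent sign.
LINE_RE = re.compile(r"^\s*(?P<name>[^:]+):\s*(?P<value>-?\d+(?:\.\d+)?)%?\s*$")


def parse_topdown(text):
    """Return {metric name: numeric value} for every 'name: value' line."""
    metrics = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            metrics[match.group("name").strip()] = float(match.group("value"))
    return metrics


metrics = parse_topdown(SUMMARY)

# IPC = retired instructions / elapsed cycles.
ipc = metrics["Total Execution Instruction"] / metrics["Total Execution Cycles"]

top_level = {
    name: metrics[name]
    for name in ("Front-End Bound", "Bad Speculation", "Retiring", "Back-End Bound")
}
dominant = max(top_level, key=top_level.get)

print(f"IPC = {ipc:.3f}")
print(f"Top-level sum = {sum(top_level.values()):.3f}%")
print(f"Dominant category: {dominant}")
```

For this sample, the recomputed IPC agrees with the reported 1.263, the four top-level categories add up to 100%, and front-end bound is the largest category, so the front end is the first place to look.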

Procedure

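Front-end bound refers to pipeline slots that go unused because the front end (instruction fetch, branch prediction, and decode) cannot supply instructions to the back end fast enough, for example because of iTLB or iCache misses.
Back-end bound refers to pipeline slots that stall because back-end resources are unavailable. Because this part is affected by program instructions more than the other parts, it is divided further. In the sample report, resource bound covers stalls on pipeline resources such as the ROB, and the two main categories are as follows. Core bound: execution is limited by the capability of processing the microinstructions, for example by busy execution ports, full issue queues, or the divider. Memory bound: execution is waiting for memory accesses. The memory here includes the CPU L1–L3 caches and main memory (DDR).
Bad speculation refers to pipeline slots wasted on instructions that are issued but never retired, mainly because of incorrect branch prediction and machine clears (pipeline flushes).
Retiring refers to pipeline slots used by instructions that complete and retire, that is, useful work. A higher retiring ratio generally indicates better pipeline utilization.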
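
In the generic top-down methodology, the four categories are shares of the pipeline issue slots (pipeline width multiplied by cycles). The following Python sketch shows that accounting under stated assumptions: the counter names and the numbers are hypothetical, and the exact hardware events behind the report above are not given in this document.

```python
def topdown_level1(total_slots, fetch_bubble_slots, issued_slots,
                   retired_slots, recovery_slots=0):
    """Split issue slots into the four top-level categories.

    All arguments are hypothetical counters expressed in slots:
      total_slots        - pipeline width * cycles
      fetch_bubble_slots - slots the front end left empty
      issued_slots       - slots filled with issued micro-operations
      retired_slots      - slots whose micro-operations retired
      recovery_slots     - slots lost while recovering from a flush
    """
    frontend = fetch_bubble_slots / total_slots
    bad_speculation = (issued_slots - retired_slots + recovery_slots) / total_slots
    retiring = retired_slots / total_slots
    # Whatever remains was stalled on back-end resources.
    backend = 1.0 - (frontend + bad_speculation + retiring)
    return {
        "Front-End Bound": frontend,
        "Bad Speculation": bad_speculation,
        "Retiring": retiring,
        "Back-End Bound": backend,
    }


# Hypothetical 4-wide pipeline observed for 1000 cycles (4000 slots).
shares = topdown_level1(total_slots=4 * 1000, fetch_bubble_slots=2400,
                        issued_slots=1400, retired_slots=1300)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Because back-end bound is computed as the remainder, the four shares always add up to 100%, which matches the way the top-level values in the sample report add up.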

If the iTLB miss ratio under front-end bound is high, you can use HugePages to reduce it. For details, see Adjusting the Memory Page Size.
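
Before and after changing the page size, it can help to confirm what the kernel reports. The following Python sketch parses text in the format of the Linux /proc/meminfo file and picks out the huge page fields. The helper name hugepage_info and the sample values are assumptions made for illustration; on a real system, pass in the contents of /proc/meminfo instead.

```python
def hugepage_info(meminfo_text):
    """Extract huge page fields from /proc/meminfo-style text.

    Returns a dict of integers. Page counts are plain numbers;
    Hugepagesize and AnonHugePages are in kB, as printed by the kernel.
    """
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize", "AnonHugePages")
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if key in wanted and fields:
            info[key] = int(fields[0])
    return info


# Hypothetical sample; real values come from open("/proc/meminfo").read().
sample = """\
MemTotal:       32768000 kB
AnonHugePages:    204800 kB
HugePages_Total:     512
HugePages_Free:      500
Hugepagesize:       2048 kB
"""

info = hugepage_info(sample)
used_pages = info["HugePages_Total"] - info["HugePages_Free"]
print(f"Huge pages in use: {used_pages} x {info['Hugepagesize']} kB")
```

If HugePages_Total is 0, or the number of pages in use does not change while the application runs, the application is probably not using the configured huge pages, and the iTLB miss ratio is unlikely to improve.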

If the branch prediction overhead is large, you can use the compiler parameter -funroll-loops to reduce the number of loop branches. For details, see Tuning the Compiler. Alternatively, you can rearrange the instructions so that the jump instructions are spread across different memory areas as much as possible. In addition, you can use profile-guided optimization (PGO), also called feedback-directed optimization (FDO), to recompile the application with profile data from a representative run, which improves the branch prediction accuracy.

The overhead percentage of the core bound can be reduced by increasing the number of computing cores. For the memory bound overhead, if the L1 and L2 cache miss ratios are high, you can update the compiler and use compilation optimization parameters to reduce the number of instructions. If main memory is the bottleneck, you can improve the performance by increasing the memory bandwidth or using memory with a higher frequency and lower latency.
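
To decide which of these memory measures applies, it helps to rank the memory bound sub-items. The following Python sketch does this for the values in the sample report. The function name rank_components is an assumption made for illustration, and the values are copied as reported, including the negative one.

```python
# Memory bound sub-items from the sample report, in percent.
MEMORY_BOUND = {
    "L1 Bound": 18.958,
    "L2 Bound": 0.045,
    "Intra Cluster Remote L2 Bound": 0.571,
    "Local LLC Bound": 0.726,
    "Inter Cluster Local LLC Bound": 7.886,
    "Intra Chip Remote LLC Bound": -3.371,
    "Inter Chip Remote LLC Bound": 8.203,
    "Intra Chip Local DDR Bound": 1.267,
    "Intra Chip Remote DDR Bound": 0.705,
    "Inter Chip Remote DDR Bound": 0.210,
    "Store Bound": 0.000,
}


def rank_components(components, top_n=3):
    """Return the top_n (name, percent) pairs, largest first."""
    ranked = sorted(components.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


top = rank_components(MEMORY_BOUND)
for name, percent in top:
    print(f"{name}: {percent:.3f}%")
```

In this sample, L1 bound ranks first, followed by the inter-chip and inter-cluster LLC items, while the DDR items are small. This suggests that cache behavior and cross-chip accesses deserve attention before a memory hardware upgrade.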
