
Viewing Analysis Results

Prerequisites

An HPC application analysis task has been created and the analysis is complete.

Procedure

  1. In the System Profiler area on the left, click the name of the target analysis task.

    The node list is displayed.

  2. Click the name of the target node to view the analysis results.
    • Click the node name. The Summary tab page is displayed by default, as shown in Figure 1. Table 1 describes the parameters.

      You can click Know-how under Tuning Suggestions or in the lower right corner to see reference tuning operations.

      Figure 1 Summary
      Table 1 Parameters on the Summary tab page

      Parameter

      Description

      Elapsed Time

      Execution time of an application.

      Serial Time

      Serial running time of an application.

      Parallel Time

      Parallel running time of an application.

      Imbalance

      Imbalanced portion of the application's running time.

      CPU Utilization

      CPU usage, that is, the ratio of CPU time to the total running time.

      OpenMP Team Usage

      Usage of the OpenMP Team.

      Function

      Invoked functions.

      Module

      Invoked module.

      CPU Time (s)

      CPU usage time.

      Inst Retired

      Number of retired instructions.

      Parallel region

      Parallel region.

      Potential Gain (s)

      Difference between the actual duration and the theoretical duration.

      Imbalance Ratio (%)

      Percentage of the application's running time that is imbalanced.

      Average Time (ms)

      Average running time.

      CPI

      Ratio of CPU cycles to retired instructions, indicating the number of clock cycles consumed by each instruction.

      Effective Utilization

      CPU usage of threads performing effective work.

      Spinning

      CPU usage consumed by threads waiting on spinlocks.

      Overhead

      CPU usage of other overheads.

      Instruction Retired

      Number of retired instructions.

      MPI Wait Rate

      Percentage of time spent in MPI blocking functions.

      Communication

      Percentage of cluster communication time in total communication time.

      Point to point

      Percentage of time spent on the point-to-point communication function.

      Collective

      Percentage of time spent on MPI collective functions.

      Synchronization

      Percentage of time spent on the synchronization function.
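
      The CPI metric above is defined as CPU cycles divided by retired instructions. A minimal numeric sketch of that ratio; the function name and sample counter values are illustrative only:

```python
# Sketch of the CPI definition from the Summary table:
# CPI = CPU cycles / retired instructions.
def cpi(cpu_cycles: int, inst_retired: int) -> float:
    """Average number of clock cycles consumed per retired instruction."""
    return cpu_cycles / inst_retired

# Illustrative counters: 8e9 cycles over 4e9 retired instructions.
print(cpi(8_000_000_000, 4_000_000_000))  # 2.0 cycles per instruction
```

A lower CPI generally indicates better instruction throughput on the same core.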

      Table 2 Parameters in the Hotspots area

      Parameter

      Description

      Grouping Mode

      By default, Function is displayed. You can also select Module, parallel-region, or barrier-to-barrier-segment.

      function

      Invoked functions.

      module

      Invoked module.

      parallel-region

      Parallel region.

      barrier-to-barrier-segment

      Special stand-alone section.

      in Loop

      Loop data. This parameter is displayed only when Function is selected for Grouping Mode.

      CPU (%)

      CPU usage.

      CPU (s)

      CPU time.

      Spin (s)

      CPU time spent waiting on spinlocks.

      Overhead (s)

      CPU time occupied by other overheads.

      CPI

      Ratio of CPU cycles to retired instructions, indicating the number of clock cycles consumed by each instruction.

      Ret (%)

      CPU microarchitecture execution efficiency. The calculation formula is INST_RETIRED / (4 x CPU_CYCLES).

      Back (%)

      Percentage of CPU pipeline execution pauses caused by insufficient back-end resources such as core execution units and memory.

      Mem (%)

      Percentage of CPU pipeline execution pauses caused by memory access latency.

      L1 (%)

      Percentage of CPU pipeline execution pauses caused by L1 cache hits.

      L2 (%)

      Percentage of CPU pipeline execution pauses caused by L2 cache hits.

      L3/M (%)

      Percentage of CPU pipeline execution pauses caused by L2 cache misses.

      Core (%)

      Percentage of CPU pipeline execution pauses caused by saturated core execution resources.

      SIMD (%)

      Percentage of SIMD instructions.

      Front (%)

      Percentage of CPU pipeline execution pauses caused by front-end components.

      Spec (%)

      Percentage of CPU pipeline execution pauses caused by mis-speculated branch execution.

      Instr

      Number of instructions.
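
      The Ret (%) formula given above, INST_RETIRED / (4 x CPU_CYCLES), can be sketched as follows. The function name and sample counter values are illustrative; the factor of 4 corresponds to the four issue slots per cycle assumed by the formula:

```python
# Sketch of the Ret (%) formula from Table 2:
# Ret (%) = INST_RETIRED / (4 * CPU_CYCLES) * 100,
# i.e. retired slots as a share of total issue slots on a 4-wide core.
def ret_percent(inst_retired: int, cpu_cycles: int) -> float:
    return inst_retired / (4 * cpu_cycles) * 100

# Illustrative counters: 4e9 retired instructions in 2e9 cycles.
print(ret_percent(4_000_000_000, 2_000_000_000))  # 50.0
```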

      Table 3 Parameters in the Memory Bandwidth area

      Parameter

      Description

      Memory Bandwidth

      Average DRAM Bandwidth

      Average DRAM bandwidth.

      Read Bandwidth

      Average read bandwidth.

      Write Bandwidth

      Average write bandwidth.

      Intra-Socket Bandwidth

      Bandwidth of a socket.

      Cross-Socket Bandwidth

      Cross-socket bandwidth.

      L3 By-Pass Rate

      L3 bypass rate.

      L3 Miss Rate

      L3 miss rate.

      L3 Usage

      L3 cluster usage.

      Command distribution (hover your mouse pointer over the question mark next to a parameter to view details)

      Table 4 Parameters in the HPC Top-Down and PMU Events areas

      Parameter

      Description

      HPC Top-Down

      Event Name

      Name of the top-down event.

      Event Percentage

      Proportion of the top-down event.

      Number of original PMU events

      Miss Events

      Name of the PMU event.

      Count

      Number of PMU events.

      Table 5 MPI runtime metrics

      Parameter

      Description

      Grouping mode

      Filter type. By default, function is selected. You can also select send-type, recv-type, mpi-comm, caller, send-size, or recv-size.

      function

      Invoked functions.

      MPI Rank

      Logical working unit.

      Wait Rate (%)

      Percentage of time spent in MPI blocking functions.

      P2P Comm (%)

      Percentage of time spent on the MPI point-to-point communication function.

      Coll Comm (%)

      Percentage of time spent on MPI collective functions.

      Sync (%)

      Percentage of time spent on the MPI synchronization function.

      Single I/O (%)

      Percentage of time spent on the MPI_File_read and MPI_File_write functions.

      Coll I/O (%)

      Percentage of time spent on the MPI_File_read_all and MPI_File_write_all functions.

      Avg Time

      Average latency.

      Call Count

      Number of calls.

      Data Size (bytes)

      Size of transmitted data.

      Send data type

      Type of sent data.

      Recv data type

      Type of received data.

      Sent

      Working unit that sends data.

      Received

      Working unit that receives data.
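
      The time-percentage metrics in this table (Wait Rate, P2P Comm, Coll Comm, Sync, Single I/O, Coll I/O) can be pictured as per-category time over total time. This is a hedged sketch: the category keys and the time-over-elapsed-time definition are assumptions, not the tool's documented implementation:

```python
# Hypothetical sketch: derive MPI time-percentage metrics from
# per-category time (seconds) and total elapsed time (seconds).
# Category names are illustrative, not the tool's internal keys.
def mpi_percentages(elapsed_s: float, times_s: dict) -> dict:
    """Return each category's share of elapsed time, in percent."""
    return {name: round(t / elapsed_s * 100, 2) for name, t in times_s.items()}

breakdown = mpi_percentages(
    100.0,
    {"wait": 12.5, "p2p": 30.0, "collective": 5.0, "sync": 2.5},
)
print(breakdown)  # {'wait': 12.5, 'p2p': 30.0, 'collective': 5.0, 'sync': 2.5}
```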

      Table 6 OpenMP runtime metrics

      Parameter

      Description

      Parallel region

      Parallel region.

      Barrier-to-barrier segment

      Special stand-alone section.

      Potential Gain (s)

      Difference between the real and ideal wall time of a parallel region.

      Elapsed Time (s)

      Wall time of the parallel region.

      Imbalance (s)

      Wall time lost while threads wait for each other at the end of a parallel region.

      Imb (%)

      Ratio of the imbalanced execution time to the total execution time.

      CPU Util (%)

      CPU usage in the parallel region.

      Avg (ms)

      Average latency.

      Count

      Number of calls.

      Lock Cont (s)

      CPU time that worker threads spend contending for locks.

      Creation (s)

      Overhead of a parallel work assignment.

      Scheduling (s)

      OpenMP runtime scheduler overhead on a parallel work assignment for working threads.

      Tasking (s)

      Runtime overhead of assigning tasks.

      Reduction (s)

      Runtime overhead on performing reduction operations.

      Atomics (s)

      Runtime overhead on performing atomic operations.
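
      Based on the definitions above (Potential Gain as the difference between real and ideal wall time, and Imb (%) as imbalanced time over total time), a minimal sketch; the exact formulas used by the tool are assumptions:

```python
# Hypothetical sketch of two OpenMP metrics from Table 6.
def potential_gain(elapsed_s: float, ideal_s: float) -> float:
    """Potential Gain (s): real minus ideal wall time of a parallel region."""
    return elapsed_s - ideal_s

def imbalance_percent(imbalance_s: float, elapsed_s: float) -> float:
    """Imb (%): imbalanced time as a share of total elapsed time."""
    return imbalance_s / elapsed_s * 100

# Illustrative values: 12 s real wall time versus 9 s ideal.
print(potential_gain(12.0, 9.0))     # 3.0 s
print(imbalance_percent(3.0, 12.0))  # 25.0 %
```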

    • Click the MPI Node tab to view the execution information about each node task. The top N MPI hot nodes in a cluster with more than 200,000 cores can be analyzed. See Figure 2. Table 7 describes the parameters.
      Figure 2 MPI Node page
      Table 7 MPI node parameters

      Parameter

      Description

      Node IP Address

      IP addresses of all nodes.

      CPU Usage (%)

      CPU usage of the node.

      CPI

      Ratio of CPU cycles to retired instructions, indicating the number of clock cycles consumed by each instruction.

      Average DRAM Bandwidth (GB/s)

      Average DRAM bandwidth.

      Intra-Socket Bandwidth (GB/s)

      Bandwidth of a socket.

      Cross-Socket Bandwidth (GB/s)

      Cross-socket bandwidth.

      MPI wait rate

      Percentage of time spent in MPI blocking functions.

      memused (KB)

      Used memory of a node.

      memfree (KB)

      Available memory of a node.

      rd(KB)/s

      Bandwidth of reading data from the device per second.

      wr(KB)/s

      Bandwidth of writing data to the device per second.

      rxkB/s

      Total data received per second, in KB.

      txkB/s

      Total data transmitted per second, in KB.

      Average Power (W)

      Average power of the system.
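
      The per-second rates in this table (rd(KB)/s, wr(KB)/s, rxkB/s, txkB/s) are presumably derived from counter samples taken over a collection interval; a minimal sketch under that assumption, with illustrative names and values:

```python
# Hypothetical sketch: derive a KB/s rate from two samples of a
# cumulative counter taken `interval_s` seconds apart.
def kb_per_second(start_kb: float, end_kb: float, interval_s: float) -> float:
    """Average rate over the sampling interval, in KB/s."""
    return (end_kb - start_kb) / interval_s

# Illustrative samples: counter grows from 1000 KB to 4000 KB in 10 s.
print(kb_per_second(1000.0, 4000.0, 10.0))  # 300.0 KB/s
```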

    • Figure 3 shows the OpenMP timeline tab page. Table 8 describes the parameters.
      • You can use "←" and "→" to switch between threads. Key threads are marked with an icon. Drag the time axis to view the data in the corresponding time range, or select the key threads you want to view from the drop-down list.
      • A maximum of 10 hot call stacks can be displayed.
      Figure 3 OpenMP timeline
      Table 8 Parameters on the OpenMP timeline tab page

      Parameter

      Description

      TID

      Thread ID.

      Region Type

      Region type of a thread.

      Start Time

      Start time of a phase for a thread.

      Duration

      Duration of a phase for a thread.

      CPI

      Ratio of CPU cycles to retired instructions, indicating the number of clock cycles consumed by each instruction.

      Instructions Retired

      Total number of instructions.

      Callstack

      Name of the call stack.

      Call Times

      Number of times that the stack is called.

      Invoking Ratio (%)

      Percentage of calls to this stack among all call stacks.

      Event Name

      Name of the top-down event.

      Event Ratio (%)

      Proportion of the top-down event.

    • Figure 4 shows the MPI timeline tab page. Table 9 describes the parameters.

      If you selected RDMA and Shared storage when creating the analysis task, you can click the corresponding icons to view related data. You can then click a time point in the line chart to view the details.

      Figure 4 MPI timeline
      Table 9 MPI timeline parameters

      Parameter

      Description

      Basic Rank Information

      rank ID

      ID of the selected rank.

      Start Time

      Start time of a phase for a thread.

      Duration

      Duration of a phase for a thread.

      CPI

      Ratio of CPU cycles to retired instructions, indicating the number of clock cycles consumed by each instruction.

      Instructions Retired

      Total number of instructions.

      Cluster Communication Type

      Cluster communication type.

      Communicator Root

      Communicator root.

      Communicator Name

      Communicator name.

      Communication Data Volume

      Amount of data sent and received during communication.

      Communicator Members

      Number of communicator members.

      Communicator Member

      Specific communicator member.

      Rank Invoking Information

      Callstack

      Name of the call stack.

      Call Times

      Number of times that the stack is called.

      Invoking Ratio (%)

      Percentage of calls to this stack among all call stacks.

      Event Name

      Name of the top-down event.

      Event Ratio (%)

      Proportion of the top-down event.

      RDMA Information

      Node IP Address

      IP address of the RDMA.

      Collection Time

      Collection time of the RDMA data.

      Receive

      Amount of data received at the current time point.

      Send

      Amount of data sent at the current time point.

      Shared Storage Information

      Node IP Address

      IP address of the shared storage.

      Collection Time

      Collection time of the current shared storage data.

      Receive

      Amount of data received at the current time point.

      Send

      Amount of data sent at the current time point.

    • If you select Refined analysis when creating an HPC application analysis task, you can view the Communication Heatmap tab page. See Figure 5.
      • By default, Rank to Rank is selected for Statistical Object, Data_Size for Statistical Indicator, Point to Point for Communication Type, and the first item in the drop-down list for Communicator.
      • You can select any other statistical object (Node to Node), statistical metric (Latency), communication type (Cluster Communication), and communicator from the drop-down lists. If you select Latency for Statistical Indicator, the communication type can only be Point to Point.
      • The data volume of (rank i, rank j) is the data sent by rank i to rank j plus the data received by rank i from rank j.
      • Move the mouse pointer to select an area in the left part of the following figure to view its details in the right part. You can click the zoom icons or scroll the mouse wheel to zoom in or out.
      • In the dialog box for selecting a communicator, click the search icon to search for a communicator name or member, click the sort icon to sort the communicator members, and click View Details to see the communicator information.
      Figure 5 Communication heatmap

      Click the drop-down list of Communicator to switch or filter the communicator information to be viewed.

      Figure 6 Selecting a communicator

      Select Node To Node for Statistical Object to view the rank information. See Figure 7. Major metrics are Local Percentage, Cross-DIE Percentage, and Cross-chip Percentage.

      Figure 7 Communication heatmap (node-to-node)
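
      The heatmap's data-volume rule stated above (sent plus received per rank pair) can be sketched as follows. With per-pair send counters the resulting matrix is symmetric; all names and values are illustrative:

```python
# Sketch of the communication-heatmap rule: the volume for (rank_i, rank_j)
# is the data rank_i sent to rank_j plus the data rank_i received from
# rank_j, i.e. sent[i][j] + sent[j][i].
def heatmap_volume(sent):
    """sent[i][j]: bytes rank i sent to rank j. Returns the symmetric volume matrix."""
    n = len(sent)
    return [[sent[i][j] + sent[j][i] for j in range(n)] for i in range(n)]

# Illustrative 3-rank send matrix.
sent = [[0, 10, 0],
        [5, 0, 20],
        [0, 0, 0]]
print(heatmap_volume(sent))  # [[0, 15, 0], [15, 0, 20], [0, 20, 0]]
```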
    • Figure 8 shows the Top N Inefficient Communications tab page. Table 10 describes the parameters.

      Select the top N ranks by communication percentage, and click the Send or Receive color block to view the rank and communication delay details.

      Figure 8 Top N inefficient communications
      Table 10 Parameters on the Top N Inefficient Communications tab page

      Parameter

      Description

      Rank Details

      rankID

      ID of the selected rank.

      Communication Mode

      Current communication mode.

      region

      Region where the current communication resides.

      Start Time

      Start time of the communication.

      End Time

      End time of the communication.

      Duration

      Duration of the communication.

      Communication Delay Details

      rank-rank

      Rank communication details.

      Start Time

      Start time of the communication.

      Communication Delay

      Delay of the communication.

    • Click the Task Information tab to view the detailed configuration and sampling information about the task on the current node.

      If the task fails to be executed, the failure cause is displayed on the Task Information tab page.

      If some data fails to be collected but the overall task execution is not affected, you can view the exception message in Exception Information.

      Collection End Cause displays the reason why the data collection of the current task ends, for example, "Task collection times up" or "File size reaches the collection limit."