
Viewing Analysis Results

Prerequisites

An HPC application analysis task has been created and the analysis is complete.

Procedure

  1. In the System Profiler area on the left, click the name of the target analysis task.

    The node list is displayed.

  2. Click the name of the target node to view the analysis results.
    • Click the node name. The Summary tab page is displayed by default, as shown in Figure 1. Table 1 describes the parameters.

      You can click Know-how under Tuning Suggestions or in the lower right corner to see reference tuning operations.

      Figure 1 Summary
      Table 1 Parameters on the Summary tab page

      Parameter

      Description

      Elapsed Time

      Execution time of an application.

      Serial Time

      Serial running time of an application.

      Parallel Time

      Parallel running time of an application.

      Imbalance

      Imbalanced portion of the application's running time.

      CPU Utilization

      CPU usage, that is, the ratio of CPU time to the total running time.

      OpenMP Team Usage

      Usage of the OpenMP Team.

      Function

      Invoked functions.

      Module

      Invoked module.

      CPU Time (s)

      CPU usage time.

      Inst Retired

      Number of retired instructions.

      Parallel region

      Parallel region.

      Potential Gain (s)

      Difference between the actual duration and the theoretical duration.

      Imbalance Ratio (%)

      Percentage of the application's running time that is imbalanced.

      Average Time (ms)

      Average running time.

      CPI

      Ratio of CPU cycles to retired instructions, indicating the number of clock cycles consumed by each instruction.

      Effective Utilization

      CPU usage of threads performing effective work.

      Spinning

      CPU usage consumed by threads waiting on spinlocks.

      Overhead

      CPU usage of other overheads.

      Instruction Retired

      Number of retired instructions.

      MPI Wait Rate

      Percentage of time spent in MPI blocking functions.

      Communication

      Percentage of cluster communication time in total communication time.

      Point to point

      Percentage of time spent on the point-to-point communication function.

      Collective

      Percentage of time spent on MPI collective functions.

      Synchronization

      Percentage of time spent on the synchronization function.
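
      The CPI metric above is defined as CPU cycles divided by retired instructions. A minimal numeric sketch of that ratio; the function name and sample counter values are illustrative only:

```python
# Sketch of the CPI definition from the Summary table:
# CPI = CPU cycles / retired instructions.
def cpi(cpu_cycles: int, inst_retired: int) -> float:
    """Average number of clock cycles consumed per retired instruction."""
    return cpu_cycles / inst_retired

# Illustrative counters: 8e9 cycles over 4e9 retired instructions.
print(cpi(8_000_000_000, 4_000_000_000))  # 2.0 cycles per instruction
```

A lower CPI generally indicates better instruction throughput on the same core.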

      Table 2 Parameters in the Hotspots area

      Parameter

      Description

      Grouping Mode

      By default, Function is displayed. You can also select Module, parallel-region, or barrier-to-barrier-segment.

      function

      Invoked functions.

      module

      Invoked module.

      parallel-region

      Parallel region.

      barrier-to-barrier-segment

      Special stand-alone section.

      in Loop

      Loop data. This parameter is displayed only when Function is selected for Grouping Mode.

      CPU (%)

      CPU usage.

      CPU (s)

      CPU time.

      Spin (s)

      CPU time spent waiting on spinlocks.

      Overhead (s)

      CPU time occupied by other overheads.

      CPI

      Ratio of CPU cycles to retired instructions, indicating the number of clock cycles consumed by each instruction.

      Ret (%)

      CPU microarchitecture execution efficiency. The calculation formula is INST_RETIRED / (4 x CPU_CYCLES).

      Back (%)

      Percentage of CPU pipeline execution pauses caused by insufficient back-end resources such as core execution units and memory.

      Mem (%)

      Percentage of CPU pipeline execution pauses caused by memory access latency.

      L1 (%)

      Percentage of CPU pipeline execution pauses caused by L1 cache hits.

      L2 (%)

      Percentage of CPU pipeline execution pauses caused by L2 cache hits.

      L3/M (%)

      Percentage of CPU pipeline execution pauses caused by L2 cache misses.

      Core (%)

      Percentage of CPU pipeline execution pauses caused by saturated core execution resources.

      SIMD (%)

      Percentage of SIMD instructions.

      Front (%)

      Percentage of CPU pipeline execution pauses caused by front-end components.

      Spec (%)

      Percentage of CPU pipeline execution pauses caused by mis-speculated branch execution.

      Instr

      Number of instructions.
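
      The Ret (%) formula given above, INST_RETIRED / (4 x CPU_CYCLES), can be sketched as follows. The function name and sample counter values are illustrative; the factor of 4 corresponds to the four issue slots per cycle assumed by the formula:

```python
# Sketch of the Ret (%) formula from Table 2:
# Ret (%) = INST_RETIRED / (4 * CPU_CYCLES) * 100,
# i.e. retired slots as a share of total issue slots on a 4-wide core.
def ret_percent(inst_retired: int, cpu_cycles: int) -> float:
    return inst_retired / (4 * cpu_cycles) * 100

# Illustrative counters: 4e9 retired instructions in 2e9 cycles.
print(ret_percent(4_000_000_000, 2_000_000_000))  # 50.0
```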

      Table 3 Parameters in the Memory Bandwidth area

      Parameter

      Description

      Memory Bandwidth

      Average DRAM Bandwidth

      Average DRAM bandwidth.

      Read Bandwidth

      Average read bandwidth.

      Write Bandwidth

      Average write bandwidth.

      Intra-Socket Bandwidth

      Bandwidth of a socket.

      Cross-Socket Bandwidth

      Cross-socket bandwidth.

      L3 By-Pass Rate

      L3 bypass rate.

      L3 Miss Rate

      L3 miss rate.

      L3 Usage

      L3 cluster usage.

      Command distribution (hover your mouse pointer over the question mark next to a parameter to view details)

      Table 4 Parameters in the HPC Top-Down and PMU Events areas

      Parameter

      Description

      HPC Top-Down

      Event Name

      Name of the top-down event.

      Event Percentage

      Proportion of the top-down event.

      Number of original PMU events

      Miss Events

      Name of the PMU event.

      Count

      Number of PMU events.

      Table 5 MPI runtime metrics

      Parameter

      Description

      Grouping mode

      Filter type. By default, function is selected. You can also select send-type, recv-type, mpi-comm, caller, send-size, or recv-size.

      function

      Invoked functions.

      MPI Rank

      Logical working unit.

      Wait Rate (%)

      Percentage of time spent in MPI blocking functions.

      P2P Comm (%)

      Percentage of time spent on the MPI point-to-point communication function.

      Coll Comm (%)

      Percentage of time spent on MPI collective functions.

      Sync (%)

      Percentage of time spent on the MPI synchronization function.

      Single I/O (%)

      Percentage of time spent on the MPI_File_read and MPI_File_write functions.

      Coll I/O (%)

      Percentage of time spent on the MPI_File_read_all and MPI_File_write_all functions.

      Avg Time

      Average latency.

      Call Count

      Number of calls.

      Data Size (bytes)

      Size of transmitted data.

      Send data type

      Type of sent data.

      Recv data type

      Type of received data.

      Sent

      Working unit that sends data.

      Received

      Working unit that receives data.
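
      The time-percentage metrics in this table (Wait Rate, P2P Comm, Coll Comm, Sync, Single I/O, Coll I/O) can be pictured as per-category time over total time. This is a hedged sketch: the category keys and the time-over-elapsed-time definition are assumptions, not the tool's documented implementation:

```python
# Hypothetical sketch: derive MPI time-percentage metrics from
# per-category time (seconds) and total elapsed time (seconds).
# Category names are illustrative, not the tool's internal keys.
def mpi_percentages(elapsed_s: float, times_s: dict) -> dict:
    """Return each category's share of elapsed time, in percent."""
    return {name: round(t / elapsed_s * 100, 2) for name, t in times_s.items()}

breakdown = mpi_percentages(
    100.0,
    {"wait": 12.5, "p2p": 30.0, "collective": 5.0, "sync": 2.5},
)
print(breakdown)  # {'wait': 12.5, 'p2p': 30.0, 'collective': 5.0, 'sync': 2.5}
```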

      Table 6 OpenMP runtime metrics

      Parameter

      Description

      Parallel region

      Parallel region.

      Barrier-to-barrier segment

      Special stand-alone section.

      Potential Gain (s)

      Difference between the real and ideal wall time of a parallel region.

      Elapsed Time (s)

      Wall time of the parallel region.

      Imbalance (s)

      Wall time lost while threads wait for each other at the end of a parallel region.

      Imb (%)

      Ratio of the imbalanced execution time to the total execution time.

      CPU Util (%)

      CPU usage in the parallel region.

      Avg (ms)

      Average latency.

      Count

      Number of calls.

      Lock Cont (s)

      CPU time that worker threads spend contending for locks.

      Creation (s)

      Overhead of a parallel work assignment.

      Scheduling (s)

      OpenMP runtime scheduler overhead on a parallel work assignment for working threads.

      Tasking (s)

      Runtime overhead of assigning tasks.

      Reduction (s)

      Runtime overhead on performing reduction operations.

      Atomics (s)

      Runtime overhead on performing atomic operations.
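
      Based on the definitions above (Potential Gain as the difference between real and ideal wall time, and Imb (%) as imbalanced time over total time), a minimal sketch; the exact formulas used by the tool are assumptions:

```python
# Hypothetical sketch of two OpenMP metrics from Table 6.
def potential_gain(elapsed_s: float, ideal_s: float) -> float:
    """Potential Gain (s): real minus ideal wall time of a parallel region."""
    return elapsed_s - ideal_s

def imbalance_percent(imbalance_s: float, elapsed_s: float) -> float:
    """Imb (%): imbalanced time as a share of total elapsed time."""
    return imbalance_s / elapsed_s * 100

# Illustrative values: 12 s real wall time versus 9 s ideal.
print(potential_gain(12.0, 9.0))     # 3.0 s
print(imbalance_percent(3.0, 12.0))  # 25.0 %
```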

    • Click the MPI Node tab to view the execution information about each node task. The top N MPI hot nodes in a cluster with more than 200,000 cores can be analyzed. See Figure 2. Table 7 describes the parameters.
      Figure 2 MPI Node page
      Table 7 MPI node parameters

      Parameter

      Description

      Node IP Address

      IP addresses of all nodes.

      CPU Usage (%)

      CPU usage of the node.

      CPI

      Ratio of CPU cycles to retired instructions, indicating the number of clock cycles consumed by each instruction.

      Average DRAM Bandwidth (GB/s)

      Average DRAM bandwidth.

      Intra-Socket Bandwidth (GB/s)

      Bandwidth of a socket.

      Cross-Socket Bandwidth (GB/s)

      Cross-socket bandwidth.

      MPI wait rate

      Percentage of time spent in MPI blocking functions.

      memused (KB)

      Used memory of a node.

      memfree (KB)

      Available memory of a node.

      rd(KB)/s

      Bandwidth of reading data from the device per second.

      wr(KB)/s

      Bandwidth of writing data to the device per second.

      rxkB/s

      Total data received per second, in KB.

      txkB/s

      Total data transmitted per second, in KB.

      Average Power (W)

      Average power of the system.
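
      The per-second rates in this table (rd(KB)/s, wr(KB)/s, rxkB/s, txkB/s) are presumably derived from counter samples taken over a collection interval; a minimal sketch under that assumption, with illustrative names and values:

```python
# Hypothetical sketch: derive a KB/s rate from two samples of a
# cumulative counter taken `interval_s` seconds apart.
def kb_per_second(start_kb: float, end_kb: float, interval_s: float) -> float:
    """Average rate over the sampling interval, in KB/s."""
    return (end_kb - start_kb) / interval_s

# Illustrative samples: counter grows from 1000 KB to 4000 KB in 10 s.
print(kb_per_second(1000.0, 4000.0, 10.0))  # 300.0 KB/s
```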

    • Figure 3 shows the OpenMP timeline tab page. Table 8 describes the parameters.
      • You can use "←" and "→" to switch between threads. Key threads are marked with an icon. Drag the time axis to view the data in the corresponding time range, or select the key threads you want to view from the drop-down list.
      • A maximum of 10 hot call stacks can be displayed.
      Figure 3 OpenMP timeline
      Table 8 Parameters on the OpenMP timeline tab page

      Parameter

      Description

      TID

      Thread ID.

      Region Type

      Region type of a thread.

      Start Time

      Start time of a phase for a thread.

      Duration

      Duration of a phase for a thread.

      CPI

      Ratio of CPU cycles to retired instructions, indicating the number of clock cycles consumed by each instruction.

      Instructions Retired

      Total number of instructions.

      Callstack

      Name of the call stack.

      Call Times

      Number of times that the stack is called.

      Invoking Ratio (%)

      Percentage of calls to this stack among all call stacks.

      Event Name

      Name of the top-down event.

      Event Ratio (%)

      Proportion of the top-down event.

    • Figure 4 shows the MPI timeline tab page. Table 9 describes the parameters.

      If you selected RDMA and Shared storage when creating the analysis task, you can click the corresponding icons to view related data. You can then click a time point in the line chart to view the details.

      Figure 4 MPI timeline
      Table 9 MPI timeline parameters

      Parameter

      Description

      Basic Rank Information

      rank ID

      ID of the selected rank.

      Start Time

      Start time of a phase for a thread.

      Duration

      Duration of a phase for a thread.

      CPI

      Ratio of CPU cycles to retired instructions, indicating the number of clock cycles consumed by each instruction.

      Instructions Retired

      Total number of instructions.

      Cluster Communication Type

      Cluster communication type.

      Communicator Root

      Communicator root.

      Communicator Name

      Communicator name.

      Communication Data Volume

      Amount of data sent and received during communication.

      Communicator Members

      Number of communicator members.

      Communicator Member

      Specific communicator member.

      Rank Invoking Information

      Callstack

      Name of the call stack.

      Call Times

      Number of times that the stack is called.

      Invoking Ratio (%)

      Percentage of calls to this stack among all call stacks.

      Event Name

      Name of the top-down event.

      Event Ratio (%)

      Proportion of the top-down event.

      RDMA Information

      Node IP Address

      IP address of the RDMA.

      Collection Time

      Collection time of the RDMA data.

      Receive

      Amount of data received at the current time point.

      Send

      Amount of data sent at the current time point.

      Shared Storage Information

      Node IP Address

      IP address of the shared storage.

      Collection Time

      Collection time of the current shared storage data.

      Receive

      Amount of data received at the current time point.

      Send

      Amount of data sent at the current time point.

    • If you select Refined analysis when creating an HPC application analysis task, you can view the Communication Heatmap tab page. See Figure 5.
      • By default, Rank to Rank is selected for Statistical Object, Data_Size for Statistical Indicator, Point to Point for Communication Type, and the first item in the drop-down list for Communicator.
      • You can select any other statistical object (Node to Node), statistical metric (Latency), communication type (Cluster Communication), and communicator from the drop-down lists. If you select Latency for Statistical Indicator, the communication type can only be Point to Point.
      • The data volume of (rank i, rank j) is the data sent by rank i to rank j plus the data received by rank i from rank j.
      • Move the mouse pointer to select an area in the left part of the following figure to view its details in the right part. You can click the zoom icons or scroll the mouse wheel to zoom in or out.
      • In the dialog box for selecting a communicator, click the search icon to search for a communicator name or member, click the sort icon to sort the communicator members, and click View Details to see the communicator information.
      Figure 5 Communication heatmap

      Click the drop-down list of Communicator to switch or filter the communicator information to be viewed.

      Figure 6 Selecting a communicator

      Select Node To Node for Statistical Object to view the rank information. See Figure 7. Major metrics are Local Percentage, Cross-DIE Percentage, and Cross-chip Percentage.

      Figure 7 Communication heatmap (node-to-node)
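
      The heatmap's data-volume rule stated above (sent plus received per rank pair) can be sketched as follows. With per-pair send counters the resulting matrix is symmetric; all names and values are illustrative:

```python
# Sketch of the communication-heatmap rule: the volume for (rank_i, rank_j)
# is the data rank_i sent to rank_j plus the data rank_i received from
# rank_j, i.e. sent[i][j] + sent[j][i].
def heatmap_volume(sent):
    """sent[i][j]: bytes rank i sent to rank j. Returns the symmetric volume matrix."""
    n = len(sent)
    return [[sent[i][j] + sent[j][i] for j in range(n)] for i in range(n)]

# Illustrative 3-rank send matrix.
sent = [[0, 10, 0],
        [5, 0, 20],
        [0, 0, 0]]
print(heatmap_volume(sent))  # [[0, 15, 0], [15, 0, 20], [0, 20, 0]]
```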
    • Figure 8 shows the Top N Inefficient Communications tab page. Table 10 describes the parameters.

      Select the top N ranks by communication percentage, and click the Send or Receive color block to view the rank and communication delay details.

      Figure 8 Top N inefficient communications
      Table 10 Parameters on the Top N Inefficient Communications tab page

      Parameter

      Description

      Rank Details

      rankID

      ID of the selected rank.

      Communication Mode

      Current communication mode.

      region

      Region where the current communication resides.

      Start Time

      Start time of the communication.

      End Time

      End time of the communication.

      Duration

      Duration of the communication.

      Communication Delay Details

      rank-rank

      Rank communication details.

      Start Time

      Start time of the communication.

      Communication Delay

      Delay of the communication.

    • Click the Task Information tab to view the detailed configuration and sampling information about the task on the current node.

      If the task fails to be executed, the failure cause is displayed on the Task Information tab page.

      If some data fails to be collected but the overall task execution is not affected, you can view the exception message in Exception Information.

      Collection End Cause displays the reason why the data collection of the current task ends, for example, "Task collection times up" or "File size reaches the collection limit."