Metric Description
CPU Metrics – Common Microarchitecture Metrics
- IPC: Instructions per cycle, which is the average number of instructions a CPU executes per clock cycle. It is a common measure of the CPU's execution efficiency.
- INSTRUCTIONS: Total number of instructions executed, reflecting the CPU's instruction processing throughput.
- MPKI: Misses per kilo instructions, which is the number of last-level cache misses per 1,000 instructions executed. A lower MPKI indicates better cache utilization.
- BPKI: Branch misses per kilo instructions, which is the number of branch mispredictions per 1,000 executed instructions. A lower BPKI indicates higher branch prediction efficiency.
- L1D MPKI: Number of L1 data cache misses per thousand instructions.
- L1I MPKI: Number of L1 instruction cache misses per thousand instructions.
- L2D MPKI: Number of L2 data cache misses per thousand instructions.
- L2I MPKI: Number of L2 instruction cache misses per thousand instructions.
- DTLB MPKI: Number of data translation lookaside buffer (DTLB) misses per thousand instructions. The translation lookaside buffer (TLB) caches virtual-to-physical address translations for data accesses. A DTLB miss triggers a page table walk, increasing data access latency.
- ITLB MPKI: Number of instruction translation lookaside buffer (ITLB) misses per thousand instructions. An ITLB miss delays instruction address translation and affects front-end instruction fetching.
- CPU-NUM: Number of CPU cores included in the statistics, reflecting the scope and parallelism of the performance data.
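The IPC and per-kilo-instruction metrics above are simple ratios over raw counter values. A minimal sketch (counter values are synthetic and illustrative, not tied to a specific PMU event source):

```python
# Derive IPC and MPKI from raw counter totals, e.g. as collected by a
# `perf stat`-style tool. Values below are synthetic.

def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: average instructions completed per clock."""
    return instructions / cycles

def mpki(misses: int, instructions: int) -> float:
    """Misses per 1,000 executed instructions."""
    return misses * 1000 / instructions

# Example: 2.0e9 instructions over 1.0e9 cycles with 4e6 cache misses.
print(ipc(2_000_000_000, 1_000_000_000))   # 2.0
print(mpki(4_000_000, 2_000_000_000))      # 2.0
```

The same `mpki` helper applies to any of the L1D/L1I/L2D/L2I/DTLB/ITLB variants, since they differ only in which miss counter is supplied.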
CPU Metrics – Topdown
- Retiring: Ratio of instructions successfully executed and committed to the total number of instructions, reflecting the CPU's effective execution efficiency.
- Frontend Bound: Performance bottleneck rate caused when the CPU front end (instruction fetch and decode) cannot supply instructions to the back end in a timely manner.
- Fetch Latency Bound: Percentage of time the front end stalls during instruction fetch due to instruction cache misses, ITLB misses, or branch target delays.
- Fetch Bandwidth Bound: Percentage of time the CPU front end is stalled because instruction fetch or decode bandwidth is insufficient to keep up with back-end demand.
- Bad Speculation: Percentage of cycles lost due to pipeline flushes caused by branch mispredictions, incorrect path fetching, or speculative execution failures.
- Branch Mispredicts: Percentage of cycles lost due to pipeline flushes and re-execution caused by branch prediction errors.
- Machine Clears: Percentage of performance loss caused by pipeline clearing triggered by non-branch reasons such as exceptions, memory sequence conflicts, and resource conflicts.
- Backend Bound: Percentage of performance loss caused by instructions that have been successfully sent to the backend but cannot be executed in a timely manner due to limited execution resources or data access.
- Core Bound: Percentage of performance loss caused by instructions in the computation phase due to insufficient execution resources, such as ALUs, FPUs, vector units, and port conflicts.
- Memory Bound: Percentage of performance loss due to data cache (L1/L2/L3) misses, TLB misses, or main memory access latency.
- CPU-NUM: Number of CPU cores included in the statistics, reflecting the scope and parallelism of the performance data.
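The four level-1 top-down categories partition pipeline slots, so they sum to 100%. A hedged sketch of the generic level-1 arithmetic (the actual PMU events that supply these slot counts are microarchitecture-specific, and the input numbers here are synthetic):

```python
# Generic level-1 top-down breakdown over pipeline slots: backend bound is
# whatever remains after retiring, bad speculation, and frontend bound.

def topdown_level1(retired, bad_spec, frontend_stalled, total_slots):
    retiring = retired * 100 / total_slots
    bad_speculation = bad_spec * 100 / total_slots
    frontend_bound = frontend_stalled * 100 / total_slots
    backend_bound = 100 - retiring - bad_speculation - frontend_bound
    return retiring, bad_speculation, frontend_bound, backend_bound

r, b, f, be = topdown_level1(600, 50, 150, 1000)
print(r, b, f, be)  # 60.0 5.0 15.0 20.0
```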
CPU Metrics – OS Metrics
- Context Switches: Number of times the CPU switches between processes or threads, reflecting the system's scheduling activity.
- Migrations: Number of times a process is migrated between CPU cores or NUMA nodes, reflecting the system's load balancing behavior.
- page-faults: Number of events that occur when a CPU accesses a memory page that is not currently mapped to physical memory. Page faults are classified into soft page faults, where the page is already in memory, and hard page faults, where the page must be loaded from a drive.
- CPU-NUM: Number of CPU cores included in the statistics, reflecting the scope and parallelism of the performance data.
CPU Metrics – INSTRUCTION
- Memory(%): Percentage of memory addressing and data access instructions relative to total instructions, reflecting the data access density of the workload.
- Load(%): Percentage of load (read) instructions relative to total instructions, reflecting resource consumption by read operations.
- Store(%): Percentage of store (write) instructions relative to total instructions, reflecting resource consumption by write operations.
- Scalar(%): Percentage of scalar instructions relative to the total instructions executed by the processor. A scalar instruction operates on a single data element at a time, in contrast to a vector instruction.
- Integer(%): Percentage of integer instructions relative to the total instructions executed by the processor. This type of instruction performs integer (fixed-point) operations, such as addition, subtraction, and bitwise operations, reflecting the integer computing intensity of the workload.
- Floating Point(%): Percentage of floating point instructions relative to the total instructions executed by the processor. These instructions perform floating point (decimal) operations and reflect the floating point computation intensity of the workload.
- Vector(%): Percentage of vector instructions relative to total instructions. A vector instruction performs the same operation on multiple data elements simultaneously, enabling data-level parallelism.
- Advanced SIMD(%): Percentage of vector instructions that use Arm Neon (advanced SIMD) relative to total instructions.
- SVE(+loads/stores)(%): Percentage of vector instructions that use the Arm Scalable Vector Extension (SVE) relative to total instructions.
- SME(retired)(%): Percentage of instructions that have been successfully completed (retired) by the processor and use the Arm Scalable Matrix Extension (SME), relative to total instructions.
- Integer(%): Percentage of SME instructions performing integer matrix operations (such as integer multiplication and addition) relative to the total number of SME instructions.
- Floating Point(%): Percentage of SME instructions performing floating-point matrix operations (such as floating-point multiplication and addition) relative to the total number of SME instructions.
- Crypto(%): Percentage of CPU hardware encryption instructions (such as AES, SHA, and RSA) to total instructions, reflecting the encryption and decryption intensity of the workload.
- Branches(%): Percentage of branch instructions, such as conditional branches, unconditional branches, and function calls, to the total instructions, reflecting the instruction flow complexity of the workload.
- Immediate(%): Percentage of instructions that use immediate operands relative to total instructions executed by the processor. Immediates are constant values encoded directly in instructions (rather than read from registers or memory), for example, a load-immediate instruction that loads 5 into a register, reducing memory access and improving execution efficiency.
- Return(%): Percentage of executed instructions that are function return instructions. This type of instruction returns from a function call and restores the program execution context (for example, restoring the stack pointer and returning to the call site). It is a core instruction in the function call mechanism.
- Indirect(%): Percentage of executed instructions that are indirect instructions. An indirect instruction is one whose target address is not encoded directly in the instruction but is obtained indirectly through a register or memory (for example, indirect jumps and indirect calls). Indirect instructions are commonly used in scenarios such as function pointers and virtual function calls.
- Barriers(%): Percentage of executed instructions that are memory barrier instructions, enforcing memory ordering and reflecting the synchronization requirements of multi-threaded/multi-core workloads.
- Instruction Synchronization(%): Percentage of instruction synchronization instructions relative to total instructions. This type of instruction ensures proper ordering of instruction execution, for example, flushing the instruction pipeline and synchronizing execution across cores, to prevent errors caused by out-of-order execution.
- Data Synchronization(%): Percentage of data synchronization instructions relative to total instructions. This type of instruction controls the visibility and consistency of memory data, for example, by flushing caches and synchronizing memory accesses across multiple cores. It is commonly used in multi-threaded data sharing scenarios.
- Data Memory(%): Percentage of instructions that access data memory, relative to total instructions. This type of instruction includes reading data from memory into a register and writing register data back to memory. It is a core instruction for processor–memory interaction, and its proportion reflects the memory access intensity of a program.
- Not Retired(%): Percentage of instructions that are executed but not retired by the processor. Retirement means that an instruction's result is finally confirmed and written to the architectural state (for example, registers and memory). Instructions may fail to retire due to execution faults, rollbacks from branch mispredictions, or pipeline flushes. This metric reflects the overhead caused by invalid or discarded instructions in the processor pipeline.
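Each of the instruction-mix percentages above is a class count divided by the total instruction count. A minimal sketch, assuming disjoint classes for illustration (real classes such as Memory, Load, and Store overlap, and the class names and counts here are synthetic):

```python
# Turn per-class instruction counts into mix percentages like the
# Integer(%) / Memory(%) / Branches(%) columns above.

def instruction_mix(counts: dict) -> dict:
    """Percentage of each instruction class relative to the total count."""
    total = sum(counts.values())
    return {cls: n * 100 / total for cls, n in counts.items()}

mix = instruction_mix({"integer": 500, "memory": 300, "branch": 150, "fp": 50})
print(mix["integer"], mix["branch"])  # 50.0 15.0
```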
CPU Metrics – Load_avg
- recent 1 min: Average system load over the past minute, reflecting short-term load fluctuations and sensitive to bursts of traffic or tasks.
- recent 5 min: Average system load over the past 5 minutes, reflecting medium-term load trends while smoothing out short-term bursts.
- recent 15 min: Average system load over the past 15 minutes, reflecting long-term load trends and indicating whether the system is continuously overloaded.
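On Linux these three averages are the first three fields of /proc/loadavg (also exposed via `os.getloadavg()`). A small parsing sketch using a synthetic sample line:

```python
# Parse a /proc/loadavg-style line into the 1/5/15-minute load averages.
# The remaining fields (runnable/total tasks, last PID) are ignored here.

def parse_loadavg(line: str) -> tuple:
    one, five, fifteen = line.split()[:3]
    return float(one), float(five), float(fifteen)

print(parse_loadavg("0.52 0.58 0.59 1/234 5678"))  # (0.52, 0.58, 0.59)
```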
CPU Metrics – Softirqs
- NET_TX/s: Number of software interrupts triggered per second to process sent network data packets, reflecting kernel processing pressure on the network stack.
- NET_RX/s: Number of software interrupts triggered per second to process received network data packets, reflecting kernel processing pressure on the network stack.
- BLOCK/s: Number of software interrupts triggered per second to process drive or block device I/O, reflecting kernel processing pressure from block device I/O.
- SCHED/s: Number of scheduler soft interrupts triggered per second for process and thread scheduling, reflecting kernel scheduling pressure.
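The per-second rates above come from differencing two snapshots of the cumulative counters in /proc/softirqs. A sketch with synthetic snapshots, assuming each softirq's per-CPU columns have already been summed into one total per name:

```python
# Compute NET_RX/s, SCHED/s, etc. from two snapshots of cumulative
# softirq counts taken interval_s seconds apart.

def softirq_rates(prev: dict, cur: dict, interval_s: float) -> dict:
    """Per-second rate for each softirq type between two snapshots."""
    return {name: (cur[name] - prev[name]) / interval_s for name in cur}

rates = softirq_rates({"NET_RX": 1000, "SCHED": 500},
                      {"NET_RX": 3000, "SCHED": 1500}, 2.0)
print(rates["NET_RX"], rates["SCHED"])  # 1000.0 500.0
```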
CPU Metrics – CPU_stat
- ctx_switches/s: Number of process or thread context switches per second on a single CPU core.
- interrupts/s: Number of hardware interrupts received per second by a single CPU core, triggered by peripherals (such as NICs, drives, and timers), reflecting hardware interrupt pressure.
- soft_interrupts/s: Number of software interrupts processed per second by a single CPU core. It is an important metric of the kernel's asynchronous task load.
- cswch/s: Number of voluntary context switches of a process per second. It indicates how often a process yields the CPU voluntarily (for example, waiting for I/O or sleeping), reflecting its blocking and scheduling behavior.
- nvcswch/s: Number of involuntary context switches per process per second. It indicates how often a process is preempted due to time slice expiration or higher-priority tasks, reflecting CPU contention and scheduling pressure.
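Linux exposes the two per-process counters behind cswch/s and nvcswch/s as `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches` in /proc/&lt;pid&gt;/status; the rates are deltas over the sampling interval. A sketch with synthetic samples:

```python
# Derive cswch/s (voluntary) and nvcswch/s (involuntary) for one process
# from two (voluntary, nonvoluntary) counter samples interval_s apart.

def ctx_switch_rates(prev: tuple, cur: tuple, interval_s: float) -> tuple:
    cswch = (cur[0] - prev[0]) / interval_s    # yielded the CPU (I/O wait, sleep)
    nvcswch = (cur[1] - prev[1]) / interval_s  # preempted by the scheduler
    return cswch, nvcswch

print(ctx_switch_rates((100, 10), (160, 30), 2.0))  # (30.0, 10.0)
```

A high nvcswch/s relative to cswch/s suggests CPU contention rather than a blocking workload.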
CPU Metrics – CPU_percent
- %user: Percentage of CPU time spent executing user-mode programs (applications), reflecting the CPU usage of applications.
- %nice: Percentage of CPU time spent executing low-priority user-mode programs, reflecting the CPU usage of low-priority applications.
- %system: Percentage of CPU time spent executing kernel-mode programs (system calls and kernel tasks), reflecting the CPU usage of the OS kernel.
- %idle: Percentage of CPU time spent idle with no tasks executed. It is a key metric of overall CPU load.
- %iowait: Percentage of CPU time spent idle while waiting for drive or block device I/O to complete, reflecting CPU idleness caused by I/O bottlenecks.
- %irq: Percentage of CPU time spent processing hardware interrupts, reflecting CPU usage by hardware interrupts.
- %softirq: Percentage of CPU time spent processing software interrupts, reflecting CPU usage by software interrupts.
- %steal: Percentage of CPU time spent waiting while the hypervisor schedules the CPU to other virtual machines, reflecting CPU resource contention in the virtualization environment.
- %guest: Percentage of CPU time spent executing guest-mode programs in virtual machines.
- %guest_nice: Percentage of CPU time spent executing low-priority guest-mode programs in virtual machines.
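These percentages are derived from deltas of the cumulative per-CPU jiffy counters in /proc/stat. A sketch over two synthetic snapshots, using the first eight /proc/stat fields in their standard order (user, nice, system, idle, iowait, irq, softirq, steal):

```python
# Each category's share is its jiffy delta over the total delta between
# two /proc/stat snapshots.

def cpu_percent(prev: tuple, cur: tuple) -> dict:
    deltas = [c - p for p, c in zip(prev, cur)]
    total = sum(deltas)
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    return {n: d * 100 / total for n, d in zip(names, deltas)}

pct = cpu_percent((100, 0, 50, 800, 50, 0, 0, 0),
                  (300, 0, 150, 1400, 150, 0, 0, 0))
print(pct["user"], pct["idle"])  # 20.0 60.0
```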
CPU Metrics – CPU_percent (in Process Mode)
- %usr: Percentage of CPU time spent executing user-mode programs (application code, excluding system calls), reflecting process-level CPU usage.
- %system: Percentage of CPU time spent executing kernel-mode programs (system calls and kernel tasks), reflecting the OS kernel CPU usage caused by processes.
- %wait: Percentage of CPU time spent idle while waiting for I/O operations (for example, drive or network) to complete, reflecting the impact of I/O bottlenecks on processes.
- %CPU: Total CPU usage of a process, typically the sum of user-mode and kernel-mode CPU time, reflecting the overall CPU utilization of processes.
Memory Access Metrics – DDRC
- DDRC DEVICE: ID of the hardware device associated with the double data rate controller (DDRC), used to identify the specific device.
- NUMA: ID of the NUMA node associated with the DDRC, reflecting the NUMA topology relationship between memory and CPU.
- ddrc_rd_bw: Actual DDRC read bandwidth (typically in MB/s), reflecting the memory read rate.
- ddrc_wr_bw: Actual DDRC write bandwidth (typically in MB/s), reflecting the memory write rate.
Memory Access Metrics – HHA
- HHA DEVICE: ID of the hardware device associated with the HCCS Home Agent (HHA), used to identify the specific device.
- NUMA: ID of the NUMA node associated with the HHA, reflecting the NUMA topology relationship of the hardware device.
- rx_ops_num: Total number of operations received by the HHA.
- rx_outer: Number of operations received by the HHA from other sockets.
- rx_sccl: Number of operations received by the HHA from other Super Core Clusters (SCCLs) on the same socket.
Memory Access Metrics – Miss Latency
- latency: Memory access latency, measured in CPU clock cycles. L2 Miss Latency indicates the latency of accessing L3 when an L2 cache miss occurs, and L3 Miss Latency indicates the latency of accessing DDR memory when an L3 cache miss occurs.
- cycles_max: Maximum latency observed in the collected memory access records.
- cycles_min: Minimum latency observed in the collected memory access records.
- cycles_avg: Average latency observed in the collected memory access records. The average value is used to measure workload access latency on L3 and DDR.
Memory Access Metrics – Mem_info
- total(GB): Total system physical memory capacity, a fixed hardware value used as a reference for evaluating memory usage.
- available(GB): Memory capacity (including free memory and reclaimable cache) that can be allocated immediately to applications. It is a key metric for actual available memory.
- %percent: Percentage of used memory relative to total memory, serving as a key metric of overall memory load.
- used(GB): Physical memory in use, including memory allocated to applications, the kernel, and caches.
- free(GB): Amount of physical memory not allocated by the system. It is normal for this value to be small because caches also occupy memory.
- active(GB): Memory currently used by applications or the kernel that is unlikely to be reclaimed soon.
- inactive(GB): Memory that was used but is currently idle and can be reclaimed by the system.
- buffers(GB): Memory used by the kernel to buffer block device I/O data.
- cached(GB): Memory used by the kernel to cache file system data. It allows the system to utilize idle memory to improve I/O performance.
- shared(GB): Memory shared by multiple processes or threads, such as shared libraries or inter-process communication.
- slab(GB): Memory used by the kernel to cache small objects.
- RSS(GB): Amount of physical memory occupied by a process, excluding memory swapped out to drives. It reflects the process's actual physical memory usage.
- VSZ(GB): Total virtual address space accessible to a process, including code, data, shared libraries, and unallocated memory. It reflects the process's virtual memory size.
- %MEM: Percentage of total physical memory used by a process.
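The %percent and %MEM fields above reduce to simple ratios over total physical memory, with used memory conventionally computed as total minus available. A sketch with synthetic sizes in GB:

```python
# Overall memory load and per-process memory share as simple ratios.

def mem_percent(total_gb: float, available_gb: float) -> float:
    """%percent: used share of total, where used = total - available."""
    return (total_gb - available_gb) * 100 / total_gb

def process_mem_percent(rss_gb: float, total_gb: float) -> float:
    """%MEM: a process's resident set as a share of total physical memory."""
    return rss_gb * 100 / total_gb

print(mem_percent(64.0, 16.0))          # 75.0
print(process_mem_percent(4.0, 64.0))   # 6.25
```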
Memory Access Metrics – Swap_mem
- total(GB): Total capacity of the system swap space. It is a fixed value configured by the system and serves as a reference for evaluating swap usage.
- used(GB): Swap space currently in use. A steadily increasing value indicates that system physical memory may be insufficient.
- free(GB): Swap space currently not in use.
- %percent: Percentage of swap space currently in use relative to total swap capacity.
- sin(GB): Amount of data read from the swap partition into physical memory, reflecting read pressure when physical memory is insufficient.
- sout(GB): Amount of data written from physical memory to the swap partition, reflecting write pressure when physical memory is insufficient.
- PSS(GB): Memory usage calculated by evenly dividing shared memory among all processes, reflecting the actual physical memory used by a process.
- USS(GB): Amount of memory used exclusively by a process, excluding shared memory. It reflects memory that cannot be reclaimed by other processes.
- Swap(GB): Amount of swap memory used by a process, reflecting disk swap usage and potential memory pressure.
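The difference between PSS and USS above is how shared memory is attributed: USS excludes it entirely, while PSS splits each shared region evenly among the processes that map it. A sketch with synthetic region sizes (GB) and sharer counts:

```python
# PSS = private memory (the USS) plus each shared region's size divided
# by the number of processes sharing it.

def pss(private_gb: float, shared_regions: list) -> float:
    """shared_regions: list of (size_gb, sharer_count) pairs."""
    return private_gb + sum(size / sharers for size, sharers in shared_regions)

# 1 GB private, a 2 GB region shared by 4 processes, a 1 GB region shared by 2.
print(pss(1.0, [(2.0, 4), (1.0, 2)]))  # 2.0
```

Summing PSS across all processes therefore approximates total physical memory in use, which summing RSS would overcount.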
IO Metrics – PCIe
- PCIE DEVICE: ID of a peripheral connected to the PCIe bus, used to identify a specific PCIe device.
- rx_rd_bw: Bandwidth (MB/s) from the CPU to the device, which appears as write bandwidth from the CPU's perspective.
- rx_wr_bw: Bandwidth (MB/s) from the device to the CPU, which appears as read bandwidth from the CPU's perspective.
IO Metrics – PA
- PA DEVICE: ID of the hardware device on the PA bus, used to identify the specific device.
- PA2Ring_bw: Bandwidth (MB/s) for data transfer from the PA bus to the Ring bus, reflecting the unidirectional transfer capability from PA to Ring.
- PA2Ring_linkX_bw: Transfer bandwidth (MB/s) of a specific link from the PA bus to the Ring bus, where X indicates the link number. It represents the link-level breakdown of PA2Ring_bw.
- Ring2PA_bw: Bandwidth (MB/s) for data transfer from the Ring bus to the PA bus, reflecting the unidirectional transfer capability from Ring to PA.
- Ring2PA_linkX_bw: Transfer bandwidth (MB/s) of a specific link from the Ring bus to the PA bus, where X indicates the link number. It represents the link-level breakdown of Ring2PA_bw.
- Ring2PAs: Total bandwidth (MB/s) for the Ring bus to transfer data to all PA buses, reflecting the overall external transfer capability of the Ring bus.
IO Metrics – IO_info
- device: System ID of a block device (for example, sda or nvme0n1), used to identify a specific drive.
- tps: Number of I/O operations (reads and writes) completed by a block device per second, reflecting the device's I/O operation frequency.
- rkB/s: Amount of data read from a drive by a block device per second, reflecting the device's read throughput.
- wkB/s: Amount of data written to a drive from a block device per second, reflecting the device's write throughput.
- dkB/s: Amount of data discarded per second by a block device (discard requests, for example TRIM), reflecting the device's discard throughput.
- areq-sz: Average data size of a single I/O request processed by a block device, reflecting the I/O request characteristics of the workload.
- aqu-sz: Average number of I/O requests waiting in the request queue of a block device, reflecting the device's I/O queuing pressure.
- await: Average time for a block device to process an I/O request, including both queueing and actual processing time, reflecting overall I/O response latency.
- %util: Percentage of time a block device is busy processing I/O requests. It is a key metric of whether the device is fully loaded.
- kB_rd/s: Amount of data read from a drive by a block device per second (KB/s), reflecting the device's read throughput.
- kB_wr/s: Amount of data written to a drive from a block device per second (KB/s), reflecting the device's write throughput.
- kB_ccwr/s: Amount of data written by a process per second that is canceled or delayed by the kernel due to page cache writeback optimizations (KB/s), reflecting write I/O buffering and merging behavior.
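The derived fields areq-sz, await, and %util are ratios over one sampling interval. A sketch with synthetic interval totals for a single block device:

```python
# iostat-style derived metrics from one interval's totals.

def areq_sz(rkb: float, wkb: float, requests: int) -> float:
    """Average request size in KB per completed I/O."""
    return (rkb + wkb) / requests

def await_ms(total_wait_ms: float, requests: int) -> float:
    """Average time per I/O, queueing plus service, in milliseconds."""
    return total_wait_ms / requests

def util_percent(busy_ms: float, interval_ms: float) -> float:
    """Share of the interval the device spent busy with I/O."""
    return busy_ms * 100 / interval_ms

print(areq_sz(4096, 4096, 256))   # 32.0
print(await_ms(1280, 256))        # 5.0
print(util_percent(250, 1000))    # 25.0
```

Note that for devices that serve requests in parallel (for example, NVMe SSDs), %util can reach 100% well before the device's real throughput limit.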
Net Metrics – IO_info
- IFACE: System identifier of a network interface (for example, eth0 and ens33), used to identify a specific NIC.
- rxpck/s: Number of network packets received by a network interface per second, reflecting the packet reception load.
- txpck/s: Number of network packets sent by a network interface per second, reflecting the packet sending load.
- rxkB/s: Amount of data received by a network interface per second, reflecting the reception bandwidth.
- txkB/s: Amount of data sent by a network interface per second, reflecting the transmit bandwidth.
- rxcmp/s: Number of compressed network packets received by a network interface per second, reflecting the load of receiving compressed packets.
- txcmp/s: Number of compressed network packets sent by a network interface per second, reflecting the load of sending compressed packets.
- rxmcst/s: Number of multicast packets received by a network interface per second, reflecting the multicast packet processing load.
- %ifutil: Percentage of the network interface's actual transmission bandwidth relative to its nominal bandwidth. It is a key metric of whether the interface is fully loaded.
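A sketch of how %ifutil can be derived, assuming a full-duplex interface, where utilization is taken as the larger of the receive and transmit rates relative to line speed (a half-duplex link would instead use their sum; all inputs here are synthetic):

```python
# Interface utilization from rxkB/s, txkB/s, and the nominal link speed.

def ifutil_percent(rx_kb_s: float, tx_kb_s: float, speed_mbit: float) -> float:
    rx_mbit = rx_kb_s * 8 / 1000   # kB/s -> Mbit/s
    tx_mbit = tx_kb_s * 8 / 1000
    return max(rx_mbit, tx_mbit) * 100 / speed_mbit

# 12,500 kB/s received on a 1000 Mbit/s link -> 100 Mbit/s -> 10% utilization.
print(ifutil_percent(12_500, 5_000, 1000))  # 10.0
```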