2 changes: 2 additions & 0 deletions projects/rocprofiler-compute/CHANGELOG.md
@@ -23,6 +23,8 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
* Synced latest metric descriptions to public facing documentation
* Updated metric units to be more human readable in public facing documentation

* Added missing metric descriptions for gfx950 architecture

### Changed

* Default output format for the underlying ROCprofiler-SDK tool has been changed from ``csv`` to ``rocpd``.
116 changes: 116 additions & 0 deletions projects/rocprofiler-compute/docs/data/metrics_description.yaml

Large diffs are not rendered by default.

0500_command_processor_cpc_cpf.yaml
@@ -153,6 +153,15 @@ Panel Config:
translation.
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
CPC SYNC FIFO Full Rate: Percent of CPC sync counter request FIFO busy cycles
where the FIFO was full. High values indicate backpressure in synchronization
counter processing.
CPC CANE Stall Rate: Percent of CPC CANE bus busy cycles where sync counter
requests were stalled. High values indicate contention on the CANE bus for
synchronization counter operations.
CPC ADC Utilization: Percent of ADC busy cycles spent dispatching thread groups.
The ADC is responsible for sending workgroups from the command processor to
the workgroup manager.
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
for processing.
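Most of the CPC entries above are ratios of the form "cycles in some state" over "busy cycles". A minimal sketch of that arithmetic, using invented counter names and values (none of these identifiers are actual ROCprofiler-SDK counter IDs):

```python
# Invented counter names and values, for illustration only; these are not the
# hardware counter IDs used by rocprofiler-compute.
def percent_of(part_cycles: int, busy_cycles: int) -> float:
    """Express part_cycles as a percentage of busy_cycles (0 when idle)."""
    return 100.0 * part_cycles / busy_cycles if busy_cycles else 0.0

counters = {
    "cpc_busy_cycles": 1_200_000,
    "cpc_stall_cycles": 240_000,
    "sync_fifo_busy_cycles": 800_000,
    "sync_fifo_full_cycles": 120_000,
}

# CPC Stall Rate: stalled cycles over CPC busy cycles.
cpc_stall_rate = percent_of(counters["cpc_stall_cycles"], counters["cpc_busy_cycles"])
# CPC SYNC FIFO Full Rate: FIFO-full cycles over FIFO busy cycles.
sync_fifo_full_rate = percent_of(counters["sync_fifo_full_cycles"],
                                 counters["sync_fifo_busy_cycles"])
print(f"CPC Stall Rate:          {cpc_stall_rate:.1f}%")       # 20.0%
print(f"CPC SYNC FIFO Full Rate: {sync_fifo_full_rate:.1f}%")  # 15.0%
```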
0600_workgroup_manager_spi.yaml
@@ -181,10 +181,15 @@ Panel Config:
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
unit: Percent
metrics_description:
Schedule-Pipe Wave Occupancy: Total number of waves occupying all scheduler-pipe
queues. The SPI has 4 scheduler-pipes, each with 8 hardware queues.
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
was actively doing any work.
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
kernel where the scheduler-pipes were actively doing any work.
Scheduler-Pipe Wave Utilization: Percent of total scheduler-pipe cycles when waves
are actively resident in the scheduler-pipe structures. This measures wave presence
in the pipes before dispatch to compute units.
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
manager was actively doing any work.
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
@@ -208,6 +213,9 @@ Panel Config:
The percent of total scheduler-pipe cycles in the kernel where a workgroup
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
rather than a lack of a CU or SIMD with sufficient resources.
Scheduler-Pipe FIFO Full Rate: Percent of workgroup manager busy cycles where
the event/wave order FIFO was full. High values indicate backpressure in the
scheduler-pipe dispatch ordering logic.
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
where a workgroup could not be scheduled to a CU due to occupancy limitations
(like a lack of a CU or SIMD with sufficient resources).
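The max: expression near the top of this hunk, MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)), is a typical panel-config formula. A rough sketch of how such an expression reduces to a percentage, assuming MAX reduces over per-instance samples of the counter; the sample values and the CU count below are made up:

```python
# Made-up inputs; this only illustrates the shape of the panel-config expression
# MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)).
# The assumption that MAX reduces over per-instance counter samples is ours.
spi_ra_wvlim_stall_csn = [1_000, 4_000, 2_500, 500]  # hypothetical per-instance samples
grbm_gui_active_per_xcd = 2_000_000                  # hypothetical GPU-active cycles
cu_per_gpu = 104                                     # hypothetical CU count

wvlim_stall_pct = max(
    400 * stall / (grbm_gui_active_per_xcd * cu_per_gpu)
    for stall in spi_ra_wvlim_stall_csn
)
print(f"Scheduler-pipe wave-limit stall: {wvlim_stall_pct:.5f}%")
```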
1000_compute_units_instruction_mix.yaml
@@ -297,6 +297,9 @@ Panel Config:
per normalization unit.
Spill/Stack Instr: The total number of spill/stack memory instructions executed
on all compute units on the accelerator, per normalization unit.
Spill/Stack Coalesceable Instr: The total number of coalesceable spill/stack memory
instructions executed on all compute units, per normalization unit. Higher values
indicate better memory access patterns for private memory operations.
Spill/Stack Read: The total number of spill/stack memory read instructions executed
on all compute units on the accelerator, per normalization unit.
Spill/Stack Write: The total number of spill/stack memory write instructions executed
@@ -318,3 +321,6 @@ Panel Config:
normalization unit.
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
normalization unit.
MFMA-F6F4: The total number of 4-bit and 6-bit floating point MFMA instructions
issued per normalization unit. This is supported in AMD Instinct MI350 series
(gfx950) and later only.
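The new Spill/Stack Coalesceable Instr counter is easiest to read relative to the total spill/stack instruction count. A hypothetical post-processing snippet; the counts are invented and the derived percentage is not itself a metric defined in these configs:

```python
# Invented counts, per normalization unit; the derived percentage below is an
# illustration, not a metric that rocprofiler-compute reports directly.
spill_stack_instr = 18_432          # "Spill/Stack Instr"
spill_stack_coalesceable = 13_824   # "Spill/Stack Coalesceable Instr"

coalesceable_pct = 100.0 * spill_stack_coalesceable / max(spill_stack_instr, 1)
print(f"Coalesceable share of spill/stack instructions: {coalesceable_pct:.1f}%")  # 75.0%
```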
1100_compute_units_compute_pipeline.yaml
@@ -324,6 +324,13 @@ Panel Config:
dual-issued instructions. Computed as the ratio of the total number of cycles
spent by the scheduler co-issuing VALU instructions over the total
CU cycles.
VALU Co-Issue Efficiency: >-
The ratio of quad-cycles where two VALU instructions were co-issued (executed
simultaneously) to quad-cycles where only a single VALU instruction was issued.
Measures how efficiently the kernel leverages the MI350's ability to co-issue
VALU instructions. Unlike Dual-issue VALU Utilization which measures percentage
of total time, this metric can exceed 100% when dual-issue cycles outnumber
single-issue cycles.
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
was busy executing instructions, including both global/generic and spill/scratch
operations (see the VMEM instruction count metrics for more detail). Does not
@@ -351,6 +358,9 @@ Panel Config:
the VALU or MFMA units, per normalization unit.
IOPs (Total): The total number of integer operations executed on either the VALU
or MFMA units, per normalization unit.
F8 OPs: >-
The total number of 8-bit floating-point operations executed on MFMA units,
per normalization unit. F8 (FP8) uses either E4M3 or E5M2 format.
F16 OPs: The total number of 16-bit floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
BF16 OPs: The total number of 16-bit brain floating-point operations executed
@@ -359,5 +369,9 @@ Panel Config:
the VALU or MFMA units, per normalization unit.
F64 OPs: The total number of 64-bit floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
F6F4 OPs: >-
The total number of 4-bit and 6-bit floating-point operations executed on
MFMA units, per normalization unit. F6/F4 formats are supported in AMD Instinct
MI350 series (gfx950) and later only.
INT8 OPs: The total number of 8-bit integer operations executed on either the
VALU or MFMA units, per normalization unit.
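The distinction drawn in the VALU Co-Issue Efficiency description above (a ratio of co-issued to single-issued quad-cycles, rather than a share of total time) is easiest to see with numbers. A worked example with invented cycle counts:

```python
# Invented quad-cycle counts for one kernel; only the arithmetic matters here.
total_cu_quad_cycles     = 1_000_000
coissued_quad_cycles     = 300_000  # two VALU instructions issued together
single_issue_quad_cycles = 200_000  # exactly one VALU instruction issued

# Dual-issue VALU Utilization style: share of all CU cycles spent co-issuing.
dual_issue_util = 100.0 * coissued_quad_cycles / total_cu_quad_cycles           # 30.0%

# VALU Co-Issue Efficiency style: co-issued vs. single-issued quad-cycles.
co_issue_efficiency = 100.0 * coissued_quad_cycles / single_issue_quad_cycles   # 150.0%

print(f"Dual-issue VALU Utilization ~ {dual_issue_util:.1f}% of CU cycles")
print(f"VALU Co-Issue Efficiency    ~ {co_issue_efficiency:.1f}% (exceeds 100% here)")
```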
1200_local_data_share_lds.yaml
@@ -155,6 +155,24 @@ Panel Config:
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
LDS LOAD Bandwidth: >-
The effective bandwidth of LDS load operations, accounting for the work-items
executed (execution mask). Calculated as the total bytes loaded from LDS
divided by the kernel duration.
LDS STORE Bandwidth: >-
The effective bandwidth of LDS store operations, accounting for the work-items
executed (execution mask). Calculated as the total bytes stored to LDS
divided by the kernel duration.
LDS ATOMIC Bandwidth: >-
The effective bandwidth of LDS atomic operations, accounting for the work-items
executed (execution mask). Calculated as the total bytes accessed by LDS
atomic operations divided by the kernel duration.
LDS LOAD: >-
The total number of LDS load instructions issued per normalization unit.
LDS STORE: >-
The total number of LDS store instructions issued per normalization unit.
LDS ATOMIC: >-
The total number of LDS atomic instructions issued per normalization unit.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
bank conflicts over the number of LDS cycles that would have been required to
@@ -183,4 +201,12 @@ Panel Config:
Mem Violations: >-
The total number of out-of-bounds accesses made to the LDS, per normalization
unit. This is unused and expected to be zero in most configurations for
modern CDNA\u2122 accelerators.
modern CDNA™ accelerators.
LDS Command FIFO Full Rate: >-
The number of cycles where the LDS command FIFO was full, per normalization
unit. High values indicate backpressure in LDS instruction dispatch, which
may stall wavefronts waiting to issue LDS operations.
LDS Data FIFO Full Rate: >-
The number of cycles where the LDS data FIFO was full, per normalization
unit. High values indicate backpressure in the LDS data return path, which
may stall wavefronts waiting for LDS read/atomic results.
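A small sketch of the arithmetic behind the LDS bandwidth and Bank Conflict Rate descriptions above, with invented numbers (bytes over kernel duration, and conflict-servicing cycles over the cycles the accesses would otherwise have required):

```python
# Invented values; bytes moved / kernel duration, as in the LDS LOAD/STORE/ATOMIC
# Bandwidth descriptions above.
lds_bytes_loaded   = 512 * 1024 * 1024  # execution-mask-aware bytes loaded from LDS
kernel_duration_ns = 2_000_000          # 2 ms kernel

lds_load_bw = lds_bytes_loaded / kernel_duration_ns  # bytes per ns == GB/s
print(f"LDS LOAD Bandwidth ~ {lds_load_bw:.1f} GB/s")  # ~268.4 GB/s

# Bank Conflict Rate, as described above: cycles spent servicing bank conflicts over
# the cycles the same accesses would have required (both values invented).
conflict_cycles = 30_000
required_cycles = 200_000
print(f"Bank Conflict Rate ~ {100.0 * conflict_cycles / required_cycles:.1f}%")  # 15.0%
```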
1500_address_processing_unit_and_data_return_path_ta_td.yaml
@@ -216,6 +216,11 @@ Panel Config:
Global/Generic Read Instructions: The total number of global & generic memory
read instructions executed on all compute units on the accelerator, per normalization
unit.
Global/Generic Read Instructions for LDS: >-
The total number of global & generic memory read instructions that return
data directly to LDS, executed on all compute units on the accelerator, per
normalization unit. These operations bypass the register file and write
results directly into LDS memory.
Global/Generic Write Instructions: The total number of global & generic memory
write instructions executed on all compute units on the accelerator, per normalization
unit.
@@ -226,6 +231,11 @@ Panel Config:
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Read Instructions for LDS: >-
The total number of spill/stack memory read instructions that return data
directly to LDS, executed on all compute units on the accelerator, per
normalization unit. These operations bypass the register file and write
results directly into LDS memory.
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
1600_vector_l1_data_cache.yaml
@@ -416,6 +416,26 @@ Panel Config:
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
waiting to issue a request for data to the L2 cache divided by the number of
cycles where the vL1D is active.
Stalled on Address: >-
The ratio of the number of cycles where the vL1D is stalled waiting for the
address processing unit to send address information, divided by the number
of cycles where the vL1D is active.
Stalled on Data: >-
The ratio of the number of cycles where the vL1D is stalled waiting for the
address processing unit to send write/atomic data, divided by the number of
cycles where the vL1D is active.
Stalled on Latency FIFO: >-
The ratio of the number of cycles where the vL1D is stalled because the
latency FIFO (tracking outstanding requests) is full, divided by the number
of cycles where the vL1D is active.
Stalled on Request FIFO: >-
The ratio of the number of cycles where the vL1D is stalled because the
request FIFO (queuing memory requests) is full, divided by the number of
cycles where the vL1D is active.
Stalled on Read Return: >-
The ratio of the number of cycles where the vL1D is stalled waiting for read
data to return from L2 before it can write into the cache, divided by the
number of cycles where the vL1D is active.
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
due to Read requests with conflicting tags being looked up concurrently, divided
by the number of cycles where the vL1D is active.
@@ -454,6 +474,22 @@ Panel Config:
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
as a full cache line.
Tag RAM 0 Req: >-
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.
Tag RAM 1 Req: >-
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.
Tag RAM 2 Req: >-
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.
Tag RAM 3 Req: >-
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.
Comment on lines +478 to +492

Copilot AI Jan 27, 2026

The descriptions for Tag RAM 0-3 Req metrics state "L2 cache requests from this vL1D" which is potentially confusing. These are more accurately described as requests within the vL1D cache lookup process that need to check Tag RAM banks, not L2 cache requests. The wording could be clarified to say "vL1D cache requests that mapped to Tag RAM bank X" or "cache line lookups that mapped to Tag RAM bank X" to better reflect that these are vL1D internal operations.

Suggested change
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 0 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.
Tag RAM 1 Req: >-
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 1 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.
Tag RAM 2 Req: >-
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 2 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.
Tag RAM 3 Req: >-
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 3 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.

L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
by the vL1D and must be retrieved from the L2 Cache, per normalization
unit.
@@ -470,6 +506,38 @@ Panel Config:
L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
cache took to issue and receive acknowledgement of a write request to the L2
Cache. This number also includes requests for atomics without return values.
Inflight Req: >-
The sum of inflight client-to-UTCL1 translation requests per cycle, per
normalization unit. Measures the average number of translation requests
actively being processed by the UTCL1 at any given time.
Misses under Translation Miss: >-
The number of translation requests that missed in the UTCL1 while another
miss for the same translation was already being processed, per normalization
unit. These are secondary misses that occur before the first miss completes.
Cache Full Stall: >-
The number of cycles the UTCL1 was stalled due to the inflight request
counter reaching its maximum capacity, per normalization unit. Indicates
the translation cache cannot accept more requests.
Cache Miss Stall: >-
The number of cycles the UTCL1 was stalled due to multiple cache misses
being arbitrated simultaneously, per normalization unit.
Serialization Stall: >-
The number of cycles the UTCL1 was stalled due to serializing translation
requests through the cache, per normalization unit.
Thrashing Stall: >-
The number of cycles the UTCL1 was stalled due to cache thrashing (rapid
eviction and reload of translations), per normalization unit. This is an
estimation when thrashing detection is active.
Latency FIFO Stall: >-
The number of cycles the UTCL1-to-UTCL2 latency-hiding FIFO was full,
per normalization unit. This FIFO queues requests to the L2 translation
cache (UTCL2).
Resident Page Full Stall: >-
The number of cycles the UTCL1 was stalled because the latency-hiding
FIFO output indicated a non-resident page, per normalization unit.
UTCL2 Stall: >-
The number of cycles the UTCL1 was stalled due to running out of request
credits for the UTCL2 (L2 translation cache), per normalization unit.
NC - Read: Total read requests with NC mtype from this TCP to all TCCs. Sum over
TCP instances per normalization unit.
UC - Read: Total read requests with UC mtype from this TCP to all TCCs. Sum over
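Since the Tag RAM 0-3 Req descriptions note that distribution across banks affects lookup efficiency, here is a rough way to check for bank imbalance from those four counters; the per-bank counts are invented:

```python
# Invented per-bank request counts for the four "Tag RAM N Req" counters above.
tag_ram_req = [120_000, 118_500, 260_000, 119_000]  # banks 0..3

total = sum(tag_ram_req)
ideal_share = total / len(tag_ram_req)
imbalance = max(tag_ram_req) / ideal_share  # 1.0 means perfectly balanced
print(f"Hottest Tag RAM bank sees {imbalance:.2f}x its ideal share of lookups")
```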
1700_l2_cache.yaml
@@ -622,6 +622,13 @@ Panel Config:
accelerator, however on an MI2XX this corresponds to non-temporal load or stores.
The L2 cache attempts to evict streaming requests before normal requests when
the L2 is at capacity.
Bypass Req: Number of L2 cache bypass requests measured at the tag block. These
requests skip normal cache lookup and go directly to memory, typically for
uncacheable or write-combined memory accesses.
Input Buffer Req: Number of raw memory requests received through the L2 cache
Input Buffer (IB) from compute units and other hardware clients. This counts
all incoming requests before cache lookup processing and represents the total
incoming demand on the L2 cache.
Probe Req: The number of coherence probe requests made to the L2 cache from outside
the accelerator. On an MI2XX, probe requests may be generated by, for example,
writes to fine-grained device memory or by writes to coarse-grained device memory.
@@ -659,6 +666,8 @@ Panel Config:
data from any memory location, per normalization unit.
Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
data from any memory location, per normalization unit.
Read (128B): The total number of L2 requests to Infinity Fabric to read 128B
of data from any memory location, per normalization unit.
Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
data from any memory location, per normalization unit. 64B requests for uncached
data are counted as two 32B uncached data requests.
@@ -700,6 +709,9 @@ Panel Config:
requests due to Infinity Fabric traffic, divided by total duration.
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
HBM traffic, divided by total duration.
Atomic - HBM: The total number of L2 atomic requests (either 32-byte or 64-byte)
destined for the accelerator's local high-bandwidth memory (HBM), per normalization
unit. These are read-modify-write operations that update memory atomically.
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
32B or 64B of data in any memory location, per normalization unit. See Request
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
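The new Read (128B) counter slots into the same bytes-moved arithmetic as the existing 32B and 64B read counters. An illustrative combination with invented request counts (not the exact formula the tool's Read Bandwidth metric uses):

```python
# Invented request counts; combines the 32B/64B/128B L2 -> Infinity Fabric read
# counters above into total bytes and an effective read bandwidth.
read_32b, read_64b, read_128b = 1_000_000, 4_000_000, 500_000
kernel_duration_ns = 3_000_000  # 3 ms

bytes_read = 32 * read_32b + 64 * read_64b + 128 * read_128b
read_bw = bytes_read / kernel_duration_ns  # bytes per ns == GB/s
print(f"L2 -> Infinity Fabric reads: {bytes_read / 1e9:.2f} GB at ~{read_bw:.1f} GB/s")
```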
16 changes: 8 additions & 8 deletions projects/rocprofiler-compute/src/utils/.config_hashes.json
@@ -123,17 +123,17 @@
"0200_system_speed_of_light.yaml": "e86cae8714d463cfcca3069764e73e7e",
"0300_memory_chart.yaml": "2c82fa6f81a0dda679706d36b99e7913",
"0400_roofline.yaml": "2bd3b630b72d6d165c0d30cf481136a9",
"0500_command_processor_cpc_cpf.yaml": "5dd5b6a6f14fc0066a2a581f0a119fc3",
"0600_workgroup_manager_spi.yaml": "c885a0a2f73900a53837f01c5f0e89e6",
"0500_command_processor_cpc_cpf.yaml": "f9e23fb9b86cfdfee28648b64c8ec90c",
"0600_workgroup_manager_spi.yaml": "e4a3128bfa8780266bd6866434601615",
"0700_wavefront.yaml": "e034e7eaca38908b20e3ad1e0c13291f",
"1000_compute_units_instruction_mix.yaml": "6f89306b199221960d259c2a8be958b9",
"1100_compute_units_compute_pipeline.yaml": "cac13334c4fe654e081713f168ed2adf",
"1200_local_data_share_lds.yaml": "f6869eed2bedcd0d94dd1b8d99adc30a",
"1000_compute_units_instruction_mix.yaml": "b0a326a9c0b6d118cb11e6a132581929",
"1100_compute_units_compute_pipeline.yaml": "273d8ae0fd7b3e98dc8d3a6825aba6c7",
"1200_local_data_share_lds.yaml": "7d604e5ec11bff862a0df8b7b898bb56",
"1300_instruction_cache.yaml": "e616b2e4ec05c2d91df43cdaabfc9fea",
"1400_scalar_l1_data_cache.yaml": "393c4aea974c05e45590f3053d66e12e",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "fcfceb3e4236ca995d0f5e24035375df",
"1600_vector_l1_data_cache.yaml": "0d32f74b9b62bdfff277498e392206ee",
"1700_l2_cache.yaml": "0aa72d65590ee1c38812ef545b8a1b8b",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "87aba9c2e08cf2075c919c7180be8370",
"1600_vector_l1_data_cache.yaml": "8669c540b6d83437c8c41fa4f5f66561",
"1700_l2_cache.yaml": "df4b44df63d9f3eb048fc52a9ab56502",
"1800_l2_cache_per_channel.yaml": "5d16f669dbb4fb3fbb28ac14b597a248",
"2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
}
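The hash bumps above track edits to the panel config YAMLs. Assuming each entry is simply the MD5 hex digest of the config file's contents (an assumption; the hashing scheme is not shown in this diff), the values could be regenerated with something like:

```python
# Assumption: each .config_hashes.json entry is the MD5 hex digest of the raw bytes
# of the corresponding panel config file. The path below is hypothetical.
import hashlib
from pathlib import Path

def config_md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

print(config_md5(Path("1700_l2_cache.yaml")))
```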