2 changes: 2 additions & 0 deletions projects/rocprofiler-compute/CHANGELOG.md
@@ -23,6 +23,8 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
* Synced latest metric descriptions to public facing documentation
* Updated metric units to be more human readable in public facing documentation

* Added missing metric descriptions for gfx950 architecture

### Changed

* Default output format for the underlying ROCprofiler-SDK tool has been changed from ``csv`` to ``rocpd``.
116 changes: 116 additions & 0 deletions projects/rocprofiler-compute/docs/data/metrics_description.yaml

Large diffs are not rendered by default.

0500_command_processor_cpc_cpf.yaml
@@ -153,6 +153,15 @@ Panel Config:
translation.
CPC Utilization: Percent of total cycles where the CPC was busy actively doing
any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
CPC SYNC FIFO Full Rate: Percent of CPC sync counter request FIFO busy cycles
where the FIFO was full. High values indicate backpressure in synchronization
counter processing.
CPC CANE Stall Rate: Percent of CPC CANE bus busy cycles where sync counter
requests were stalled. High values indicate contention on the CANE bus for
synchronization counter operations.
CPC ADC Utilization: Percent of ADC busy cycles spent dispatching thread groups.
The ADC is responsible for sending workgroups from the command processor to
the workgroup manager.
CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
for processing.
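Most of the CPC entries above are ratios of the form "cycles in some state" over "busy cycles". A minimal sketch of that arithmetic, using invented counter names and values (none of these identifiers are actual ROCprofiler-SDK counter IDs):

```python
# Invented counter names and values, for illustration only; these are not the
# hardware counter IDs used by rocprofiler-compute.
def percent_of(part_cycles: int, busy_cycles: int) -> float:
    """Express part_cycles as a percentage of busy_cycles (0 when idle)."""
    return 100.0 * part_cycles / busy_cycles if busy_cycles else 0.0

counters = {
    "cpc_busy_cycles": 1_200_000,
    "cpc_stall_cycles": 240_000,
    "sync_fifo_busy_cycles": 800_000,
    "sync_fifo_full_cycles": 120_000,
}

# CPC Stall Rate: stalled cycles over CPC busy cycles.
cpc_stall_rate = percent_of(counters["cpc_stall_cycles"], counters["cpc_busy_cycles"])
# CPC SYNC FIFO Full Rate: FIFO-full cycles over FIFO busy cycles.
sync_fifo_full_rate = percent_of(counters["sync_fifo_full_cycles"],
                                 counters["sync_fifo_busy_cycles"])
print(f"CPC Stall Rate:          {cpc_stall_rate:.1f}%")       # 20.0%
print(f"CPC SYNC FIFO Full Rate: {sync_fifo_full_rate:.1f}%")  # 15.0%
```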
0600_workgroup_manager_spi.yaml
@@ -181,10 +181,15 @@ Panel Config:
max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
unit: Percent
metrics_description:
Schedule-Pipe Wave Occupancy: Total number of waves occupying all scheduler-pipe
queues. The SPI has 4 scheduler-pipes, each with 8 hardware queues.
Accelerator Utilization: The percent of cycles in the kernel where the accelerator
was actively doing any work.
Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
kernel where the scheduler-pipes were actively doing any work.
Scheduler-Pipe Wave Utilization: Percent of total scheduler-pipe cycles when waves
are actively resident in the scheduler-pipe structures. This measures wave presence
in the pipes before dispatch to compute units.
Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
manager was actively doing any work.
Shader Engine Utilization: The percent of total shader engine cycles in the kernel
@@ -208,6 +213,9 @@ Panel Config:
The percent of total scheduler-pipe cycles in the kernel where a workgroup
could not be scheduled to a CU due to a bottleneck within the scheduler-pipes
rather than a lack of a CU or SIMD with sufficient resources.
Scheduler-Pipe FIFO Full Rate: Percent of workgroup manager busy cycles where
the event/wave order FIFO was full. High values indicate backpressure in the
scheduler-pipe dispatch ordering logic.
Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
where a workgroup could not be scheduled to a CU due to occupancy limitations
(like a lack of a CU or SIMD with sufficient resources).
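The max: expression near the top of this hunk, MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)), is a typical panel-config formula. A rough sketch of how such an expression reduces to a percentage, assuming MAX reduces over per-instance samples of the counter; the sample values and the CU count below are made up:

```python
# Made-up inputs; this only illustrates the shape of the panel-config expression
# MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)).
# The assumption that MAX reduces over per-instance counter samples is ours.
spi_ra_wvlim_stall_csn = [1_000, 4_000, 2_500, 500]  # hypothetical per-instance samples
grbm_gui_active_per_xcd = 2_000_000                  # hypothetical GPU-active cycles
cu_per_gpu = 104                                     # hypothetical CU count

wvlim_stall_pct = max(
    400 * stall / (grbm_gui_active_per_xcd * cu_per_gpu)
    for stall in spi_ra_wvlim_stall_csn
)
print(f"Scheduler-pipe wave-limit stall: {wvlim_stall_pct:.5f}%")
```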
1000_compute_units_instruction_mix.yaml
@@ -297,6 +297,9 @@ Panel Config:
per normalization unit.
Spill/Stack Instr: The total number of spill/stack memory instructions executed
on all compute units on the accelerator, per normalization unit.
Spill/Stack Coalesceable Instr: The total number of coalesceable spill/stack memory
instructions executed on all compute units, per normalization unit. Higher values
indicate better memory access patterns for private memory operations.
Spill/Stack Read: The total number of spill/stack memory read instructions executed
on all compute units on the accelerator, per normalization unit.
Spill/Stack Write: The total number of spill/stack memory write instructions executed
@@ -318,3 +321,6 @@ Panel Config:
normalization unit.
MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
normalization unit.
MFMA-F6F4: The total number of 4-bit and 6-bit floating point MFMA instructions
issued per normalization unit. This is supported in AMD Instinct MI350 series
(gfx950) and later only.
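The new Spill/Stack Coalesceable Instr counter is easiest to read relative to the total spill/stack instruction count. A hypothetical post-processing snippet; the counts are invented and the derived percentage is not itself a metric defined in these configs:

```python
# Invented counts, per normalization unit; the derived percentage below is an
# illustration, not a metric that rocprofiler-compute reports directly.
spill_stack_instr = 18_432          # "Spill/Stack Instr"
spill_stack_coalesceable = 13_824   # "Spill/Stack Coalesceable Instr"

coalesceable_pct = 100.0 * spill_stack_coalesceable / max(spill_stack_instr, 1)
print(f"Coalesceable share of spill/stack instructions: {coalesceable_pct:.1f}%")  # 75.0%
```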
1100_compute_units_compute_pipeline.yaml
@@ -324,6 +324,13 @@ Panel Config:
dual-issued instructions. Computed as the ratio of the total number of cycles
spent by the scheduler co-issuing VALU instructions over the total
CU cycles.
VALU Co-Issue Efficiency: >-
The ratio of quad-cycles where two VALU instructions were co-issued (executed
simultaneously) to quad-cycles where only a single VALU instruction was issued.
Measures how efficiently the kernel leverages the MI350's ability to co-issue
VALU instructions. Unlike Dual-issue VALU Utilization which measures percentage
of total time, this metric can exceed 100% when dual-issue cycles outnumber
single-issue cycles.
VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
was busy executing instructions, including both global/generic and spill/scratch
operations (see the VMEM instruction count metrics for more detail). Does not
@@ -351,6 +358,9 @@ Panel Config:
the VALU or MFMA units, per normalization unit.
IOPs (Total): The total number of integer operations executed on either the VALU
or MFMA units, per normalization unit.
F8 OPs: >-
The total number of 8-bit floating-point operations executed on MFMA units,
per normalization unit. F8 (FP8) uses either E4M3 or E5M2 format.
F16 OPs: The total number of 16-bit floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
BF16 OPs: The total number of 16-bit brain floating-point operations executed
@@ -359,5 +369,9 @@ Panel Config:
the VALU or MFMA units, per normalization unit.
F64 OPs: The total number of 64-bit floating-point operations executed on either
the VALU or MFMA units, per normalization unit.
F6F4 OPs: >-
The total number of 4-bit and 6-bit floating-point operations executed on
MFMA units, per normalization unit. F6/F4 formats are supported in AMD Instinct
MI350 series (gfx950) and later only.
INT8 OPs: The total number of 8-bit integer operations executed on either the
VALU or MFMA units, per normalization unit.
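The distinction drawn in the VALU Co-Issue Efficiency description above (a ratio of co-issued to single-issued quad-cycles, rather than a share of total time) is easiest to see with numbers. A worked example with invented cycle counts:

```python
# Invented quad-cycle counts for one kernel; only the arithmetic matters here.
total_cu_quad_cycles     = 1_000_000
coissued_quad_cycles     = 300_000  # two VALU instructions issued together
single_issue_quad_cycles = 200_000  # exactly one VALU instruction issued

# Dual-issue VALU Utilization style: share of all CU cycles spent co-issuing.
dual_issue_util = 100.0 * coissued_quad_cycles / total_cu_quad_cycles           # 30.0%

# VALU Co-Issue Efficiency style: co-issued vs. single-issued quad-cycles.
co_issue_efficiency = 100.0 * coissued_quad_cycles / single_issue_quad_cycles   # 150.0%

print(f"Dual-issue VALU Utilization ~ {dual_issue_util:.1f}% of CU cycles")
print(f"VALU Co-Issue Efficiency    ~ {co_issue_efficiency:.1f}% (exceeds 100% here)")
```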
1200_local_data_share_lds.yaml
@@ -155,6 +155,24 @@ Panel Config:
loaded from, stored to, or atomically updated in the LDS divided by total duration.
Does not take into account the execution mask of the wavefront when the instruction
was executed.
LDS LOAD Bandwidth: >-
The effective bandwidth of LDS load operations, accounting for the work-items
executed (execution mask). Calculated as the total bytes loaded from LDS
divided by the kernel duration.
LDS STORE Bandwidth: >-
The effective bandwidth of LDS store operations, accounting for the work-items
executed (execution mask). Calculated as the total bytes stored to LDS
divided by the kernel duration.
LDS ATOMIC Bandwidth: >-
The effective bandwidth of LDS atomic operations, accounting for the work-items
executed (execution mask). Calculated as the total bytes accessed by LDS
atomic operations divided by the kernel duration.
LDS LOAD: >-
The total number of LDS load instructions issued per normalization unit.
LDS STORE: >-
The total number of LDS store instructions issued per normalization unit.
LDS ATOMIC: >-
The total number of LDS atomic instructions issued per normalization unit.
Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
bank conflicts over the number of LDS cycles that would have been required to
@@ -183,4 +201,12 @@ Panel Config:
Mem Violations: >-
The total number of out-of-bounds accesses made to the LDS, per normalization
unit. This is unused and expected to be zero in most configurations for
modern CDNA\u2122 accelerators.
modern CDNA™ accelerators.
LDS Command FIFO Full Rate: >-
The number of cycles where the LDS command FIFO was full, per normalization
unit. High values indicate backpressure in LDS instruction dispatch, which
may stall wavefronts waiting to issue LDS operations.
LDS Data FIFO Full Rate: >-
The number of cycles where the LDS data FIFO was full, per normalization
unit. High values indicate backpressure in the LDS data return path, which
may stall wavefronts waiting for LDS read/atomic results.
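A small sketch of the arithmetic behind the LDS bandwidth and Bank Conflict Rate descriptions above, with invented numbers (bytes over kernel duration, and conflict-servicing cycles over the cycles the accesses would otherwise have required):

```python
# Invented values; bytes moved / kernel duration, as in the LDS LOAD/STORE/ATOMIC
# Bandwidth descriptions above.
lds_bytes_loaded   = 512 * 1024 * 1024  # execution-mask-aware bytes loaded from LDS
kernel_duration_ns = 2_000_000          # 2 ms kernel

lds_load_bw = lds_bytes_loaded / kernel_duration_ns  # bytes per ns == GB/s
print(f"LDS LOAD Bandwidth ~ {lds_load_bw:.1f} GB/s")  # ~268.4 GB/s

# Bank Conflict Rate, as described above: cycles spent servicing bank conflicts over
# the cycles the same accesses would have required (both values invented).
conflict_cycles = 30_000
required_cycles = 200_000
print(f"Bank Conflict Rate ~ {100.0 * conflict_cycles / required_cycles:.1f}%")  # 15.0%
```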
1500_address_processing_unit_and_data_return_path_ta_td.yaml
@@ -216,6 +216,11 @@ Panel Config:
Global/Generic Read Instructions: The total number of global & generic memory
read instructions executed on all compute units on the accelerator, per normalization
unit.
Global/Generic Read Instructions for LDS: >-
The total number of global & generic memory read instructions that return
data directly to LDS, executed on all compute units on the accelerator, per
normalization unit. These operations bypass the register file and write
results directly into LDS memory.
Global/Generic Write Instructions: The total number of global & generic memory
write instructions executed on all compute units on the accelerator, per normalization
unit.
@@ -226,6 +231,11 @@ Panel Config:
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Read Instructions for LDS: >-
The total number of spill/stack memory read instructions that return data
directly to LDS, executed on all compute units on the accelerator, per
normalization unit. These operations bypass the register file and write
results directly into LDS memory.
Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
executed on all compute units on the accelerator, per normalization unit.
Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
1600_vector_l1_data_cache.yaml
@@ -416,6 +416,26 @@ Panel Config:
Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
waiting to issue a request for data to the L2 cache divided by the number of
cycles where the vL1D is active.
Stalled on Address: >-
The ratio of the number of cycles where the vL1D is stalled waiting for the
address processing unit to send address information, divided by the number
of cycles where the vL1D is active.
Stalled on Data: >-
The ratio of the number of cycles where the vL1D is stalled waiting for the
address processing unit to send write/atomic data, divided by the number of
cycles where the vL1D is active.
Stalled on Latency FIFO: >-
The ratio of the number of cycles where the vL1D is stalled because the
latency FIFO (tracking outstanding requests) is full, divided by the number
of cycles where the vL1D is active.
Stalled on Request FIFO: >-
The ratio of the number of cycles where the vL1D is stalled because the
request FIFO (queuing memory requests) is full, divided by the number of
cycles where the vL1D is active.
Stalled on Read Return: >-
The ratio of the number of cycles where the vL1D is stalled waiting for read
data to return from L2 before it can write into the cache, divided by the
number of cycles where the vL1D is active.
Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
due to Read requests with conflicting tags being looked up concurrently, divided
by the number of cycles where the vL1D is active.
@@ -454,6 +474,22 @@ Panel Config:
value does not consider partial requests, so for instance, if only a single
value is requested in a cache line, the data movement will still be counted
as a full cache line.
Tag RAM 0 Req: >-
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.
Tag RAM 1 Req: >-
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.
Tag RAM 2 Req: >-
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.
Tag RAM 3 Req: >-
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.
Comment on lines +478 to +492

Copilot AI Jan 27, 2026

The descriptions for Tag RAM 0-3 Req metrics state "L2 cache requests from this vL1D" which is potentially confusing. These are more accurately described as requests within the vL1D cache lookup process that need to check Tag RAM banks, not L2 cache requests. The wording could be clarified to say "vL1D cache requests that mapped to Tag RAM bank X" or "cache line lookups that mapped to Tag RAM bank X" to better reflect that these are vL1D internal operations.

Suggested change
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 0 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.
Tag RAM 1 Req: >-
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 1 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.
Tag RAM 2 Req: >-
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 2 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.
Tag RAM 3 Req: >-
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 3 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.

L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
by the vL1D and must be retrieved from the L2 Cache, per normalization
unit.
@@ -470,6 +506,38 @@ Panel Config:
L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
cache took to issue and receive acknowledgement of a write request to the L2
Cache. This number also includes requests for atomics without return values.
Inflight Req: >-
The sum of inflight client-to-UTCL1 translation requests per cycle, per
normalization unit. Measures the average number of translation requests
actively being processed by the UTCL1 at any given time.
Misses under Translation Miss: >-
The number of translation requests that missed in the UTCL1 while another
miss for the same translation was already being processed, per normalization
unit. These are secondary misses that occur before the first miss completes.
Cache Full Stall: >-
The number of cycles the UTCL1 was stalled due to the inflight request
counter reaching its maximum capacity, per normalization unit. Indicates
the translation cache cannot accept more requests.
Cache Miss Stall: >-
The number of cycles the UTCL1 was stalled due to multiple cache misses
being arbitrated simultaneously, per normalization unit.
Serialization Stall: >-
The number of cycles the UTCL1 was stalled due to serializing translation
requests through the cache, per normalization unit.
Thrashing Stall: >-
The number of cycles the UTCL1 was stalled due to cache thrashing (rapid
eviction and reload of translations), per normalization unit. This is an
estimation when thrashing detection is active.
Latency FIFO Stall: >-
The number of cycles the UTCL1-to-UTCL2 latency-hiding FIFO was full,
per normalization unit. This FIFO queues requests to the L2 translation
cache (UTCL2).
Resident Page Full Stall: >-
The number of cycles the UTCL1 was stalled because the latency-hiding
FIFO output indicated a non-resident page, per normalization unit.
UTCL2 Stall: >-
The number of cycles the UTCL1 was stalled due to running out of request
credits for the UTCL2 (L2 translation cache), per normalization unit.
NC - Read: Total read requests with NC mtype from this TCP to all TCCs. Sum over
TCP instances per normalization unit.
UC - Read: Total read requests with UC mtype from this TCP to all TCCs. Sum over
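Since the Tag RAM 0-3 Req descriptions note that distribution across banks affects lookup efficiency, here is a rough way to check for bank imbalance from those four counters; the per-bank counts are invented:

```python
# Invented per-bank request counts for the four "Tag RAM N Req" counters above.
tag_ram_req = [120_000, 118_500, 260_000, 119_000]  # banks 0..3

total = sum(tag_ram_req)
ideal_share = total / len(tag_ram_req)
imbalance = max(tag_ram_req) / ideal_share  # 1.0 means perfectly balanced
print(f"Hottest Tag RAM bank sees {imbalance:.2f}x its ideal share of lookups")
```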
1700_l2_cache.yaml
@@ -622,6 +622,13 @@ Panel Config:
accelerator, however on an MI2XX this corresponds to non-temporal load or stores.
The L2 cache attempts to evict streaming requests before normal requests when
the L2 is at capacity.
Bypass Req: Number of L2 cache bypass requests measured at the tag block. These
requests skip normal cache lookup and go directly to memory, typically for
uncacheable or write-combined memory accesses.
Input Buffer Req: Number of raw memory requests received through the L2 cache
Input Buffer (IB) from compute units and other hardware clients. This counts
all incoming requests before cache lookup processing and represents the total
incoming demand on the L2 cache.
Probe Req: The number of coherence probe requests made to the L2 cache from outside
the accelerator. On an MI2XX, probe requests may be generated by, for example,
writes to fine-grained device memory or by writes to coarse-grained device memory.
@@ -659,6 +666,8 @@ Panel Config:
data from any memory location, per normalization unit.
Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
data from any memory location, per normalization unit.
Read (128B): The total number of L2 requests to Infinity Fabric to read 128B
of data from any memory location, per normalization unit.
Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
data from any memory location, per normalization unit. 64B requests for uncached
data are counted as two 32B uncached data requests.
@@ -700,6 +709,9 @@ Panel Config:
requests due to Infinity Fabric traffic, divided by total duration.
Atomic Bandwidth - HBM: Total number of bytes due to L2 atomic requests due to
HBM traffic, divided by total duration.
Atomic - HBM: The total number of L2 atomic requests (either 32-byte or 64-byte)
destined for the accelerator's local high-bandwidth memory (HBM), per normalization
unit. These are read-modify-write operations that update memory atomically.
Atomic: The total number of L2 requests to Infinity Fabric to atomically update
32B or 64B of data in any memory location, per normalization unit. See Request
flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
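The new Read (128B) counter slots into the same bytes-moved arithmetic as the existing 32B and 64B read counters. An illustrative combination with invented request counts (not the exact formula the tool's Read Bandwidth metric uses):

```python
# Invented request counts; combines the 32B/64B/128B L2 -> Infinity Fabric read
# counters above into total bytes and an effective read bandwidth.
read_32b, read_64b, read_128b = 1_000_000, 4_000_000, 500_000
kernel_duration_ns = 3_000_000  # 3 ms

bytes_read = 32 * read_32b + 64 * read_64b + 128 * read_128b
read_bw = bytes_read / kernel_duration_ns  # bytes per ns == GB/s
print(f"L2 -> Infinity Fabric reads: {bytes_read / 1e9:.2f} GB at ~{read_bw:.1f} GB/s")
```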
16 changes: 8 additions & 8 deletions projects/rocprofiler-compute/src/utils/.config_hashes.json
@@ -123,17 +123,17 @@
"0200_system_speed_of_light.yaml": "e86cae8714d463cfcca3069764e73e7e",
"0300_memory_chart.yaml": "2c82fa6f81a0dda679706d36b99e7913",
"0400_roofline.yaml": "2bd3b630b72d6d165c0d30cf481136a9",
"0500_command_processor_cpc_cpf.yaml": "5dd5b6a6f14fc0066a2a581f0a119fc3",
"0600_workgroup_manager_spi.yaml": "c885a0a2f73900a53837f01c5f0e89e6",
"0500_command_processor_cpc_cpf.yaml": "f9e23fb9b86cfdfee28648b64c8ec90c",
"0600_workgroup_manager_spi.yaml": "e4a3128bfa8780266bd6866434601615",
"0700_wavefront.yaml": "e034e7eaca38908b20e3ad1e0c13291f",
"1000_compute_units_instruction_mix.yaml": "6f89306b199221960d259c2a8be958b9",
"1100_compute_units_compute_pipeline.yaml": "cac13334c4fe654e081713f168ed2adf",
"1200_local_data_share_lds.yaml": "f6869eed2bedcd0d94dd1b8d99adc30a",
"1000_compute_units_instruction_mix.yaml": "b0a326a9c0b6d118cb11e6a132581929",
"1100_compute_units_compute_pipeline.yaml": "273d8ae0fd7b3e98dc8d3a6825aba6c7",
"1200_local_data_share_lds.yaml": "7d604e5ec11bff862a0df8b7b898bb56",
"1300_instruction_cache.yaml": "e616b2e4ec05c2d91df43cdaabfc9fea",
"1400_scalar_l1_data_cache.yaml": "393c4aea974c05e45590f3053d66e12e",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "fcfceb3e4236ca995d0f5e24035375df",
"1600_vector_l1_data_cache.yaml": "0d32f74b9b62bdfff277498e392206ee",
"1700_l2_cache.yaml": "0aa72d65590ee1c38812ef545b8a1b8b",
"1500_address_processing_unit_and_data_return_path_ta_td.yaml": "87aba9c2e08cf2075c919c7180be8370",
"1600_vector_l1_data_cache.yaml": "8669c540b6d83437c8c41fa4f5f66561",
"1700_l2_cache.yaml": "df4b44df63d9f3eb048fc52a9ab56502",
"1800_l2_cache_per_channel.yaml": "5d16f669dbb4fb3fbb28ac14b597a248",
"2100_pc_sampling.yaml": "8049866f25214544f1e53a9e2f08399b"
}
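The hash bumps above track edits to the panel config YAMLs. Assuming each entry is simply the MD5 hex digest of the config file's contents (an assumption; the hashing scheme is not shown in this diff), the values could be regenerated with something like:

```python
# Assumption: each .config_hashes.json entry is the MD5 hex digest of the raw bytes
# of the corresponding panel config file. The path below is hypothetical.
import hashlib
from pathlib import Path

def config_md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

print(config_md5(Path("1700_l2_cache.yaml")))
```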