@vedithal-amd vedithal-amd commented Jan 27, 2026

Motivation

This PR adds metric descriptions for 48 gfx950 metrics that were previously missing descriptions across the following hardware blocks:

  • Command Processor (CPC): 3 metrics
  • Workgroup Manager (SPI): 3 metrics
  • Instruction Mix: 2 metrics
  • Compute Pipeline: 3 metrics
  • Local Data Share (LDS): 8 metrics
  • TA/TD Units: 3 metrics
  • vL1D Cache: 18 metrics
  • L2 Cache: 9 metrics

These descriptions give users the context they need to profile and analyze performance on the gfx950 architecture, including details on hardware counters, event IDs, and architectural behavior.

Technical Details

This PR uses material such as:

  • counter definition
  • metric formulas used in avg/min/max/value/peak fields
  • metric descriptions of similar metrics
  • MI350 architecture documentation

The material listed above was fed to a RAG-based LLM agent, which derived the metric descriptions.

Changes:

  • Added metric descriptions to analysis config YAML files for gfx950
  • Updated per-architecture metric definitions (gfx950_metrics_description.yaml)
  • Public-facing documentation (docs/data/metrics_description.yaml) has been updated with all new metric descriptions
  • Regenerated delta files for older architectures (gfx941, gfx942)
  • Updated hash database for configuration integrity verification
  • All pre-commit hooks pass successfully
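
The hash database step can be pictured with a minimal sketch. It assumes the database is a JSON object that maps each config file's relative path to a SHA-256 digest of its contents. The actual layout of `.config_hashes.json` and the hash algorithm are not shown in this PR, and the function names `build_hash_db`, `find_modified`, and `dump_hash_db` are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes (algorithm is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_hash_db(config_dir: Path) -> dict[str, str]:
    """Map each YAML file under config_dir (relative POSIX path) to its digest."""
    return {
        p.relative_to(config_dir).as_posix(): file_digest(p)
        for p in sorted(config_dir.rglob("*.yaml"))
    }


def find_modified(config_dir: Path, hash_db: dict[str, str]) -> list[str]:
    """List files whose current digest differs from, or is absent in, hash_db."""
    current = build_hash_db(config_dir)
    return sorted(
        name for name, digest in current.items() if hash_db.get(name) != digest
    )


def dump_hash_db(hash_db: dict[str, str]) -> str:
    """Serialize the mapping as stable, human-readable JSON."""
    return json.dumps(hash_db, indent=2, sort_keys=True)
```

Under these assumptions, editing a description changes the file's digest, so `find_modified` reports the file until the stored mapping is regenerated.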

JIRA ID

ROCM-1126

AIPROFCOMP-9

Test Plan

Test Result

Submission Checklist

Add metric descriptions for gfx950 metrics that were previously missing
descriptions across multiple hardware blocks:

- Command Processor (CPC): 3 metrics
- Workgroup Manager (SPI): 3 metrics
- Instruction Mix: 2 metrics
- Compute Pipeline: 3 metrics
- Local Data Share (LDS): 8 metrics
- TA/TD Units: 3 metrics
- vL1D Cache: 18 metrics
- L2 Cache: 9 metrics

Public-facing documentation has also been updated with this information.
Copilot AI review requested due to automatic review settings January 27, 2026 00:55
@vedithal-amd vedithal-amd requested review from a team and prbasyal-amd as code owners January 27, 2026 00:55
@github-actions github-actions bot added the documentation and project: rocprofiler-compute labels Jan 27, 2026

Copilot AI left a comment

Pull request overview

This PR adds metric descriptions for 48 previously undocumented gfx950 (AMD Instinct MI350 series) performance metrics. The additions provide essential context for users profiling and analyzing performance on the gfx950 architecture.

Changes:

  • Added metric descriptions across 8 hardware blocks: CPC (3 metrics), SPI (3 metrics), Instruction Mix (2 metrics), Compute Pipeline (3 metrics), LDS (8 metrics), TA/TD (3 metrics), vL1D Cache (18 metrics), and L2 Cache (9 metrics)
  • Updated per-architecture YAML configuration files and public-facing documentation
  • Regenerated configuration hash database for integrity verification

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| gfx950_metrics_description.yaml | Main per-architecture metric definitions file with 48 new metric descriptions |
| .config_hashes.json | Updated hash values for all modified configuration files to maintain integrity verification |
| 1700_l2_cache.yaml | Added descriptions for 9 L2 cache metrics, including Bypass Req, Input Buffer Req, Atomic-HBM, and Read 128B |
| 1600_vector_l1_data_cache.yaml | Added descriptions for 18 vL1D cache metrics, including stall metrics, Tag RAM requests, and UTCL1 metrics |
| 1500_address_processing_unit_and_data_return_path_ta_td.yaml | Added descriptions for 3 TA/TD instruction metrics for LDS-direct operations |
| 1200_local_data_share_lds.yaml | Added descriptions for 8 LDS metrics, including bandwidth metrics, instruction counts, and FIFO full rates |
| 1100_compute_units_compute_pipeline.yaml | Added descriptions for 3 compute pipeline metrics, including F8/F6F4 ops and VALU co-issue efficiency |
| 1000_compute_units_instruction_mix.yaml | Added descriptions for 2 instruction mix metrics (MFMA-F6F4, Spill/Stack Coalesceable) |
| 0600_workgroup_manager_spi.yaml | Added descriptions for 3 SPI metrics, including scheduler-pipe wave occupancy and utilization |
| 0500_command_processor_cpc_cpf.yaml | Added descriptions for 3 CPC metrics, including SYNC FIFO, CANE stall, and ADC utilization |
| metrics_description.yaml | Updated public-facing documentation with all 48 new metric descriptions |
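
To illustrate what "missing descriptions" means in practice, here is a minimal sketch that lists metrics lacking a non-empty description. The `rst` and `unit` keys mirror the snippets quoted in this review. The flat name-to-entry dictionary, the sample entries, and the function name `metrics_missing_description` are assumptions made for illustration, not the actual file layout or tooling.

```python
def metrics_missing_description(metrics: dict[str, dict]) -> list[str]:
    """Return names of metrics whose `rst` description is absent or blank."""
    return sorted(
        name
        for name, entry in metrics.items()
        if not str((entry or {}).get("rst") or "").strip()
    )


# Hypothetical sample data; real entries live in the per-architecture YAML files.
sample = {
    "Tag RAM 0 Req": {
        "rst": "The total number of vL1D cache requests that mapped to Tag RAM bank 0.",
        "unit": "Requests per Normalization Unit",
    },
    "Tag RAM 1 Req": {"rst": "", "unit": "Requests per Normalization Unit"},
    "Tag RAM 2 Req": {"unit": "Requests per Normalization Unit"},
}

print(metrics_missing_description(sample))
```

With the sample data, only the first entry counts as documented; an empty string and a missing key are both treated as missing.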

Comment on lines +425 to +434
rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 1 Req:
rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 2 Req:
rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 3 Req:
rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.

Copilot AI Jan 27, 2026

The descriptions for Tag RAM 0-3 Req metrics state "L2 cache requests from this vL1D" which is potentially confusing. These are more accurately described as requests within the vL1D cache lookup process that need to check Tag RAM banks, not L2 cache requests. The wording could be clarified to say "vL1D cache requests that mapped to Tag RAM bank X" or "cache line lookups that mapped to Tag RAM bank X" to better reflect that these are vL1D internal operations.

Suggested change
rst: The total number of vL1D cache line lookup requests that mapped to Tag RAM bank 0, per normalization unit. These are internal vL1D tag lookup operations, not L2 cache requests. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 1 Req:
rst: The total number of vL1D cache line lookup requests that mapped to Tag RAM bank 1, per normalization unit. These are internal vL1D tag lookup operations, not L2 cache requests. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 2 Req:
rst: The total number of vL1D cache line lookup requests that mapped to Tag RAM bank 2, per normalization unit. These are internal vL1D tag lookup operations, not L2 cache requests. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 3 Req:
rst: The total number of vL1D cache line lookup requests that mapped to Tag RAM bank 3, per normalization unit. These are internal vL1D tag lookup operations, not L2 cache requests. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
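
A wording check like the one raised in this comment could be automated. The sketch below is a hypothetical helper, not part of rocprofiler-compute: it assumes descriptions are available as a plain name-to-text dictionary and flags only the specific phrase called out above, so that the clarifying sentence "not L2 cache requests" in the suggested text is not reported.

```python
import re

# The wording the review comment calls out as confusing.
CONFUSING = re.compile(r"L2 cache requests from this vL1D", re.IGNORECASE)
# Matches the metric names quoted in this review, e.g. "Tag RAM 0 Req".
TAG_RAM_NAME = re.compile(r"^Tag RAM \d+ Req$")


def flag_confusing_tag_ram_descriptions(descriptions: dict[str, str]) -> list[str]:
    """Return Tag RAM metric names whose description still uses the flagged wording."""
    return sorted(
        name
        for name, text in descriptions.items()
        if TAG_RAM_NAME.match(name) and CONFUSING.search(text)
    )
```

Matching the full phrase rather than the substring "L2 cache requests" is a deliberate choice here, since the suggested replacement mentions L2 only to rule it out.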

Comment on lines +961 to +970
rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 1 Req:
rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 2 Req:
rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 3 Req:
rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.

Copilot AI Jan 27, 2026

The descriptions for Tag RAM 0-3 Req metrics state "L2 cache requests from this vL1D" which is potentially confusing. These are more accurately described as requests within the vL1D cache lookup process that need to check Tag RAM banks, not L2 cache requests. The wording could be clarified to say "vL1D cache requests that mapped to Tag RAM bank X" or "cache line lookups that mapped to Tag RAM bank X" to better reflect that these are vL1D internal operations.

Suggested change
rst: The total number of vL1D cache requests (cache line lookups) that mapped to Tag RAM bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 1 Req:
rst: The total number of vL1D cache requests (cache line lookups) that mapped to Tag RAM bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 2 Req:
rst: The total number of vL1D cache requests (cache line lookups) that mapped to Tag RAM bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
unit: Requests per Normalization Unit
Tag RAM 3 Req:
rst: The total number of vL1D cache requests (cache line lookups) that mapped to Tag RAM bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency.
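
The descriptions note that distribution across banks affects lookup efficiency. One hypothetical way to read the four Tag RAM counters together is to compute each bank's share of requests and a max-to-mean ratio. This is a sketch for interpreting the numbers, not a metric defined in this PR; the function name `bank_balance` and the counter values are made up, and no threshold for a "bad" ratio is implied.

```python
def bank_balance(requests_per_bank: list[float]) -> tuple[list[float], float]:
    """Return each bank's share of requests and the max-to-mean ratio.

    A ratio of 1.0 means a perfectly even distribution; with four banks the
    worst case (all requests on one bank) is 4.0.
    """
    total = sum(requests_per_bank)
    if total == 0:
        # No requests recorded: report zero shares and a neutral ratio.
        return [0.0] * len(requests_per_bank), 1.0
    shares = [r / total for r in requests_per_bank]
    mean = total / len(requests_per_bank)
    return shares, max(requests_per_bank) / mean


# Made-up counter values for Tag RAM 0-3 Req, for illustration only.
shares, imbalance = bank_balance([400.0, 100.0, 300.0, 200.0])
print(shares, imbalance)
```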

Comment on lines +478 to +492
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.
Tag RAM 1 Req: >-
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.
Tag RAM 2 Req: >-
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.
Tag RAM 3 Req: >-
The total number of L2 cache requests from this vL1D that mapped to Tag RAM
bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks
for parallel tag lookups. Distribution across banks affects lookup efficiency.

Copilot AI Jan 27, 2026

The descriptions for Tag RAM 0-3 Req metrics state "L2 cache requests from this vL1D" which is potentially confusing. These are more accurately described as requests within the vL1D cache lookup process that need to check Tag RAM banks, not L2 cache requests. The wording could be clarified to say "vL1D cache requests that mapped to Tag RAM bank X" or "cache line lookups that mapped to Tag RAM bank X" to better reflect that these are vL1D internal operations.

Suggested change
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 0 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.
Tag RAM 1 Req: >-
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 1 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.
Tag RAM 2 Req: >-
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 2 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.
Tag RAM 3 Req: >-
The total number of vL1D cache requests (cache line lookups) that mapped to
Tag RAM bank 3 during the vL1D tag lookup process, per normalization unit.
The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.
Distribution across banks affects lookup efficiency.
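
The entries in this file use YAML folded block scalars (`>-`), so the wrapped lines above are read as one string per metric. The sketch below approximates that folding in plain Python for the simple case shown here. It assumes a single paragraph with no blank lines and no more-indented lines inside the block; it is not a YAML parser, and `fold_block_scalar` is a hypothetical name.

```python
def fold_block_scalar(lines: list[str]) -> str:
    """Approximate YAML `>-` folding for a single paragraph.

    With no blank or more-indented lines in the block, folding replaces each
    line break with one space, and the `-` indicator strips the final newline.
    """
    return " ".join(line.strip() for line in lines if line.strip())


wrapped = [
    "  The total number of vL1D cache requests (cache line lookups) that mapped to",
    "  Tag RAM bank 0 during the vL1D tag lookup process, per normalization unit.",
    "  The vL1D cache uses multiple Tag RAM banks for parallel tag lookups.",
    "  Distribution across banks affects lookup efficiency.",
]
print(fold_block_scalar(wrapped))
```

Because the folded value is a single line, rewrapping a description at a different column does not change the text that consumers of the file see.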

@vedithal-amd vedithal-amd changed the title from "[rocprofiler-compute] Add metric descriptions for missing gfx950 metrics" to "[rocprofiler-compute] [Documentation] Add metric descriptions for missing gfx950 metrics" Jan 27, 2026