[rocprofiler-compute] [Documentation] Add metric descriptions for missing gfx950 metrics #2874
base: users/vedithal/rocprofiler-compute-fix-metrics-description
Add metric descriptions for gfx950 metrics that were previously missing descriptions across multiple hardware blocks:
- Command Processor (CPC): 3 metrics
- Workgroup Manager (SPI): 3 metrics
- Instruction Mix: 2 metrics
- Compute Pipeline: 3 metrics
- Local Data Share (LDS): 8 metrics
- TA/TD Units: 3 metrics
- vL1D Cache: 18 metrics
- L2 Cache: 9 metrics

Public-facing documentation has also been updated with this information.
Pull request overview
This PR adds metric descriptions for 48 previously undocumented gfx950 (AMD Instinct MI350 series) performance metrics. The additions provide essential context for users profiling and analyzing performance on the gfx950 architecture.
Changes:
- Added metric descriptions across 8 hardware blocks: CPC (3 metrics), SPI (3 metrics), Instruction Mix (2 metrics), Compute Pipeline (3 metrics), LDS (8 metrics), TA/TD (3 metrics), vL1D Cache (18 metrics), and L2 Cache (9 metrics)
- Updated per-architecture YAML configuration files and public-facing documentation
- Regenerated configuration hash database for integrity verification
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| gfx950_metrics_description.yaml | Main per-architecture metric definitions file with 48 new metric descriptions |
| .config_hashes.json | Updated hash values for all modified configuration files to maintain integrity verification |
| 1700_l2_cache.yaml | Added descriptions for 9 L2 cache metrics, including Bypass Req, Input Buffer Req, Atomic-HBM, and Read 128B |
| 1600_vector_l1_data_cache.yaml | Added descriptions for 18 vL1D cache metrics including stall metrics, Tag RAM requests, and UTCL1 metrics |
| 1500_address_processing_unit_and_data_return_path_ta_td.yaml | Added descriptions for 3 TA/TD instruction metrics for LDS-direct operations |
| 1200_local_data_share_lds.yaml | Added descriptions for 8 LDS metrics including bandwidth metrics, instruction counts, and FIFO full rates |
| 1100_compute_units_compute_pipeline.yaml | Added descriptions for 3 compute pipeline metrics including F8/F6F4 ops and VALU co-issue efficiency |
| 1000_compute_units_instruction_mix.yaml | Added descriptions for 2 instruction mix metrics (MFMA-F6F4, Spill/Stack Coalesceable) |
| 0600_workgroup_manager_spi.yaml | Added descriptions for 3 SPI metrics including scheduler-pipe wave occupancy and utilization |
| 0500_command_processor_cpc_cpf.yaml | Added descriptions for 3 CPC metrics including SYNC FIFO, CANE stall, and ADC utilization |
| metrics_description.yaml | Updated public-facing documentation with all 48 new metric descriptions |
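
The hash regeneration listed for `.config_hashes.json` can be pictured with a minimal sketch. The actual schema of that file and the hashing algorithm rocprofiler-compute uses are not shown in this PR, so the SHA-256 digest, the flat path-to-digest mapping, and the helper names `hash_configs` and `changed_files` below are all assumptions for illustration only.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def hash_configs(config_dir: Path) -> dict[str, str]:
    """Map each YAML file under config_dir (relative path) to its SHA-256 hex digest.

    Assumption: SHA-256 over raw file bytes; the real tool may hash differently.
    """
    hashes = {}
    for path in sorted(config_dir.rglob("*.yaml")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        hashes[path.relative_to(config_dir).as_posix()] = digest
    return hashes


def changed_files(stored: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return files whose digest is missing from, or differs from, the stored database."""
    return sorted(name for name, digest in current.items() if stored.get(name) != digest)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        config = root / "1700_l2_cache.yaml"
        config.write_text("Bypass Req: old description\n")
        stored = hash_configs(root)

        # Editing a description changes the digest, so the database must be regenerated.
        config.write_text("Bypass Req: new description\n")
        current = hash_configs(root)
        print(changed_files(stored, current))
        print(json.dumps(current, indent=2))
```

A check of this kind explains why a description-only change still touches the hash file: any byte-level edit to a configuration file invalidates its stored digest.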
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | ||
| unit: Requests per Normalization Unit | ||
| Tag RAM 1 Req: | ||
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | ||
| unit: Requests per Normalization Unit | ||
| Tag RAM 2 Req: | ||
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | ||
| unit: Requests per Normalization Unit | ||
| Tag RAM 3 Req: | ||
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. |
Copilot
AI
Jan 27, 2026
The descriptions for Tag RAM 0-3 Req metrics state "L2 cache requests from this vL1D" which is potentially confusing. These are more accurately described as requests within the vL1D cache lookup process that need to check Tag RAM banks, not L2 cache requests. The wording could be clarified to say "vL1D cache requests that mapped to Tag RAM bank X" or "cache line lookups that mapped to Tag RAM bank X" to better reflect that these are vL1D internal operations.
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 1 Req: | |
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 2 Req: | |
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 3 Req: | |
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| rst: The total number of vL1D cache line lookup requests that mapped to Tag RAM bank 0, per normalization unit. These are internal vL1D tag lookup operations, not L2 cache requests. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 1 Req: | |
| rst: The total number of vL1D cache line lookup requests that mapped to Tag RAM bank 1, per normalization unit. These are internal vL1D tag lookup operations, not L2 cache requests. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 2 Req: | |
| rst: The total number of vL1D cache line lookup requests that mapped to Tag RAM bank 2, per normalization unit. These are internal vL1D tag lookup operations, not L2 cache requests. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 3 Req: | |
| rst: The total number of vL1D cache line lookup requests that mapped to Tag RAM bank 3, per normalization unit. These are internal vL1D tag lookup operations, not L2 cache requests. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. |
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | ||
| unit: Requests per Normalization Unit | ||
| Tag RAM 1 Req: | ||
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | ||
| unit: Requests per Normalization Unit | ||
| Tag RAM 2 Req: | ||
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | ||
| unit: Requests per Normalization Unit | ||
| Tag RAM 3 Req: | ||
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. |
Copilot
AI
Jan 27, 2026
The descriptions for Tag RAM 0-3 Req metrics state "L2 cache requests from this vL1D" which is potentially confusing. These are more accurately described as requests within the vL1D cache lookup process that need to check Tag RAM banks, not L2 cache requests. The wording could be clarified to say "vL1D cache requests that mapped to Tag RAM bank X" or "cache line lookups that mapped to Tag RAM bank X" to better reflect that these are vL1D internal operations.
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 1 Req: | |
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 2 Req: | |
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 3 Req: | |
| rst: The total number of L2 cache requests from this vL1D that mapped to Tag RAM bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| rst: The total number of vL1D cache requests (cache line lookups) that mapped to Tag RAM bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 1 Req: | |
| rst: The total number of vL1D cache requests (cache line lookups) that mapped to Tag RAM bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 2 Req: | |
| rst: The total number of vL1D cache requests (cache line lookups) that mapped to Tag RAM bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| unit: Requests per Normalization Unit | |
| Tag RAM 3 Req: | |
| rst: The total number of vL1D cache requests (cache line lookups) that mapped to Tag RAM bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. Distribution across banks affects lookup efficiency. |
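
Because the same wording issue recurs for all four banks and in more than one file, one way to keep the entries consistent is to generate them from a single template. The sketch below uses one of the suggested wordings from this review; the helper names `tag_ram_entries` and `find_stale` are hypothetical and not part of rocprofiler-compute, and nothing here reflects how the project actually maintains its YAML files.

```python
# Wording taken from one of the review suggestions; {bank} is filled per entry.
TEMPLATE = (
    "The total number of vL1D cache requests (cache line lookups) that mapped to "
    "Tag RAM bank {bank}, per normalization unit. The vL1D cache uses multiple "
    "Tag RAM banks for parallel tag lookups. Distribution across banks affects "
    "lookup efficiency."
)
UNIT = "Requests per Normalization Unit"

# Phrase flagged by the review as misleading.
STALE_PHRASE = "L2 cache requests from this vL1D"


def tag_ram_entries(num_banks: int = 4) -> dict[str, dict[str, str]]:
    """Build the 'Tag RAM N Req' entries (banks 0..num_banks-1) from one template."""
    return {
        f"Tag RAM {bank} Req": {"rst": TEMPLATE.format(bank=bank), "unit": UNIT}
        for bank in range(num_banks)
    }


def find_stale(entries: dict[str, dict[str, str]]) -> list[str]:
    """Return metric names whose description still contains the flagged phrase."""
    return [name for name, entry in entries.items() if STALE_PHRASE in entry["rst"]]


if __name__ == "__main__":
    entries = tag_ram_entries()
    for name, entry in entries.items():
        print(f"{name}:\n  rst: {entry['rst']}\n  unit: {entry['unit']}")
    print("stale entries:", find_stale(entries))
```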
| The total number of L2 cache requests from this vL1D that mapped to Tag RAM | ||
| bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks | ||
| for parallel tag lookups. Distribution across banks affects lookup efficiency. | ||
| Tag RAM 1 Req: >- | ||
| The total number of L2 cache requests from this vL1D that mapped to Tag RAM | ||
| bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks | ||
| for parallel tag lookups. Distribution across banks affects lookup efficiency. | ||
| Tag RAM 2 Req: >- | ||
| The total number of L2 cache requests from this vL1D that mapped to Tag RAM | ||
| bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks | ||
| for parallel tag lookups. Distribution across banks affects lookup efficiency. | ||
| Tag RAM 3 Req: >- | ||
| The total number of L2 cache requests from this vL1D that mapped to Tag RAM | ||
| bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks | ||
| for parallel tag lookups. Distribution across banks affects lookup efficiency. |
Copilot
AI
Jan 27, 2026
The descriptions for Tag RAM 0-3 Req metrics state "L2 cache requests from this vL1D" which is potentially confusing. These are more accurately described as requests within the vL1D cache lookup process that need to check Tag RAM banks, not L2 cache requests. The wording could be clarified to say "vL1D cache requests that mapped to Tag RAM bank X" or "cache line lookups that mapped to Tag RAM bank X" to better reflect that these are vL1D internal operations.
| The total number of L2 cache requests from this vL1D that mapped to Tag RAM | |
| bank 0, per normalization unit. The vL1D cache uses multiple Tag RAM banks | |
| for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| Tag RAM 1 Req: >- | |
| The total number of L2 cache requests from this vL1D that mapped to Tag RAM | |
| bank 1, per normalization unit. The vL1D cache uses multiple Tag RAM banks | |
| for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| Tag RAM 2 Req: >- | |
| The total number of L2 cache requests from this vL1D that mapped to Tag RAM | |
| bank 2, per normalization unit. The vL1D cache uses multiple Tag RAM banks | |
| for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| Tag RAM 3 Req: >- | |
| The total number of L2 cache requests from this vL1D that mapped to Tag RAM | |
| bank 3, per normalization unit. The vL1D cache uses multiple Tag RAM banks | |
| for parallel tag lookups. Distribution across banks affects lookup efficiency. | |
| The total number of vL1D cache requests (cache line lookups) that mapped to | |
| Tag RAM bank 0 during the vL1D tag lookup process, per normalization unit. | |
| The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. | |
| Distribution across banks affects lookup efficiency. | |
| Tag RAM 1 Req: >- | |
| The total number of vL1D cache requests (cache line lookups) that mapped to | |
| Tag RAM bank 1 during the vL1D tag lookup process, per normalization unit. | |
| The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. | |
| Distribution across banks affects lookup efficiency. | |
| Tag RAM 2 Req: >- | |
| The total number of vL1D cache requests (cache line lookups) that mapped to | |
| Tag RAM bank 2 during the vL1D tag lookup process, per normalization unit. | |
| The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. | |
| Distribution across banks affects lookup efficiency. | |
| Tag RAM 3 Req: >- | |
| The total number of vL1D cache requests (cache line lookups) that mapped to | |
| Tag RAM bank 3 during the vL1D tag lookup process, per normalization unit. | |
| The vL1D cache uses multiple Tag RAM banks for parallel tag lookups. | |
| Distribution across banks affects lookup efficiency. |
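
The statement that distribution across banks affects lookup efficiency can be made concrete with a small descriptive statistic. This is only a sketch: the per-bank counts are made-up values, the helper names `bank_shares` and `imbalance` are hypothetical, and the PR does not say how imbalance translates into actual hardware cost.

```python
def bank_shares(requests_per_bank: list[float]) -> list[float]:
    """Fraction of all Tag RAM requests that landed on each bank."""
    total = sum(requests_per_bank)
    if total == 0:
        return [0.0] * len(requests_per_bank)
    return [count / total for count in requests_per_bank]


def imbalance(requests_per_bank: list[float]) -> float:
    """Busiest bank divided by the mean across banks; 1.0 means perfectly even."""
    total = sum(requests_per_bank)
    if total == 0:
        return 1.0
    mean = total / len(requests_per_bank)
    return max(requests_per_bank) / mean


if __name__ == "__main__":
    # Hypothetical 'Tag RAM 0-3 Req' values, per normalization unit.
    even = [100.0, 100.0, 100.0, 100.0]
    skewed = [400.0, 0.0, 0.0, 0.0]
    print(bank_shares(even), imbalance(even))
    print(bank_shares(skewed), imbalance(skewed))
```

With four banks, a value near 1.0 suggests lookups are spread evenly, while a value near 4.0 means nearly all lookups map to a single bank.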
Motivation
This PR adds metric descriptions for 48 gfx950 metrics that were previously missing descriptions across the following hardware blocks: Command Processor (CPC), Workgroup Manager (SPI), Instruction Mix, Compute Pipeline, Local Data Share (LDS), TA/TD Units, vL1D Cache, and L2 Cache.
These descriptions provide essential context for users profiling and analyzing performance on the gfx950 architecture, including details on hardware counters, event IDs, and architectural behavior.
Technical Details
This PR uses material such as:
The aforementioned information was fed to a RAG-based LLM agent, which derived the metric descriptions.
Changes:
- Per-architecture metric definitions file (gfx950_metrics_description.yaml)
- Public-facing documentation (docs/data/metrics_description.yaml) has been updated with all new metric descriptions
JIRA ID
ROCM-1126
AIPROFCOMP-9
Test Plan
Test Result
Submission Checklist