Unifying HT and ST path #2890

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

vlaindic wants to merge 2 commits into develop from users/vlaindic/2nd-lvl-th-unify-ht-and-st

+96 −110

Member

vlaindic commented Jan 27, 2026

Motivation

One of the PRs from the big PR proposed here: #2857 . This PR aims to unify the paths for host-trap and stochastic sample processing.

Technical Details

JIRA ID

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

vlaindic requested review from cfreeamd, dayatsin-amd and kentrussell as code owners

January 27, 2026 18:59

vlaindic requested review from Copilot and jlgreathouse

January 27, 2026 18:59

github-actions bot added the project: rocr-runtime label

vlaindic requested a review from lancesix

January 27, 2026 19:00

Copilot AI reviewed

View reviewed changes

Contributor

Copilot AI left a comment

Pull request overview

This PR unifies the code paths for host-trap (HT) and stochastic (ST) sample processing in the GFX12 trap handler assembly. The refactoring extracts common sample-filling operations into a shared path (.fill_sample_common) to reduce code duplication.

Changes:

Introduced a common sample-filling path that handles timestamp, exec, workgroup information, HW_ID, and correlation ID for both HT and ST paths
Restructured sample data offsets to separate wave-in-group and chiplet information into a dedicated field
Added PC address masking constants and improved PC handling for both trap types

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

projects/rocr-runtime/runtime/hsa-runtime/core/runtime/trap_handler/trap_handler_gfx12.s

    
                // ttmp7 contains WGID_Y in low 16 bits.

                v_writelane_b32   v2, ttmp9, 0                            // wg_id_x

                s_bfe_u32         ttmp6, ttmp7, (16<<16)                  // extract bits 15:0, wg_id_y

                s_bfe_u32         ttmp6, ttmp7, (0 | 16<<16)              // extract bits 15:0, wg_id_y

Copilot AI Jan 27, 2026

The bit field extract expression uses inconsistent spacing compared to line 574. For consistency, format as (0 | (16 << 16)) to match the style used elsewhere in the unified code.

Suggested change

      
              s_bfe_u32         ttmp6, ttmp7, (0 | 16<<16)              // extract bits 15:0, wg_id_y
          
              s_bfe_u32         ttmp6, ttmp7, (0 | (16 << 16))          // extract bits 15:0, wg_id_y

Copilot uses AI. Check for mistakes.

Contributor

jlgreathouse Jan 28, 2026

I agree with this.

projects/rocr-runtime/runtime/hsa-runtime/core/runtime/trap_handler/trap_handler_gfx12.s

Comment on lines +602 to +606

+                // Bits [5:0]   = wave_in_wg (5 bits from ttmp8[29:25])
+                // bits [10:8]  = chiplet (zero on gfx12.0)
+                // Bits [7:6] and [31:11] = reserved and must be zero
+                s_bfe_u32         ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract 5 bits

Copilot AI Jan 27, 2026

The comment says 'Extract 5 bits' but the constant definition at line 121 describes wave_in_wg as occupying bits [29:25], which is actually 5 bits in range but represents a 6-bit capable field. Update the comment to clarify this extracts bits [29:25] for wave_in_wg.

Suggested change

      
              // Bits [5:0]   = wave_in_wg (5 bits from ttmp8[29:25])
          
              // bits [10:8]  = chiplet (zero on gfx12.0)
          
              // Bits [7:6] and [31:11] = reserved and must be zero
          
              s_bfe_u32         ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract 5 bits
          
              // Bits [4:0]   = wave_in_wg value extracted from ttmp8[29:25] (5 bits of a 6-bit-capable field)
          
              // bits [10:8]  = chiplet (zero on gfx12.0)
          
              // Bits [7:5] and [31:11] = reserved and must be zero
          
              s_bfe_u32         ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract bits [29:25] (5 bits) for wave_in_wg

Copilot uses AI. Check for mistakes.

vlaindic requested a review from shwetagkhatri

January 27, 2026 19:00

systems-assistant bot added the organization: ROCm label

jlgreathouse requested changes

View reviewed changes

Contributor

jlgreathouse left a comment

Generally looks like a good refactoring. I think putting the host-trap and stochastic trap common code in the same place will allow for other refactors in the future (e.g., grabbing the perf snapshot registers earlier, then jumping to the common code).

Minor change requested on a few unchanged comments. And I think even the current code is not reading perf snapshot registers in the right order, so a request to fix that.

projects/rocr-runtime/runtime/hsa-runtime/core/runtime/trap_handler/trap_handler_gfx12.s Show resolved Hide resolved

projects/rocr-runtime/runtime/hsa-runtime/core/runtime/trap_handler/trap_handler_gfx12.s Show resolved Hide resolved

projects/rocr-runtime/runtime/hsa-runtime/core/runtime/trap_handler/trap_handler_gfx12.s

    
                // ttmp7 contains WGID_Y in low 16 bits.

                v_writelane_b32   v2, ttmp9, 0                            // wg_id_x

                s_bfe_u32         ttmp6, ttmp7, (16<<16)                  // extract bits 15:0, wg_id_y

                s_bfe_u32         ttmp6, ttmp7, (0 | 16<<16)              // extract bits 15:0, wg_id_y

Contributor

jlgreathouse Jan 28, 2026

I agree with this.

projects/rocr-runtime/runtime/hsa-runtime/core/runtime/trap_handler/trap_handler_gfx12.s

+                // Store wave_in_group and chiplet information in the following format:
+                // Bits [5:0]   = wave_in_wg (5 bits from ttmp8[29:25])
+                // bits [10:8]  = chiplet (zero on gfx12.0)

Contributor

jlgreathouse Jan 28, 2026

Is it worth holding off the chiplet information for a future patch, since it is, like the below bits, must-be-zero so far in this gfx12.0 patchset?

Member Author

vlaindic Jan 28, 2026

You're correct, the zero I introduced this is to ease the future patch integration. And yes, the all bits other than [5:0] representing wave_id are zero. More detailed explanation follows:

  s_bfe_u32         ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract 5 bits
  v_writelane_b32   v2, ttmp6, 0                            // Store wave_in_group in v2

  // Write wave_in_group and chiplet (0 on gfx12.0)
  global_store_b32  v[0:1], v2, off, offset:SAMPLE_OFF_WAVE_IN_GROUP_CHIPLET, scope:SCOPE_SYS

ttmp6[4:0] contains wave id, while other bits are zero, meaning bits ttmp6[10:8] representing chiplet are zero. Then, we copy the content of ttmp6 to v2 (lane 0) and override whatever is in v2 lane 0 to achieve:

v2 lane 0 [4:0] - wave id information
v2 lane 0 [31:5] = 0 including chiplet information.

I didn't want to add another instruction to invalidate chiplet bits as they're already zero.

projects/rocr-runtime/runtime/hsa-runtime/core/runtime/trap_handler/trap_handler_gfx12.s

-                s_getreg_b32      ttmp6, HW_REG_SQ_PERF_SNAPSHOT_PC_HI
-                v_writelane_b32   v3, ttmp6, 0x0
-                global_store_b64  v[0:1], v[2:3], off, offset:SAMPLE_OFF_PC_HOST, scope:SCOPE_SYS  // store PC_HI:PC_LO
+                // For stochastic sampling, use PC from snapshot registers (actual sampled instruction)

Contributor

jlgreathouse Jan 28, 2026

I think these need to be rearranged to happen before the read of SQ_PERF_SNAPSHOT_DATA. The DATA register guards all other perf snapshot registers until the first time it's read.

It's possible that we read DATA and is not in error state. Then before reading PC_LO or PC_HI another snapshot happens on a different wave, leading to the PC_LO and PC_HI values to point to a different PC unrelated to the one where all the data registers came from.

Basically, DATA has to be read last out of all of these.

Member Author

vlaindic Jan 28, 2026

The ordering is done based on the future arch spec. I'll provide more NPI details offline via email. As of now, I don't think we should focus on the stochastic part on GFX12.0, as it's unsupported and never will be.

vlaindic requested a review from jlgreathouse

January 28, 2026 08:49

vlaindic added 2 commits

January 28, 2026 09:49


          Unifying HT and ST path

077d02e


          fixing comment regarding wg_id_y,z

d519ac6

vlaindic force-pushed the users/vlaindic/2nd-lvl-th-unify-ht-and-st branch from d6b3b8d to d519ac6 Compare

January 28, 2026 08:49

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

Copilot code review Copilot Copilot left review comments

kentrussell Awaiting requested review from kentrussell

dayatsin-amd Awaiting requested review from dayatsin-amd dayatsin-amd is a code owner

cfreeamd Awaiting requested review from cfreeamd cfreeamd is a code owner

lancesix Awaiting requested review from lancesix lancesix is a code owner

shwetagkhatri Awaiting requested review from shwetagkhatri shwetagkhatri is a code owner

jlgreathouse Awaiting requested review from jlgreathouse

Requested changes must be addressed to merge this pull request.

Labels

organization: ROCm project: rocr-runtime