-
Notifications
You must be signed in to change notification settings - Fork 120
Unifying HT and ST path #2890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Unifying HT and ST path #2890
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR unifies the code paths for host-trap (HT) and stochastic (ST) sample processing in the GFX12 trap handler assembly. The refactoring extracts common sample-filling operations into a shared path (.fill_sample_common) to reduce code duplication.
Changes:
- Introduced a common sample-filling path that handles timestamp, exec, workgroup information, HW_ID, and correlation ID for both HT and ST paths
- Restructured sample data offsets to separate wave-in-group and chiplet information into a dedicated field
- Added PC address masking constants and improved PC handling for both trap types
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // ttmp7 contains WGID_Y in low 16 bits. | ||
| v_writelane_b32 v2, ttmp9, 0 // wg_id_x | ||
| s_bfe_u32 ttmp6, ttmp7, (16<<16) // extract bits 15:0, wg_id_y | ||
| s_bfe_u32 ttmp6, ttmp7, (0 | 16<<16) // extract bits 15:0, wg_id_y |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bit field extract expression uses inconsistent spacing compared to line 574. For consistency, format as (0 | (16 << 16)) to match the style used elsewhere in the unified code.
| s_bfe_u32 ttmp6, ttmp7, (0 | 16<<16) // extract bits 15:0, wg_id_y | |
| s_bfe_u32 ttmp6, ttmp7, (0 | (16 << 16)) // extract bits 15:0, wg_id_y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this.
| // Bits [5:0] = wave_in_wg (5 bits from ttmp8[29:25]) | ||
| // bits [10:8] = chiplet (zero on gfx12.0) | ||
| // Bits [7:6] and [31:11] = reserved and must be zero | ||
|
|
||
| s_bfe_u32 ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract 5 bits |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says 'Extract 5 bits' but the constant definition at line 121 describes wave_in_wg as occupying bits [29:25], which is actually 5 bits in range but represents a 6-bit capable field. Update the comment to clarify this extracts bits [29:25] for wave_in_wg.
| // Bits [5:0] = wave_in_wg (5 bits from ttmp8[29:25]) | |
| // bits [10:8] = chiplet (zero on gfx12.0) | |
| // Bits [7:6] and [31:11] = reserved and must be zero | |
| s_bfe_u32 ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract 5 bits | |
| // Bits [4:0] = wave_in_wg value extracted from ttmp8[29:25] (5 bits of a 6-bit-capable field) | |
| // bits [10:8] = chiplet (zero on gfx12.0) | |
| // Bits [7:5] and [31:11] = reserved and must be zero | |
| s_bfe_u32 ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract bits [29:25] (5 bits) for wave_in_wg |
jlgreathouse
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks like a good refactoring. I think putting the host-trap and stochastic trap common code in the same place will allow for other refactors in the future (e.g., grabbing the perf snapshot registers earlier, then jumping to the common code).
Minor change requested on a few unchanged comments. And I think even the current code is not reading perf snapshot registers in the right order, so a request to fix that.
projects/rocr-runtime/runtime/hsa-runtime/core/runtime/trap_handler/trap_handler_gfx12.s
Show resolved
Hide resolved
projects/rocr-runtime/runtime/hsa-runtime/core/runtime/trap_handler/trap_handler_gfx12.s
Show resolved
Hide resolved
| // ttmp7 contains WGID_Y in low 16 bits. | ||
| v_writelane_b32 v2, ttmp9, 0 // wg_id_x | ||
| s_bfe_u32 ttmp6, ttmp7, (16<<16) // extract bits 15:0, wg_id_y | ||
| s_bfe_u32 ttmp6, ttmp7, (0 | 16<<16) // extract bits 15:0, wg_id_y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this.
|
|
||
| // Store wave_in_group and chiplet information in the following format: | ||
| // Bits [5:0] = wave_in_wg (5 bits from ttmp8[29:25]) | ||
| // bits [10:8] = chiplet (zero on gfx12.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth holding off the chiplet information for a future patch, since it is, like the below bits, must-be-zero so far in this gfx12.0 patchset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct, the zero I introduced this is to ease the future patch integration. And yes, the all bits other than [5:0] representing wave_id are zero. More detailed explanation follows:
s_bfe_u32 ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract 5 bits
v_writelane_b32 v2, ttmp6, 0 // Store wave_in_group in v2
// Write wave_in_group and chiplet (0 on gfx12.0)
global_store_b32 v[0:1], v2, off, offset:SAMPLE_OFF_WAVE_IN_GROUP_CHIPLET, scope:SCOPE_SYS
ttmp6[4:0] contains wave id, while other bits are zero, meaning bits ttmp6[10:8] representing chiplet are zero. Then, we copy the content of ttmp6 to v2 (lane 0) and override whatever is in v2 lane 0 to achieve:
- v2 lane 0 [4:0] - wave id information
- v2 lane 0 [31:5] = 0 including chiplet information.
I didn't want to add another instruction to invalidate chiplet bits as they're already zero.
| s_getreg_b32 ttmp6, HW_REG_SQ_PERF_SNAPSHOT_PC_HI | ||
| v_writelane_b32 v3, ttmp6, 0x0 | ||
| global_store_b64 v[0:1], v[2:3], off, offset:SAMPLE_OFF_PC_HOST, scope:SCOPE_SYS // store PC_HI:PC_LO | ||
| // For stochastic sampling, use PC from snapshot registers (actual sampled instruction) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these need to be rearranged to happen before the read of SQ_PERF_SNAPSHOT_DATA. The DATA register guards all other perf snapshot registers until the first time it's read.
It's possible that we read DATA and is not in error state. Then before reading PC_LO or PC_HI another snapshot happens on a different wave, leading to the PC_LO and PC_HI values to point to a different PC unrelated to the one where all the data registers came from.
Basically, DATA has to be read last out of all of these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ordering is done based on the future arch spec. I'll provide more NPI details offline via email. As of now, I don't think we should focus on the stochastic part on GFX12.0, as it's unsupported and never will be.
d6b3b8d to
d519ac6
Compare
Motivation
One of the PRs from the big PR proposed here: #2857 . This PR aims to unify the paths for host-trap and stochastic sample processing.
Technical Details
JIRA ID
Test Plan
Test Result
Submission Checklist