Skip to content

Conversation

@vlaindic
Copy link
Member

Motivation

One of the PRs from the big PR proposed here: #2857 . This PR aims to unify the paths for host-trap and stochastic sample processing.

Technical Details

JIRA ID

Test Plan

Test Result

Submission Checklist

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR unifies the code paths for host-trap (HT) and stochastic (ST) sample processing in the GFX12 trap handler assembly. The refactoring extracts common sample-filling operations into a shared path (.fill_sample_common) to reduce code duplication.

Changes:

  • Introduced a common sample-filling path that handles timestamp, exec, workgroup information, HW_ID, and correlation ID for both HT and ST paths
  • Restructured sample data offsets to separate wave-in-group and chiplet information into a dedicated field
  • Added PC address masking constants and improved PC handling for both trap types

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// ttmp7 contains WGID_Y in low 16 bits.
v_writelane_b32 v2, ttmp9, 0 // wg_id_x
s_bfe_u32 ttmp6, ttmp7, (16<<16) // extract bits 15:0, wg_id_y
s_bfe_u32 ttmp6, ttmp7, (0 | 16<<16) // extract bits 15:0, wg_id_y
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bit field extract expression uses inconsistent spacing compared to line 574. For consistency, format as (0 | (16 << 16)) to match the style used elsewhere in the unified code.

Suggested change
s_bfe_u32 ttmp6, ttmp7, (0 | 16<<16) // extract bits 15:0, wg_id_y
s_bfe_u32 ttmp6, ttmp7, (0 | (16 << 16)) // extract bits 15:0, wg_id_y

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this.

Comment on lines +602 to +606
// Bits [5:0] = wave_in_wg (5 bits from ttmp8[29:25])
// bits [10:8] = chiplet (zero on gfx12.0)
// Bits [7:6] and [31:11] = reserved and must be zero

s_bfe_u32 ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract 5 bits
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says 'Extract 5 bits' but the constant definition at line 121 describes wave_in_wg as occupying bits [29:25], which is actually 5 bits in range but represents a 6-bit capable field. Update the comment to clarify this extracts bits [29:25] for wave_in_wg.

Suggested change
// Bits [5:0] = wave_in_wg (5 bits from ttmp8[29:25])
// bits [10:8] = chiplet (zero on gfx12.0)
// Bits [7:6] and [31:11] = reserved and must be zero
s_bfe_u32 ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract 5 bits
// Bits [4:0] = wave_in_wg value extracted from ttmp8[29:25] (5 bits of a 6-bit-capable field)
// bits [10:8] = chiplet (zero on gfx12.0)
// Bits [7:5] and [31:11] = reserved and must be zero
s_bfe_u32 ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract bits [29:25] (5 bits) for wave_in_wg

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@jlgreathouse jlgreathouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks like a good refactoring. I think putting the host-trap and stochastic trap common code in the same place will allow for other refactors in the future (e.g., grabbing the perf snapshot registers earlier, then jumping to the common code).

Minor change requested on a few unchanged comments. And I think even the current code is not reading perf snapshot registers in the right order, so a request to fix that.

// ttmp7 contains WGID_Y in low 16 bits.
v_writelane_b32 v2, ttmp9, 0 // wg_id_x
s_bfe_u32 ttmp6, ttmp7, (16<<16) // extract bits 15:0, wg_id_y
s_bfe_u32 ttmp6, ttmp7, (0 | 16<<16) // extract bits 15:0, wg_id_y
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this.


// Store wave_in_group and chiplet information in the following format:
// Bits [5:0] = wave_in_wg (5 bits from ttmp8[29:25])
// bits [10:8] = chiplet (zero on gfx12.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth holding off the chiplet information for a future patch, since it is, like the below bits, must-be-zero so far in this gfx12.0 patchset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct, the zero I introduced this is to ease the future patch integration. And yes, the all bits other than [5:0] representing wave_id are zero. More detailed explanation follows:

  s_bfe_u32         ttmp6, ttmp8, (WAVE_ID_WG_BIT_POSITION | (5 << 16)) // Extract 5 bits
  v_writelane_b32   v2, ttmp6, 0                            // Store wave_in_group in v2

  // Write wave_in_group and chiplet (0 on gfx12.0)
  global_store_b32  v[0:1], v2, off, offset:SAMPLE_OFF_WAVE_IN_GROUP_CHIPLET, scope:SCOPE_SYS

ttmp6[4:0] contains wave id, while other bits are zero, meaning bits ttmp6[10:8] representing chiplet are zero. Then, we copy the content of ttmp6 to v2 (lane 0) and override whatever is in v2 lane 0 to achieve:

  • v2 lane 0 [4:0] - wave id information
  • v2 lane 0 [31:5] = 0 including chiplet information.

I didn't want to add another instruction to invalidate chiplet bits as they're already zero.

s_getreg_b32 ttmp6, HW_REG_SQ_PERF_SNAPSHOT_PC_HI
v_writelane_b32 v3, ttmp6, 0x0
global_store_b64 v[0:1], v[2:3], off, offset:SAMPLE_OFF_PC_HOST, scope:SCOPE_SYS // store PC_HI:PC_LO
// For stochastic sampling, use PC from snapshot registers (actual sampled instruction)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these need to be rearranged to happen before the read of SQ_PERF_SNAPSHOT_DATA. The DATA register guards all other perf snapshot registers until the first time it's read.

It's possible that we read DATA and is not in error state. Then before reading PC_LO or PC_HI another snapshot happens on a different wave, leading to the PC_LO and PC_HI values to point to a different PC unrelated to the one where all the data registers came from.

Basically, DATA has to be read last out of all of these.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ordering is done based on the future arch spec. I'll provide more NPI details offline via email. As of now, I don't think we should focus on the stochastic part on GFX12.0, as it's unsupported and never will be.

@vlaindic vlaindic requested a review from jlgreathouse January 28, 2026 08:49
@vlaindic vlaindic force-pushed the users/vlaindic/2nd-lvl-th-unify-ht-and-st branch from d6b3b8d to d519ac6 Compare January 28, 2026 08:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants