Skip to content

Conversation

@alex-breslow-amd
Copy link
Contributor

@alex-breslow-amd alex-breslow-amd commented Jan 23, 2026

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: Internal

What were the changes?
Replaces loads and stores to data in the simple protocol with cache bypassing load/store intrinsics that emit global dwordx4 load and store instructions on gfx942/gfx950. The changes can be disabled via a killswitch at compile time. This is a no-op when compiled with a pre-ROCm 7.1.1 compiler since it requires intrinsics that are only available after that version.

Why were the changes made?
I found that this was faster since there's not really any meaningful data reuse to speak of at least for the ring-based algorithms. We should just bypass the caches for loads and stores to the data arrays. I saw the biggest improvement for single node AllToAll on gfx942.

How was the outcome achieved?
I worked with Carlo Bertolli and Matthew Curtis to get __builtin_amdgcn_global_load_b128 and __builtin_amdgcn_global_store_b128 added to the compiler. These builtins allow writing/reading 16 bytes of data per lane to/from a specified syncscope (e.g., device or system). Device is "agent" here, and system is "". This is consistent with some other AMD builtins for fences.

I had initially tried inline assembly, but I found that it was brittle. The compiler has a hard time reasoning about what's inside of the black box of asm volatile. With AMD GPUs, we additionally must specify waitcnts in a way that is correct without a large performance penalty. After trying my hand at this, I have decided to delegate it to the compiler.

Additional Documentation:
I additionally tried this with LL/LL128. I didn't see a consistent benefit over just using extended-scope fine grain memory and PR 1982, so I've held off on that. That may have changed since I last tested it, so I plan on reevaluating.

Builtins require at least ROCm 7.1.1.

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants