Do not mention proprietary info or link to internal work items in this PR.
Work item: Internal
What were the changes?
Replaces loads and stores to data in the simple protocol with cache-bypassing load/store intrinsics that emit global dwordx4 load and store instructions on gfx942/gfx950. The changes can be disabled via a killswitch at compile time. This is a no-op when compiled with a pre-ROCm 7.1.1 compiler, since the required intrinsics are only available from that version on.
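As a rough illustration of the guard pattern described above (the macro names here are hypothetical, not the ones used in the actual change), the new path is only compiled in when the builtins exist and the killswitch is not set:

```cpp
// Illustrative sketch only: RCCL_DISABLE_GLOBAL_B128 and USE_GLOBAL_B128 are
// hypothetical names, not the macros used in this PR.
#ifndef USE_GLOBAL_B128
#  if defined(__has_builtin)
#    if __has_builtin(__builtin_amdgcn_global_load_b128) && \
        __has_builtin(__builtin_amdgcn_global_store_b128) && \
        !defined(RCCL_DISABLE_GLOBAL_B128)  /* compile-time killswitch */
#      define USE_GLOBAL_B128 1
#    endif
#  endif
#  ifndef USE_GLOBAL_B128
#    define USE_GLOBAL_B128 0  /* pre-ROCm 7.1.1 compilers: keep plain loads/stores */
#  endif
#endif
```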
Why were the changes made?
I found this to be faster: there is no meaningful data reuse to speak of, at least for the ring-based algorithms, so we should just bypass the caches for loads and stores to the data arrays. I saw the biggest improvement for single-node AllToAll on gfx942.
How was the outcome achieved?
I worked with Carlo Bertolli and Matthew Curtis to get `__builtin_amdgcn_global_load_b128` and `__builtin_amdgcn_global_store_b128` added to the compiler. These builtins allow writing/reading 16 bytes of data per lane to/from a specified syncscope (e.g., device or system). Device is `"agent"` here, and system is `""`. This is consistent with some other AMD builtins for fences.

I had initially tried inline assembly, but found it brittle: the compiler has a hard time reasoning about what's inside the black box of `asm volatile`, and on AMD GPUs we additionally must specify waitcnts in a way that is correct without a large performance penalty. After trying my hand at this, I decided to delegate it to the compiler.
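For illustration, here is a minimal sketch of how these builtins might be wrapped. The exact prototypes are my assumption (a global pointer, a 16-byte vector value, and a syncscope string, mirroring `__builtin_amdgcn_fence`), so treat this as a sketch rather than the code in the diff:

```cpp
#include <hip/hip_runtime.h>

// Assumed 16-byte element type (four dwords per lane).
using uint32x4 = __attribute__((ext_vector_type(4))) unsigned int;

// Assumed prototypes: pointer (plus value for the store) and a syncscope
// string, where "agent" means device scope and "" means system scope.
__device__ inline uint32x4 loadBypass(const uint32x4* src) {
  return __builtin_amdgcn_global_load_b128(src, "agent");
}

__device__ inline void storeBypass(uint32x4* dst, uint32x4 v) {
  __builtin_amdgcn_global_store_b128(dst, v, "agent");
}

// Each lane moves 16 bytes; on gfx942/gfx950 this should lower to
// global_load_dwordx4 / global_store_dwordx4 that bypass the caches.
__global__ void copyBypass(uint32x4* dst, const uint32x4* src, size_t n) {
  size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (i < n) storeBypass(dst + i, loadBypass(src + i));
}
```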
Additional Documentation:

I additionally tried this with LL/LL128. I didn't see a consistent benefit over just using extended-scope fine-grained memory and PR 1982, so I've held off on that. That may have changed since I last tested it, so I plan on reevaluating.
Builtins require at least ROCm 7.1.1.
Approval Checklist
Do not approve until these items are satisfied.