Do not mention proprietary info or link to internal work items in this PR.
Work item: Internal
What were the changes?
Replaces loads and stores to data in the simple protocol with cache-bypassing load/store intrinsics that emit global dwordx4 load and store instructions on gfx942/gfx950. The changes can be disabled via a killswitch at compile time. This is a no-op when compiled with a pre-ROCm 7.1.1 compiler, since the required intrinsics are only available from that version on.
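As a rough illustration of the guard pattern described above (the macro names here are hypothetical, not the ones used in the actual change), the new path is only compiled in when the builtins exist and the killswitch is not set:

```cpp
// Illustrative sketch only: RCCL_DISABLE_GLOBAL_B128 and USE_GLOBAL_B128 are
// hypothetical names, not the macros used in this PR.
#ifndef USE_GLOBAL_B128
#  if defined(__has_builtin)
#    if __has_builtin(__builtin_amdgcn_global_load_b128) && \
        __has_builtin(__builtin_amdgcn_global_store_b128) && \
        !defined(RCCL_DISABLE_GLOBAL_B128)  /* compile-time killswitch */
#      define USE_GLOBAL_B128 1
#    endif
#  endif
#  ifndef USE_GLOBAL_B128
#    define USE_GLOBAL_B128 0  /* pre-ROCm 7.1.1 compilers: keep plain loads/stores */
#  endif
#endif
```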
Why were the changes made?
I found this to be faster: there is no meaningful data reuse to speak of, at least for the ring-based algorithms, so we should just bypass the caches for loads and stores to the data arrays. I saw the biggest improvement for single-node AllToAll on gfx942.
How was the outcome achieved?
I worked with Carlo Bertolli and Matthew Curtis to get `__builtin_amdgcn_global_load_b128` and `__builtin_amdgcn_global_store_b128` added to the compiler. These builtins allow writing/reading 16 bytes of data per lane to/from a specified syncscope (e.g., device or system). Device is `"agent"` here, and system is `""`. This is consistent with some other AMD builtins for fences.

I had initially tried inline assembly, but found it brittle: the compiler has a hard time reasoning about what's inside the black box of `asm volatile`, and on AMD GPUs we additionally must specify waitcnts in a way that is correct without a large performance penalty. After trying my hand at this, I decided to delegate it to the compiler.
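For illustration, here is a minimal sketch of how these builtins might be wrapped. The exact prototypes are my assumption (a global pointer, a 16-byte vector value, and a syncscope string, mirroring `__builtin_amdgcn_fence`), so treat this as a sketch rather than the code in the diff:

```cpp
#include <hip/hip_runtime.h>

// Assumed 16-byte element type (four dwords per lane).
using uint32x4 = __attribute__((ext_vector_type(4))) unsigned int;

// Assumed prototypes: pointer (plus value for the store) and a syncscope
// string, where "agent" means device scope and "" means system scope.
__device__ inline uint32x4 loadBypass(const uint32x4* src) {
  return __builtin_amdgcn_global_load_b128(src, "agent");
}

__device__ inline void storeBypass(uint32x4* dst, uint32x4 v) {
  __builtin_amdgcn_global_store_b128(dst, v, "agent");
}

// Each lane moves 16 bytes; on gfx942/gfx950 this should lower to
// global_load_dwordx4 / global_store_dwordx4 that bypass the caches.
__global__ void copyBypass(uint32x4* dst, const uint32x4* src, size_t n) {
  size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (i < n) storeBypass(dst + i, loadBypass(src + i));
}
```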
Additional Documentation:

I additionally tried this with LL/LL128. I didn't see a consistent benefit over just using extended-scope fine-grained memory and PR 1982, so I've held off on that. That may have changed since I last tested it, so I plan on reevaluating.
Builtins require at least ROCm 7.1.1.
Approval Checklist
Do not approve until these items are satisfied.