[megatron] fix: patch support newer mcore version #5372
HollowMan6 wants to merge 1 commit into verl-project:main
Conversation
Code Review
The pull request successfully updates the Megatron-Core (mcore) patching logic to support version 0.13.0 and newer. The changes correctly handle the updated return signatures of internal methods like get_query_key_value_tensors and _adjust_key_value_for_inference, and appropriately skip legacy patches that are no longer required in newer mcore versions. The implementation is robust and maintains backward compatibility with mcore 0.12. No high-severity issues were identified in the changes.
Pull request overview
This PR enhances Megatron Core version compatibility by updating the Multi-Latent Attention (MLA) patch to support mcore >= 0.13. The key change is to use defensive unpacking for the return values of get_query_key_value_tensors, and to conditionally apply the legacy patch only for mcore < 0.13, as newer versions have the fix upstream.
Changes:
- Changed from direct tuple unpacking to slice-based unpacking (`qkv[:3]`) to handle variable return values across mcore versions
- Added conditional patch application for `get_query_key_value_tensors`, which only applies to mcore < 0.13
- Added explanatory comments about the version-specific behavior
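The slice-based unpacking described above can be sketched roughly as follows. This is a minimal illustration, not the actual verl patch; it only assumes the two return shapes discussed in this PR (a 3-tuple for mcore < 0.13, a 5-tuple for mcore >= 0.13), and the function name is made up for the example:

```python
# Minimal sketch: tolerate both return shapes of
# get_query_key_value_tensors across mcore versions.
# mcore < 0.13:  (query, key, value)
# mcore >= 0.13: (query, key, value, q_compressed, kv_compressed)

def unpack_qkv(qkv):
    # Take only the first three values; any extras are ignored here.
    query, key, value = qkv[:3]
    return query, key, value

# Behaves identically for both tuple lengths:
assert unpack_qkv(("q", "k", "v")) == ("q", "k", "v")
assert unpack_qkv(("q", "k", "v", "qc", "kvc")) == ("q", "k", "v")
```

The point of slicing rather than unpacking directly is that `query, key, value = qkv` would raise a `ValueError` on the newer 5-tuple form.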
```python
    packed_seq_params,
    inference_context=inference_context,
)
query, key, value = qkv[:3]
```
What's the meaning of this change?
In NVIDIA/Megatron-LM@382eeea#diff-d7e453d550e7f78221ea3445018264015c1e83c0db892a41253bde71d069a202L247, the return value of get_query_key_value_tensors changed from query, key, value to query, key, value, q_compressed, kv_compressed. kv_compressed is not used in the forward pass, so it just gets ignored; I wrote it this way to take the first 3 returned values for backward compatibility.
I just found that q_compressed is actually used for DSA like DeepSeek V3.2, so I made and pushed a modification to fix it.
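That follow-up fix could look roughly like this. A hedged sketch only: the function and variable names are illustrative, not the exact verl identifiers, and it assumes the 3-tuple/5-tuple shapes described above:

```python
# Sketch: keep q_compressed when the newer 5-tuple form is returned.
# DSA variants (e.g. DeepSeek V3.2) consume q_compressed; other
# attention variants can ignore it.

def unpack_qkv_with_compressed(qkv):
    # First three values exist in all supported mcore versions.
    query, key, value = qkv[:3]
    # q_compressed only exists in the mcore >= 0.13 return shape.
    q_compressed = qkv[3] if len(qkv) > 3 else None
    return query, key, value, q_compressed
```

On older mcore the fourth slot simply comes back as `None`, so callers that do not need DSA are unaffected.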
Do we still need this whole patch for newer version megatron?
Yes, as unfortunately NVIDIA/TransformerEngine#2629 hasn't been merged into TE.
and neither has NVIDIA/Megatron-LM#3003 been merged for mcore. If one of them gets merged, we won't need this patch to support flash attention for MLA.
Force-pushed 0c15d6b to b9a8111
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Force-pushed b9a8111 to 3f7b80a
Code Review
This pull request aims to add support for newer versions of megatron-core by adapting to changes in the attention mechanism API. The changes correctly handle a new potential return value (q_compressed) from get_query_key_value_tensors and pass it to core_attention when the dsa attention variant is used. The conditional patching based on the mcore version is also a good approach to maintain backward compatibility. I have one suggestion to improve robustness by adding an assertion to ensure q_compressed is available when required, which will make potential configuration errors easier to debug.
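The suggested assertion could be sketched like this. This is an illustrative sketch under the assumption that the attention variant is available as a config string; the function name and the `"dsa"` key are hypothetical, not verl's actual identifiers:

```python
def check_dsa_inputs(attention_variant, q_compressed):
    # Fail fast with a clear message, instead of a confusing downstream
    # error, when DSA needs q_compressed but an older mcore (< 0.13)
    # returned only the 3-tuple form.
    if attention_variant == "dsa":
        assert q_compressed is not None, (
            "q_compressed is required for the 'dsa' attention variant "
            "but was not returned; this likely means mcore < 0.13"
        )

check_dsa_inputs("dsa", q_compressed=object())  # ok: DSA with tensor
check_dsa_inputs("default", q_compressed=None)  # ok: not DSA
```

Placed right after the unpacking, this turns a silent misconfiguration into an immediate, self-explanatory failure.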
Tested on NVIDIA/Megatron-LM@bbbedbb

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Force-pushed 3f7b80a to 1f95774
What does this PR do?
Patch support newer mcore version
Tested on NVIDIA/Megatron-LM@bbbedbb
Checklist Before Starting
- The PR title follows the format `[{modules}] {type}: {description}` (this will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If the PR is breaking, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example
```python
# Add code snippet or script demonstrating how to use this
```

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR changes the `recipe` submodule, also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.