[Megatron] feat: Support routing replay on NPU with performance and compatibility enhancements #5298
Open · 755651978 wants to merge 12 commits into verl-project:main
Conversation
Contributor
Code Review
This pull request introduces support for MoE Routing Replay on NPU platforms and adds several compatibility patches for older Megatron versions. The changes include dynamic method injection, adaptive configuration patching using introspection, and data alignment fixes for deterministic rollouts. While the changes are generally well-implemented and improve compatibility, I've identified a critical copy-paste bug in the TransformerConfig patching logic that could lead to incorrect behavior or runtime errors. Please address this issue.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Li-Yongwen
reviewed
Feb 12, 2026
wucong25
reviewed
Feb 13, 2026
wucong25
reviewed
Feb 13, 2026
- Added dynamic max_tokens logic and attribute safety checks.
- Synchronized code comments to English and fixed import dependencies.
wucong25
approved these changes
Feb 27, 2026
What does this PR do?
This PR enables MoE Routing Replay on NPU (MindSpeed) platforms and provides critical compatibility patches for Megatron 0.12.1. It addresses the absence of key routing methods in older Megatron versions and ensures end-to-end data alignment from the rollout phase to the training phase.
Key Enhancements
Compatibility Injection: Since Megatron 0.12.1 lacks the is_aux_loss_enabled() method, this PR implements it at the module level. Using types.MethodType, we dynamically bind this method to the router instance, ensuring consistency with newer Megatron APIs.
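A minimal sketch of this binding pattern, assuming a router class and an aux-loss config field; the `LegacyRouter` name, `moe_aux_loss_coeff` attribute, and the enable condition are illustrative stand-ins, not verl's exact code:

```python
import types

class LegacyRouter:
    """Stands in for a Megatron 0.12.1 router lacking is_aux_loss_enabled()."""
    def __init__(self, moe_aux_loss_coeff=0.01):
        self.moe_aux_loss_coeff = moe_aux_loss_coeff

def _is_aux_loss_enabled(self):
    # Mirrors the newer Megatron API: aux loss counts as enabled when its
    # coefficient is set and nonzero.
    return getattr(self, "moe_aux_loss_coeff", 0.0) > 0.0

router = LegacyRouter()
if not hasattr(router, "is_aux_loss_enabled"):
    # types.MethodType binds the function to this instance, so `self`
    # resolves correctly and callers see a normal bound method.
    router.is_aux_loss_enabled = types.MethodType(_is_aux_loss_enabled, router)

print(router.is_aux_loss_enabled())
```

Binding per-instance (rather than monkey-patching the class) keeps the patch scoped to routers verl actually constructs.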
MindSpeed Resilience: Implements class-level attribute injection for enable_routing_replay and moe_router_fusion. This prevents MindSpeed’s dynamic dataclass reconstruction from stripping verl-specific configurations during NPU initialization.
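A sketch of why class-level injection survives reconstruction, under the assumption that MindSpeed rebuilds fresh config instances from the class; the `TransformerConfig` shape here is illustrative:

```python
from dataclasses import dataclass

# Illustrative stand-in for Megatron's TransformerConfig dataclass.
@dataclass
class TransformerConfig:
    hidden_size: int = 1024

# Patching an *instance* is lost when the framework reconstructs the config
# from the class. Class-level attributes act as fallback defaults that every
# rebuilt instance inherits via normal attribute lookup.
for name, default in (("enable_routing_replay", False), ("moe_router_fusion", False)):
    if not hasattr(TransformerConfig, name):
        setattr(TransformerConfig, name, default)

rebuilt = TransformerConfig()  # simulates MindSpeed's dataclass reconstruction
print(rebuilt.enable_routing_replay)
```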
Dynamic Signature Detection: Uses inspect.signature to adapt to different TransformerConfig versions and vp_stage (Virtual Pipeline) logic, ensuring correct layer offset mapping in complex pipeline-parallel NPU setups.
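The signature-detection idea can be sketched as follows; the two `get_layer_offset_*` helpers are hypothetical versions of a layer-offset function whose newer variant grew a `vp_stage` parameter:

```python
import inspect

def get_layer_offset_old(pipeline_rank):
    # Older API: offset depends only on the pipeline rank.
    return pipeline_rank * 4

def get_layer_offset_new(pipeline_rank, vp_stage=0):
    # Newer API: virtual-pipeline stage shifts the offset further.
    return pipeline_rank * 4 + vp_stage

def call_layer_offset(fn, pipeline_rank, vp_stage):
    # Pass vp_stage only when the installed version's signature accepts it,
    # so one call site works across Megatron releases.
    if "vp_stage" in inspect.signature(fn).parameters:
        return fn(pipeline_rank, vp_stage=vp_stage)
    return fn(pipeline_rank)

print(call_layer_offset(get_layer_offset_old, 2, 1))  # 8
print(call_layer_offset(get_layer_offset_new, 2, 1))  # 9
```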
Deterministic Rollout: Standardizes max_tokens to a fixed response_length in the vLLM rollout worker. This prevents shape mismatches in routed_experts caused by fluctuating prompt lengths.
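A sketch of the shape problem and the fix, assuming a vLLM-style `max_tokens` sampling parameter; the helper name and fields are illustrative, not the PR's exact code:

```python
def build_sampling_kwargs(prompt_len, response_length, max_model_len):
    # Before this change: max_tokens = max_model_len - prompt_len, which
    # varies per prompt and yields ragged routed_experts buffers that
    # cannot be stacked into one tensor for replay.
    # After: a fixed response budget, clamped only by the model's context.
    max_tokens = min(response_length, max_model_len - prompt_len)
    return {"max_tokens": max_tokens}

# Two prompts of different lengths now share one generation budget.
print(build_sampling_kwargs(10, 512, 4096))
print(build_sampling_kwargs(100, 512, 4096))
```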
Data Preservation: Uses safe getattr calls to ensure that routing metadata captured during the agent's interaction loop is successfully passed to AgentLoopOutput for training.
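The safe-getattr pattern can be sketched as below; `RolloutResult`, `to_agent_loop_output`, and the `routed_experts` attribute are hypothetical stand-ins for verl's rollout and AgentLoopOutput types:

```python
class RolloutResult:
    """Stand-in for a rollout result that may or may not carry routing data."""
    def __init__(self, routed_experts=None):
        if routed_experts is not None:
            self.routed_experts = routed_experts

def to_agent_loop_output(result):
    output = {}
    # getattr with a default avoids AttributeError on engines or configs
    # that never produce routing metadata (e.g. replay disabled).
    experts = getattr(result, "routed_experts", None)
    if experts is not None:
        output["routed_experts"] = experts
    return output

print(to_agent_loop_output(RolloutResult([3, 1, 4])))
print(to_agent_loop_output(RolloutResult()))
```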
Testing & Validation
Environment: Tested on Ascend NPU with MindSpeed.
Routing Consistency: Verified that the routed_experts generated during the rollout phase perfectly match the indices replayed during the training phase.
Performance: Benchmarked the forward pass; relocating helper functions resulted in a measurable reduction in Python-level overhead per iteration.