[Megatron] feat: Support routing replay on NPU with performance and compatibility enhancements #5298
Open · 755651978 wants to merge 12 commits into verl-project:main
Conversation
Contributor
Code Review
This pull request introduces support for MoE Routing Replay on NPU platforms and adds several compatibility patches for older Megatron versions. The changes include dynamic method injection, adaptive configuration patching using introspection, and data alignment fixes for deterministic rollouts. While the changes are generally well-implemented and improve compatibility, I've identified a critical copy-paste bug in the TransformerConfig patching logic that could lead to incorrect behavior or runtime errors. Please address this issue.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Li-Yongwen
reviewed
Feb 12, 2026
wucong25
reviewed
Feb 13, 2026
wucong25
reviewed
Feb 13, 2026
- Added dynamic max_tokens logic and attribute safety checks.
- Synchronized code comments to English and fixed import dependencies.
wucong25
approved these changes
Feb 27, 2026
What does this PR do?
This PR enables MoE Routing Replay on NPU (MindSpeed) platforms and provides critical compatibility patches for Megatron 0.12.1. It addresses the absence of key routing methods in older Megatron versions and ensures end-to-end data alignment from the rollout phase to the training phase.
Key Enhancements
Compatibility Injection: Since Megatron 0.12.1 lacks the is_aux_loss_enabled() method, this PR implements it at the module level. Using types.MethodType, we dynamically bind this method to the router instance, ensuring consistency with newer Megatron APIs.
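A minimal sketch of this binding pattern, assuming a router class and an aux-loss config field; the `LegacyRouter` name, `moe_aux_loss_coeff` attribute, and the enable condition are illustrative stand-ins, not verl's exact code:

```python
import types

class LegacyRouter:
    """Stands in for a Megatron 0.12.1 router lacking is_aux_loss_enabled()."""
    def __init__(self, moe_aux_loss_coeff=0.01):
        self.moe_aux_loss_coeff = moe_aux_loss_coeff

def _is_aux_loss_enabled(self):
    # Mirrors the newer Megatron API: aux loss counts as enabled when its
    # coefficient is set and nonzero.
    return getattr(self, "moe_aux_loss_coeff", 0.0) > 0.0

router = LegacyRouter()
if not hasattr(router, "is_aux_loss_enabled"):
    # types.MethodType binds the function to this instance, so `self`
    # resolves correctly and callers see a normal bound method.
    router.is_aux_loss_enabled = types.MethodType(_is_aux_loss_enabled, router)

print(router.is_aux_loss_enabled())
```

Binding per-instance (rather than monkey-patching the class) keeps the patch scoped to routers verl actually constructs.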
MindSpeed Resilience: Implements class-level attribute injection for enable_routing_replay and moe_router_fusion. This prevents MindSpeed’s dynamic dataclass reconstruction from stripping verl-specific configurations during NPU initialization.
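A sketch of why class-level injection survives reconstruction, under the assumption that MindSpeed rebuilds fresh config instances from the class; the `TransformerConfig` shape here is illustrative:

```python
from dataclasses import dataclass

# Illustrative stand-in for Megatron's TransformerConfig dataclass.
@dataclass
class TransformerConfig:
    hidden_size: int = 1024

# Patching an *instance* is lost when the framework reconstructs the config
# from the class. Class-level attributes act as fallback defaults that every
# rebuilt instance inherits via normal attribute lookup.
for name, default in (("enable_routing_replay", False), ("moe_router_fusion", False)):
    if not hasattr(TransformerConfig, name):
        setattr(TransformerConfig, name, default)

rebuilt = TransformerConfig()  # simulates MindSpeed's dataclass reconstruction
print(rebuilt.enable_routing_replay)
```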
Dynamic Signature Detection: Uses inspect.signature to adapt to different TransformerConfig versions and vp_stage (Virtual Pipeline) logic, ensuring correct layer offset mapping in complex pipeline-parallel NPU setups.
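The signature-detection idea can be sketched as follows; the two `get_layer_offset_*` helpers are hypothetical versions of a layer-offset function whose newer variant grew a `vp_stage` parameter:

```python
import inspect

def get_layer_offset_old(pipeline_rank):
    # Older API: offset depends only on the pipeline rank.
    return pipeline_rank * 4

def get_layer_offset_new(pipeline_rank, vp_stage=0):
    # Newer API: virtual-pipeline stage shifts the offset further.
    return pipeline_rank * 4 + vp_stage

def call_layer_offset(fn, pipeline_rank, vp_stage):
    # Pass vp_stage only when the installed version's signature accepts it,
    # so one call site works across Megatron releases.
    if "vp_stage" in inspect.signature(fn).parameters:
        return fn(pipeline_rank, vp_stage=vp_stage)
    return fn(pipeline_rank)

print(call_layer_offset(get_layer_offset_old, 2, 1))  # 8
print(call_layer_offset(get_layer_offset_new, 2, 1))  # 9
```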
Deterministic Rollout: Standardizes max_tokens to a fixed response_length in the vLLM rollout worker. This prevents shape mismatches in routed_experts caused by fluctuating prompt lengths.
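A sketch of the shape problem and the fix, assuming a vLLM-style `max_tokens` sampling parameter; the helper name and fields are illustrative, not the PR's exact code:

```python
def build_sampling_kwargs(prompt_len, response_length, max_model_len):
    # Before this change: max_tokens = max_model_len - prompt_len, which
    # varies per prompt and yields ragged routed_experts buffers that
    # cannot be stacked into one tensor for replay.
    # After: a fixed response budget, clamped only by the model's context.
    max_tokens = min(response_length, max_model_len - prompt_len)
    return {"max_tokens": max_tokens}

# Two prompts of different lengths now share one generation budget.
print(build_sampling_kwargs(10, 512, 4096))
print(build_sampling_kwargs(100, 512, 4096))
```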
Data Preservation: Uses safe getattr calls to ensure that routing metadata captured during the agent's interaction loop is successfully passed to AgentLoopOutput for training.
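The safe-getattr pattern can be sketched as below; `RolloutResult`, `to_agent_loop_output`, and the `routed_experts` attribute are hypothetical stand-ins for verl's rollout and AgentLoopOutput types:

```python
class RolloutResult:
    """Stand-in for a rollout result that may or may not carry routing data."""
    def __init__(self, routed_experts=None):
        if routed_experts is not None:
            self.routed_experts = routed_experts

def to_agent_loop_output(result):
    output = {}
    # getattr with a default avoids AttributeError on engines or configs
    # that never produce routing metadata (e.g. replay disabled).
    experts = getattr(result, "routed_experts", None)
    if experts is not None:
        output["routed_experts"] = experts
    return output

print(to_agent_loop_output(RolloutResult([3, 1, 4])))
print(to_agent_loop_output(RolloutResult()))
```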
Testing & Validation
Environment: Tested on Ascend NPU with MindSpeed.
Routing Consistency: Verified that the routed_experts generated during the rollout phase perfectly match the indices replayed during the training phase.
Performance: Benchmarked the forward pass; relocating helper functions resulted in a measurable reduction in Python-level overhead per iteration.