[algo] feat: support router replay in MegatronEngine #5219
xhx1022 wants to merge 5 commits into verl-project:main
Conversation
Code Review
This pull request introduces router replay support in the MegatronEngine, a valuable feature for Mixture-of-Experts models that enables recording and reusing router decisions to ensure deterministic training. The changes are well-structured, touching the PPO trainer, Megatron engine, and supporting utility files, and correctly add support for nested tensors and virtual pipeline parallelism. I've identified one critical issue in the pp_gather utility function related to handling uneven layer distribution in pipeline parallelism, which could lead to runtime errors. My feedback is focused on addressing this to improve the robustness of the implementation.
```python
layers_topk_idx_global_list = [
    torch.empty(
        size=local_layers_router_map.shape,
        dtype=local_layers_router_map.dtype,
        device=local_layers_router_map.device,
    )
    for _ in range(world_size)
]
torch.distributed.all_gather(
    tensor=local_layers_router_map,
    tensor_list=layers_topk_idx_global_list,
    group=pp_group,
    async_op=False,
)
```
The use of torch.distributed.all_gather assumes that local_layers_router_map has the same shape across all pipeline parallel ranks. However, with uneven pipeline parallelism (which is supported by get_num_layers_to_build), the number of layers per rank can differ, leading to different tensor shapes. This will cause all_gather to fail with a shape mismatch error.
This issue is also hinted at in the TODO at line 355. To make this function robust to uneven layer distributions, torch.distributed.all_gather_object should be used instead, as it can handle tensors of varying sizes. Note that the subsequent torch.cat at line 413 will also fail with tensors of different shapes and will need to be adjusted to handle this case when VPP is not enabled.
```python
layers_topk_idx_global_list = [None] * world_size
torch.distributed.all_gather_object(layers_topk_idx_global_list, local_layers_router_map, pp_group)
```

Signed-off-by: xhx1022 <1737006628@qq.com>
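Building on that suggestion, the downstream concatenation also has to cope with per-rank tensors of different sizes once the gather no longer enforces a uniform shape. The sketch below is a minimal illustration of one way to do this, assuming the layer dimension is the leading axis; the helper name `gather_router_maps_uneven` and the CPU round-trip are assumptions made for the example, not the PR's actual code.

```python
import torch
import torch.distributed as dist


def gather_router_maps_uneven(local_layers_router_map: torch.Tensor, pp_group) -> torch.Tensor:
    """Gather per-rank router maps whose leading (layer) dimension may differ
    across pipeline ranks, then concatenate them along that dimension.

    Hypothetical helper: names and the layer-major layout are assumptions.
    """
    world_size = dist.get_world_size(group=pp_group)

    # all_gather_object serializes arbitrary Python objects, so each rank may
    # contribute a tensor of a different shape (unlike all_gather).
    gathered: list = [None] * world_size
    dist.all_gather_object(gathered, local_layers_router_map.cpu(), group=pp_group)

    # Concatenate along the layer dimension; entries may hold different
    # numbers of layers under uneven pipeline partitioning.
    return torch.cat([t.to(local_layers_router_map.device) for t in gathered], dim=0)
```

A shape-agnostic path like this could also retire the TODO about uneven pipeline partitioning, at the cost of the extra serialization that all_gather_object performs.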
What does this PR do?
This PR introduces Router Replay support within the MegatronEngine, enabling the router decisions computed in compute_log_prob to be reused by update_actor.
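At a high level, router replay records the top-k expert indices chosen by each MoE router during the log-prob forward pass and feeds them back during the policy update, so both passes see identical expert assignments. The sketch below is a conceptual illustration of that record/replay pattern only, assuming a per-layer buffer keyed by layer id; the class and function names are hypothetical and do not reflect the MegatronEngine implementation.

```python
import torch


class RouterReplayBuffer:
    """Illustrative record/replay of MoE top-k routing decisions."""

    def __init__(self):
        self._topk_idx_per_layer = {}  # layer_id -> recorded top-k expert indices
        self.replay_enabled = False

    def record(self, layer_id: int, topk_idx: torch.Tensor) -> None:
        # Called during the log-prob forward pass: remember which experts
        # each token was routed to.
        self._topk_idx_per_layer[layer_id] = topk_idx.detach()

    def replay(self, layer_id: int) -> torch.Tensor:
        # Called during the policy update: reuse the recorded routing so the
        # update forward pass sees exactly the same expert assignment.
        return self._topk_idx_per_layer[layer_id]


def route_tokens(router_logits: torch.Tensor, top_k: int,
                 buffer: RouterReplayBuffer, layer_id: int) -> torch.Tensor:
    """Return top-k expert indices, either freshly computed or replayed."""
    if buffer.replay_enabled:
        return buffer.replay(layer_id)
    topk_idx = router_logits.topk(top_k, dim=-1).indices
    buffer.record(layer_id, topk_idx)
    return topk_idx
```

A buffer like this would typically be switched into replay mode between the log-prob phase and update_actor, which is the determinism guarantee the review summary describes.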