[BREAKING][megatron] feat: support Megatron-FSDP as a new training backend#5423
Conversation
Signed-off-by: conver334 <conver334@gmail.com>
Code Review
This pull request introduces support for Megatron-FSDP as a new training backend, including configuration flags, automatic DDP settings, and state lifecycle management. The overall implementation is sound, but I've identified a critical issue in the FSDP parameter synchronization logic that could lead to inconsistent model states during inference. Additionally, the new example script has a hardcoded model path, which impacts its portability. I've provided suggestions to fix these issues.
```python
for model_chunk in model_chunks:
    fsdp = model_chunk.module
    if getattr(fsdp, "data_parallel_sharding_strategy", None) == "optim_grads_params":
        fsdp.synchronize_param_gather()
        return True
return False
```
The function `synchronize_megatron_fsdp_params` returns immediately after synchronizing the first FSDP module it finds. If `model_chunks` can contain multiple FSDP-wrapped modules (e.g., with pipeline parallelism), only the first one will be synchronized, potentially leading to inconsistent model states and silent correctness issues during inference. The function should iterate through all model chunks and synchronize every applicable FSDP module before returning.
Suggested change:

```diff
-for model_chunk in model_chunks:
-    fsdp = model_chunk.module
-    if getattr(fsdp, "data_parallel_sharding_strategy", None) == "optim_grads_params":
-        fsdp.synchronize_param_gather()
-        return True
-return False
+synchronized = False
+for model_chunk in model_chunks:
+    fsdp = model_chunk.module
+    if getattr(fsdp, "data_parallel_sharding_strategy", None) == "optim_grads_params":
+        fsdp.synchronize_param_gather()
+        synchronized = True
+return synchronized
```
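The difference is easy to demonstrate with stand-in objects (hypothetical `FakeFSDP`/`FakeChunk` classes, not the real Megatron-FSDP types): with the early `return`, only the first chunk's parameters are gathered; the corrected loop touches every chunk.

```python
class FakeFSDP:
    """Hypothetical stand-in for an FSDP-wrapped module."""
    def __init__(self):
        self.data_parallel_sharding_strategy = "optim_grads_params"
        self.synced = False

    def synchronize_param_gather(self):
        self.synced = True


class FakeChunk:
    """Hypothetical stand-in for a model chunk holding a wrapped module."""
    def __init__(self):
        self.module = FakeFSDP()


def sync_all(model_chunks):
    """The corrected loop: synchronize every applicable chunk before returning."""
    synchronized = False
    for model_chunk in model_chunks:
        fsdp = model_chunk.module
        if getattr(fsdp, "data_parallel_sharding_strategy", None) == "optim_grads_params":
            fsdp.synchronize_param_gather()
            synchronized = True
    return synchronized


chunks = [FakeChunk(), FakeChunk()]
assert sync_all(chunks) is True
assert all(c.module.synced for c in chunks)  # every chunk was synchronized
```

With the original early-return version, the second chunk's `synced` flag would remain `False`.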
```
data.max_response_length=512 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=/root/models/Qwen2.5-3B-Instruct \
```
The model path is hardcoded to `/root/models/Qwen2.5-3B-Instruct`. This makes the example script non-portable and difficult for other users to run without modification. It's better to use an environment variable for the model path, which makes the script more generic and easier to use.

For example, you could add `MODEL_PATH=${MODEL_PATH:-/path/to/your/model}` at the top of the script and then use `$MODEL_PATH` here.
Suggested change:

```diff
-actor_rollout_ref.model.path=/root/models/Qwen2.5-3B-Instruct \
+actor_rollout_ref.model.path=${MODEL_PATH} \
```
What does this PR do?
Add Megatron-FSDP as a new training backend option for the Megatron engine. This is an implementation of #5244.
Key changes:
Checklist Before Starting
- Title format: `[{modules}] {type}: {description}` (This will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`, like `[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- For breaking changes, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
[TODO]
API and Usage Example
Enable Megatron-FSDP by adding a single config flag:
```shell
python3 -m verl.trainer.main_ppo --config-path=config \
    --config-name='ppo_megatron_trainer.yaml' \
    actor_rollout_ref.actor.megatron.use_mbridge=True \
    actor_rollout_ref.actor.megatron.vanilla_mbridge=True \
    actor_rollout_ref.actor.megatron.use_megatron_fsdp=True \
    ...
```

The FSDP-specific DDP settings (sharding strategy, overlap, etc.) are auto-configured with defaults. Advanced users can override them:
```shell
actor_rollout_ref.actor.megatron.override_ddp_config.data_parallel_sharding_strategy=optim_grads \
actor_rollout_ref.actor.megatron.override_ddp_config.overlap_grad_reduce=False \
```

Design & Code Changes
Megatron-FSDP uses the same training loop as Megatron.
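The override behavior described above can be pictured as overlaying user-supplied keys on the auto-configured defaults. This is an illustrative sketch only (plain dict merge; the keys mirror the options shown in the usage example, and verl's actual config handling goes through its YAML/Hydra config system, not this code):

```python
# Auto-configured FSDP DDP defaults (values here are illustrative).
default_ddp_config = {
    "data_parallel_sharding_strategy": "optim_grads_params",
    "overlap_grad_reduce": True,
}

# User-supplied override_ddp_config entries from the command line.
override_ddp_config = {
    "data_parallel_sharding_strategy": "optim_grads",
    "overlap_grad_reduce": False,
}

# Overrides take precedence over the defaults; untouched keys keep defaults.
ddp_config = {**default_ddp_config, **override_ddp_config}
print(ddp_config["data_parallel_sharding_strategy"])  # optim_grads
```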
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.