
[BREAKING][megatron] feat: support Megatron-FSDP as a new training backend#5423

Draft
conver334 wants to merge 2 commits into verl-project:main from conver334:megatron-fsdp

Conversation

@conver334
Contributor

What does this PR do?

Add Megatron-FSDP as a new training backend option for the Megatron engine. This is an implementation of #5244.

Key changes:

  • Add use_megatron_fsdp config flag to McoreEngineConfig, enabling users to switch between DDP and Megatron-FSDP via a single config option.
  • When enabled, automatically configure the required DDP settings (distributed optimizer, Zero-3 sharding strategy, grad reduce overlap) with sensible defaults, while still allowing fine-grained override via override_ddp_config.
  • Handle FSDP parameter state lifecycle (sync DTensors before weight export, deferred restore before training) to properly switch to rollout mode.
  • Add example script run_qwen2-7b_math_megatron_fsdp.sh.


Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: [megatron, model] feat: qwen3.5 example  #5381
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

[TODO]


API and Usage Example

Enable Megatron-FSDP by adding a single config flag:

    python3 -m verl.trainer.main_ppo --config-path=config \
        --config-name='ppo_megatron_trainer.yaml' \
        actor_rollout_ref.actor.megatron.use_mbridge=True \
        actor_rollout_ref.actor.megatron.vanilla_mbridge=True \
        actor_rollout_ref.actor.megatron.use_megatron_fsdp=True \

The FSDP-specific DDP settings (sharding strategy, overlap, etc.) are auto-configured with defaults. Advanced users can override them:

    actor_rollout_ref.actor.megatron.override_ddp_config.data_parallel_sharding_strategy=optim_grads \
    actor_rollout_ref.actor.megatron.override_ddp_config.overlap_grad_reduce=False \

Design & Code Changes

Megatron-FSDP uses the same training loop as Megatron.


Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for Megatron-FSDP as a new training backend, including configuration flags, automatic DDP settings, and state lifecycle management. The overall implementation is sound, but I've identified a critical issue in the FSDP parameter synchronization logic that could lead to inconsistent model states during inference. Additionally, the new example script has a hardcoded model path, which impacts its portability. I've provided suggestions to fix these issues.

Comment on lines +363 to +368

    for model_chunk in model_chunks:
        fsdp = model_chunk.module
        if getattr(fsdp, "data_parallel_sharding_strategy", None) == "optim_grads_params":
            fsdp.synchronize_param_gather()
            return True
    return False

critical

The function synchronize_megatron_fsdp_params returns immediately after synchronizing the first FSDP module it finds. If model_chunks can contain multiple FSDP-wrapped modules (e.g., with pipeline parallelism), this will result in only the first one being synchronized, potentially leading to inconsistent model states and silent correctness issues during inference. The function should iterate through all model chunks and synchronize all applicable FSDP modules before returning.

Suggested change

    -   for model_chunk in model_chunks:
    -       fsdp = model_chunk.module
    -       if getattr(fsdp, "data_parallel_sharding_strategy", None) == "optim_grads_params":
    -           fsdp.synchronize_param_gather()
    -           return True
    -   return False
    +   synchronized = False
    +   for model_chunk in model_chunks:
    +       fsdp = model_chunk.module
    +       if getattr(fsdp, "data_parallel_sharding_strategy", None) == "optim_grads_params":
    +           fsdp.synchronize_param_gather()
    +           synchronized = True
    +   return synchronized

    data.max_response_length=512 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=/root/models/Qwen2.5-3B-Instruct \

high

The model path is hardcoded to /root/models/Qwen2.5-3B-Instruct. This makes the example script not portable and difficult for other users to run without modification. It's better to use an environment variable for the model path to make the script more generic and easier to use.

For example, you could add MODEL_PATH=${MODEL_PATH:-/path/to/your/model} at the top of the script and then use $MODEL_PATH here.

Suggested change

    -   actor_rollout_ref.model.path=/root/models/Qwen2.5-3B-Instruct \
    +   actor_rollout_ref.model.path=${MODEL_PATH} \
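The default-with-override pattern the comment recommends can be checked in isolation; the path below is a placeholder, not a real model location.

```shell
# ${VAR:-default} expands to $MODEL_PATH if it is set and non-empty,
# otherwise to the placeholder default on the right.
MODEL_PATH=${MODEL_PATH:-/path/to/your/model}
echo "$MODEL_PATH"
```

Placed at the top of the example script, this lets users run it unmodified or point it at their own checkout with `MODEL_PATH=/my/model bash run_qwen2-7b_math_megatron_fsdp.sh`.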

@ISEEKYAN ISEEKYAN marked this pull request as draft February 27, 2026 06:41
