[BREAKING][megatron] feat: support Megatron-FSDP as a new training backend#5423
Conversation
Signed-off-by: conver334 <conver334@gmail.com>
Code Review
This pull request introduces support for Megatron-FSDP as a new training backend, including configuration flags, automatic DDP settings, and state lifecycle management. The overall implementation is sound, but I've identified a critical issue in the FSDP parameter synchronization logic that could lead to inconsistent model states during inference. Additionally, the new example script has a hardcoded model path, which impacts its portability. I've provided suggestions to fix these issues.
```python
for model_chunk in model_chunks:
    fsdp = model_chunk.module
    if getattr(fsdp, "data_parallel_sharding_strategy", None) == "optim_grads_params":
        fsdp.synchronize_param_gather()
        return True
return False
```
The function `synchronize_megatron_fsdp_params` returns immediately after synchronizing the first FSDP module it finds. If `model_chunks` can contain multiple FSDP-wrapped modules (e.g., with pipeline parallelism), only the first one will be synchronized, potentially leading to inconsistent model states and silent correctness issues during inference. The function should iterate through all model chunks and synchronize every applicable FSDP module before returning.
Suggested change:

```diff
-for model_chunk in model_chunks:
-    fsdp = model_chunk.module
-    if getattr(fsdp, "data_parallel_sharding_strategy", None) == "optim_grads_params":
-        fsdp.synchronize_param_gather()
-        return True
-return False
+synchronized = False
+for model_chunk in model_chunks:
+    fsdp = model_chunk.module
+    if getattr(fsdp, "data_parallel_sharding_strategy", None) == "optim_grads_params":
+        fsdp.synchronize_param_gather()
+        synchronized = True
+return synchronized
```
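The difference is easy to demonstrate with stand-in objects (hypothetical `FakeFSDP`/`FakeChunk` classes, not the real Megatron-FSDP types): with the early `return`, only the first chunk's parameters are gathered; the corrected loop touches every chunk.

```python
class FakeFSDP:
    """Hypothetical stand-in for an FSDP-wrapped module."""
    def __init__(self):
        self.data_parallel_sharding_strategy = "optim_grads_params"
        self.synced = False

    def synchronize_param_gather(self):
        self.synced = True


class FakeChunk:
    """Hypothetical stand-in for a model chunk holding a wrapped module."""
    def __init__(self):
        self.module = FakeFSDP()


def sync_all(model_chunks):
    """The corrected loop: synchronize every applicable chunk before returning."""
    synchronized = False
    for model_chunk in model_chunks:
        fsdp = model_chunk.module
        if getattr(fsdp, "data_parallel_sharding_strategy", None) == "optim_grads_params":
            fsdp.synchronize_param_gather()
            synchronized = True
    return synchronized


chunks = [FakeChunk(), FakeChunk()]
assert sync_all(chunks) is True
assert all(c.module.synced for c in chunks)  # every chunk was synchronized
```

With the original early-return version, the second chunk's `synced` flag would remain `False`.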
```
data.max_response_length=512 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=/root/models/Qwen2.5-3B-Instruct \
```
The model path is hardcoded to `/root/models/Qwen2.5-3B-Instruct`. This makes the example script non-portable and difficult for other users to run without modification. It's better to use an environment variable for the model path, which makes the script more generic and easier to use.

For example, you could add `MODEL_PATH=${MODEL_PATH:-/path/to/your/model}` at the top of the script and then use `$MODEL_PATH` here.
Suggested change:

```diff
-actor_rollout_ref.model.path=/root/models/Qwen2.5-3B-Instruct \
+actor_rollout_ref.model.path=${MODEL_PATH} \
```
What does this PR do?
Add Megatron-FSDP as a new training backend option for the Megatron engine. This is an implementation of #5244.
Key changes:
Checklist Before Starting
- Title format: `[{modules}] {type}: {description}` (This will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`, like `[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- For breaking changes, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
[TODO]
API and Usage Example
Enable Megatron-FSDP by adding a single config flag:
```shell
python3 -m verl.trainer.main_ppo --config-path=config \
    --config-name='ppo_megatron_trainer.yaml' \
    actor_rollout_ref.actor.megatron.use_mbridge=True \
    actor_rollout_ref.actor.megatron.vanilla_mbridge=True \
    actor_rollout_ref.actor.megatron.use_megatron_fsdp=True \
    ...
```

The FSDP-specific DDP settings (sharding strategy, overlap, etc.) are auto-configured with defaults. Advanced users can override them:
```shell
actor_rollout_ref.actor.megatron.override_ddp_config.data_parallel_sharding_strategy=optim_grads \
actor_rollout_ref.actor.megatron.override_ddp_config.overlap_grad_reduce=False \
```

Design & Code Changes
Megatron-FSDP uses the same training loop as Megatron.
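The override behavior described above can be pictured as overlaying user-supplied keys on the auto-configured defaults. This is an illustrative sketch only (plain dict merge; the keys mirror the options shown in the usage example, and verl's actual config handling goes through its YAML/Hydra config system, not this code):

```python
# Auto-configured FSDP DDP defaults (values here are illustrative).
default_ddp_config = {
    "data_parallel_sharding_strategy": "optim_grads_params",
    "overlap_grad_reduce": True,
}

# User-supplied override_ddp_config entries from the command line.
override_ddp_config = {
    "data_parallel_sharding_strategy": "optim_grads",
    "overlap_grad_reduce": False,
}

# Overrides take precedence over the defaults; untouched keys keep defaults.
ddp_config = {**default_ddp_config, **override_ddp_config}
print(ddp_config["data_parallel_sharding_strategy"])  # optim_grads
```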
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.