[fsdp] fix: Support trust_remote_code during FSDP HuggingFace checkpoint save #5200
Conversation
Code Review
This pull request addresses a crash during FSDP checkpoint saving for models requiring trust_remote_code=True by introducing logic to detect and pass this flag. However, a critical security concern arises as automatically enabling trust_remote_code bypasses an important security boundary, potentially allowing arbitrary code execution from malicious models during checkpoint saving. Furthermore, the new remote code detection logic has a potential critical issue that could lead to an AttributeError. The .pre-commit-config.yaml update is a good improvement.
About the security concern raised, we could cross-verify this. My thinking was that if we get to the saving point for a model that required remote code, we can assume the user has enabled `trust_remote_code` for the actor model; otherwise training would have failed at model loading time. Let me know if you'd prefer the explicit approach though.
Hi @HollowMan6, who should I ping for a review here? Thanks!
HollowMan6
left a comment
I can help review here by myself :)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Force-pushed fcc0a69 to 66a4f4d
Force-pushed 66a4f4d to 45d80cb
New implementation added @HollowMan6, let me know if you'd like some modification to tests/special_distributed/test_fsdp_ckpt.py, added to demonstrate the use of the new parameter in FSDPCheckpointManager.
HollowMan6
left a comment
> let me know if you'd like some modification to tests/special_distributed/test_fsdp_ckpt.py added to demonstrate the use of the new parameter in FSDPCheckpointManager

Feel free
Pull request overview
Propagates the Hugging Face trust_remote_code setting into FSDP checkpoint save logic so models that rely on custom code can be re-instantiated from config during hf_model checkpoint export (fixing the failure reported in #5214). Also adjusts pre-commit’s Python compilation hook to avoid scanning virtualenv directories.
Changes:
- Thread `trust_remote_code` from model config into `FSDPCheckpointManager` across FSDP workers/engines/trainers.
- Pass `trust_remote_code` to `AutoModel*.from_config(...)` when building the temporary model used for `hf_model` save.
- Exclude `.venv`/`venv`/`.git` from the pre-commit `compileall` scan.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| `verl/workers/fsdp_workers.py` | Passes `trust_remote_code` into `FSDPCheckpointManager` for actor/critic checkpointing. |
| `verl/workers/engine/veomni/transformer_impl.py` | Forwards `trust_remote_code` to checkpoint manager in VeOmni engine. |
| `verl/workers/engine/fsdp/transformer_impl.py` | Forwards `trust_remote_code` to checkpoint manager in FSDP engine. |
| `verl/utils/checkpoint/fsdp_checkpoint_manager.py` | Adds `trust_remote_code` parameter and uses it when instantiating the save-time HF model via `from_config`. |
| `verl/trainer/fsdp_sft_trainer.py` | Forwards `trust_remote_code` into checkpoint manager for SFT. |
| `verl/experimental/vla/fsdp_workers.py` | Forwards `trust_remote_code` into checkpoint manager for experimental VLA worker. |
| `.pre-commit-config.yaml` | Updates `compileall` hook to exclude env/git directories. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
HollowMan6
left a comment
Thanks for your contribution! The changes now LGTM.
What does this PR do?
Fixes #5214
Ensures that `trust_remote_code` is passed correctly to `auto_model_cls.from_config` when the model requires remote code during saving.

While testing GRPO full fine-tuning of the `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` model I got:

To replicate the issue, use an 80GB+ GPU and run:

After applying the fix I'm able to save the model.
Update .pre-commit config
pre-commit checks were failing because the Python compilation check scans the entire directory, including `.venv`, which can contain other Python files that do not conform to the requirements. The PR adds some exclusions to avoid this.
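For illustration, such an exclusion could look like the following hook entry in `.pre-commit-config.yaml` (the hook name and regex shown are assumptions, not necessarily the exact pattern from this PR; pre-commit's hook-level `exclude` key takes a Python regex matched against file paths):

```yaml
- repo: local
  hooks:
    - id: check-compileall
      name: python compile check
      entry: python -m compileall -q
      language: system
      types: [python]
      # Illustrative exclusion: skip virtualenv and git internals
      exclude: ^(\.venv|venv|\.git)/
```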
Checklist Before Starting
- [ ] Title follows the format `[{modules}] {type}: {description}` (this will be checked by the CI). `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`. `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`. If this PR breaks any API, prepend `[BREAKING]` to the beginning of the title, like `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
Despite my best efforts I'm not able to produce a unit test that detects this; I can only observe the issue while training with Ray.
I've written https://gist.github.com/thvasilo/76596a638440cd7f342ba4d23a2efb2e to replicate the process, but it does not trigger the error. My current assumption is an execution environment discrepancy: I'm using torchrun to run the test, which shares processes, while my job fails when using Ray.
API and Usage Example
No API changes; we detect custom models from their contents.
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- [ ] Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- [ ] Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- [ ] If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.