[perf, trtllm] feat: Add Nsight support for rollout server mode (trtllm) #5391
davidmlw wants to merge 14 commits into verl-project:main
Conversation
Commits:
- rollout ranks first ok
- rollout ranks first ok
- run pre-commit
- add docs
- run pre-commit
Code Review
This pull request introduces Nsight profiling support for the trtllm rollout server mode, involving updates to Docker configurations, documentation, and profiling logic. My review has identified two critical issues. First, in the Dockerfile, environment variables like LD_LIBRARY_PATH are set using export within a RUN command, which means they won't persist at runtime, likely causing library loading failures. Second, the profiling logic in ray_trainer.py can lead to nested or unbalanced profiler start/stop calls when using the REMAX advantage estimator, which will cause errors. Addressing these issues is crucial for the stability and correctness of the new profiling feature.
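A reentrancy-safe guard is one common way to avoid the nested/unbalanced start/stop problem described above: only the outermost start/stop pair is forwarded to the profiler backend. The sketch below is illustrative only; the `ProfilerGuard` name and the callback wiring are hypothetical and not code from this PR.

```python
class ProfilerGuard:
    """Forward only the outermost start/stop pair to the profiler backend.

    Nested start() calls (e.g. an extra rollout for the REMAX baseline
    inside an already-profiled step) just increment a depth counter.
    """

    def __init__(self, start_fn, stop_fn):
        self._start_fn = start_fn
        self._stop_fn = stop_fn
        self._depth = 0

    def start(self):
        if self._depth == 0:
            self._start_fn()
        self._depth += 1

    def stop(self):
        if self._depth == 0:
            raise RuntimeError("stop() called without a matching start()")
        self._depth -= 1
        if self._depth == 0:
            self._stop_fn()


# Record which calls actually reach the backend.
calls = []
guard = ProfilerGuard(lambda: calls.append("start"), lambda: calls.append("stop"))
guard.start()
guard.start()  # nested call, e.g. the REMAX baseline rollout
guard.stop()
guard.stop()
```

With this pattern, `calls` contains exactly one `"start"` and one `"stop"`, regardless of nesting depth.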
docker/Dockerfile.stable.trtllm
Outdated
```dockerfile
export NVSHMEM_DIR=/usr/local/lib/python3.12/dist-packages/nvidia/nvshmem && \
export LD_LIBRARY_PATH="${NVSHMEM_DIR}/lib:$LD_LIBRARY_PATH" && \
export PATH="${NVSHMEM_DIR}/bin:$PATH" && \
```
The export commands only set environment variables for the current RUN layer and will not persist in the final container image's environment. This means LD_LIBRARY_PATH will not include the nvshmem library path at runtime, which can lead to "library not found" errors when the application tries to load shared libraries from that path. You should use the ENV instruction to set environment variables that are required at runtime to ensure they are available to the container's processes.
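For example, the same variables can be made persistent with `ENV` instructions (paths copied from the snippet above); `ENV` values are baked into the image metadata and are visible to every process in the running container:

```dockerfile
ENV NVSHMEM_DIR=/usr/local/lib/python3.12/dist-packages/nvidia/nvshmem
ENV LD_LIBRARY_PATH="${NVSHMEM_DIR}/lib:${LD_LIBRARY_PATH}"
ENV PATH="${NVSHMEM_DIR}/bin:${PATH}"
```

Note that `ENV` may still be combined with `export` inside `RUN` layers for build-time steps, but anything needed at runtime must go through `ENV`.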
@wuxibin89 @Superjomn @hchings please have a review.
```python
    if profiling_replica_ranks
    else []
)
print(f"david: profiling_replicas: {self.profiling_replicas}")
```
Remove this debugging print statement. If the output is needed, emit it through the project's logging facilities instead.
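For instance, a logger call keeps the information available at debug level without polluting stdout. This is a generic sketch, not code from this PR; the logger name and replica ranks are placeholders:

```python
import logging

logger = logging.getLogger("ray_trainer")  # hypothetical logger name


def log_profiling_replicas(replicas):
    """Report the selected replicas through logging instead of print."""
    message = "profiling_replicas: %s" % (replicas,)
    logger.debug(message)
    return message


result = log_profiling_replicas([0, 2])  # placeholder replica ranks
```

The message only appears when the debug level is enabled, so production runs stay quiet by default.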
```yaml
enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}

# Whether to profile all replicas.
all_replicas: false
```
With this default configuration, no replicas are profiled, whereas the original code profiled all replicas by default. The default should stay consistent with the pre-change behavior, so it is recommended to set all_replicas: true.
I have now set all_ranks: True in the actor yaml. By default all resources are profiled; users can set it to false and select specific items instead.
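A minimal sketch of the discussed default, assuming the profiler section keeps the keys shown in the diff above (the `profiling_replica_ranks` key name is taken from the reviewed code; its placement here is an assumption):

```yaml
profiler:
  enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
  # Profile every replica by default, matching the pre-change behavior.
  all_replicas: true
  # Only consulted when all_replicas is false.
  profiling_replica_ranks: []
```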
#5215
What does this PR do?
verl rollout has moved to server mode and needs a new Nsight profiling scheme. This PR works together with NVIDIA/TensorRT-LLM#11493.
Checklist Before Starting
- Title format: `[{modules}] {type}: {description}` (this will be checked by the CI).
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`.
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
- For breaking changes, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
API and Usage Example
see verl/examples/grpo_trainer/run_qwen2-7b_math_trtllm_nsys.sh

Design & Code Changes
Route the Nsight options to the trtllm rollout server, and use start/stop profile calls to control the trtllm profiling ranges.
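The start/stop control flow can be sketched roughly as follows. The `RolloutServerClient` class and its methods are hypothetical stand-ins for the actual trtllm server API, not code from this PR; only the bracketing pattern is the point:

```python
from contextlib import contextmanager


class RolloutServerClient:
    """Hypothetical stand-in for the trtllm rollout server handle."""

    def __init__(self):
        self.events = []

    def start_profile(self):
        self.events.append("start_profile")

    def stop_profile(self):
        self.events.append("stop_profile")

    def generate(self, prompt):
        self.events.append("generate")
        return prompt.upper()


@contextmanager
def nsys_profiled(client, enabled):
    """Bracket a rollout step with start/stop profile calls when enabled."""
    if enabled:
        client.start_profile()
    try:
        yield
    finally:
        if enabled:
            client.stop_profile()


client = RolloutServerClient()
with nsys_profiled(client, enabled=True):
    out = client.generate("hello")
```

The context manager guarantees `stop_profile` runs even if generation raises, which keeps the Nsight capture range well-formed.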
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.