
[perf, trtllm] feat: Add Nsight support for rollout server mode (trtllm)#5391

Open
davidmlw wants to merge 14 commits intoverl-project:mainfrom
joyang-nv:liweim/nsys

Conversation

@davidmlw
Collaborator

What does this PR do?

verl rollout has moved to server mode and needs a new Nsight profiling scheme. This PR works together with NVIDIA/TensorRT-LLM#11493.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

see verl/examples/grpo_trainer/run_qwen2-7b_math_trtllm_nsys.sh

Design & Code Changes

Route the Nsight options to the trtllm rollout server, and use start/stop profile calls to control the trtllm profiling ranges.
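The start/stop bracketing described above can be sketched as a small context manager. This is a minimal illustration, not the PR's implementation: `start_fn` and `stop_fn` are placeholders for whatever start-profile/stop-profile RPC the trtllm rollout server exposes (the actual API is added in NVIDIA/TensorRT-LLM#11493).

```python
from contextlib import contextmanager


@contextmanager
def profile_rollout(start_fn, stop_fn, enabled=True):
    """Bracket a rollout generation step with profiler start/stop calls.

    start_fn/stop_fn are hypothetical callables standing in for the
    rollout server's start-profile / stop-profile RPCs; when profiling
    is disabled the body runs without touching the profiler.
    """
    if not enabled:
        yield
        return
    start_fn()
    try:
        yield
    finally:
        # Always stop, even if generation raises, so start/stop stay balanced.
        stop_fn()
```

Keeping the stop call in a `finally` block avoids the unbalanced start/stop pairs that the automated review flags below for the REMAX code path.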

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

  • rollout ranks first ok
  • run pre-commit
  • add docs
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces Nsight profiling support for the trtllm rollout server mode, involving updates to Docker configurations, documentation, and profiling logic. My review has identified two critical issues. First, in the Dockerfile, environment variables like LD_LIBRARY_PATH are set using export within a RUN command, which means they won't persist at runtime, likely causing library loading failures. Second, the profiling logic in ray_trainer.py can lead to nested or unbalanced profiler start/stop calls when using the REMAX advantage estimator, which will cause errors. Addressing these issues is crucial for the stability and correctness of the new profiling feature.

Comment on lines 19 to 21
export NVSHMEM_DIR=/usr/local/lib/python3.12/dist-packages/nvidia/nvshmem && \
export LD_LIBRARY_PATH="${NVSHMEM_DIR}/lib:$LD_LIBRARY_PATH" && \
export PATH="${NVSHMEM_DIR}/bin:$PATH" && \
Contributor


critical

The export commands only set environment variables for the current RUN layer and will not persist in the final container image's environment. This means LD_LIBRARY_PATH will not include the nvshmem library path at runtime, which can lead to "library not found" errors when the application tries to load shared libraries from that path. You should use the ENV instruction to set environment variables that are required at runtime to ensure they are available to the container's processes.
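The suggested fix could look like the following sketch, using `ENV` so the variables persist into the image's runtime environment (paths taken from the snippet above; Docker substitutes previously set `ENV` values in later `ENV` lines):

```dockerfile
# Sketch: persist the nvshmem paths at runtime instead of exporting
# them inside a single RUN layer.
ENV NVSHMEM_DIR=/usr/local/lib/python3.12/dist-packages/nvidia/nvshmem
ENV LD_LIBRARY_PATH=${NVSHMEM_DIR}/lib:${LD_LIBRARY_PATH}
ENV PATH=${NVSHMEM_DIR}/bin:${PATH}
```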

@davidmlw davidmlw changed the title [BREAKING][perf, trtllm] feat: Add Nsight support for rollout server mode (trtllm) [perf, trtllm] feat: Add Nsight support for rollout server mode (trtllm) Feb 27, 2026
@davidmlw
Collaborator Author

@wuxibin89 @Superjomn @hchings please have a review.
Most of the CI passes; the remaining tests pass locally.
Currently it depends on trtllm updating its APIs to work (only three lines, which I have commented as TODOs).
@wuxibin89 please review the infra modifications. @Superjomn please review the trtllm API updates. @hchings please review the trtllm unit test.

if profiling_replica_ranks
else []
)
print(f"david: profiling_replicas: {self.profiling_replicas}")
Collaborator

@tardis-key Feb 27, 2026


Remove the debugging print. If the output is needed, use formatted logging instead.
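The suggestion above could look like this sketch, replacing the bare `print` with a debug-level log line (the helper name is illustrative, not from the PR):

```python
import logging

logger = logging.getLogger(__name__)


def format_profiling_replicas(profiling_replicas):
    # Sort for a stable, readable summary regardless of set ordering.
    return "profiling_replicas: %s" % sorted(profiling_replicas)


# Emit at debug level so the message disappears in production runs
# unless DEBUG logging is explicitly enabled.
logger.debug(format_profiling_replicas({2, 0, 1}))
```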

enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}

# Whether to profile all replicas.
all_replicas: false
Collaborator


With this default configuration no replicas are collected, but the original code collected all replicas by default. The default should be consistent with the pre-modification behavior, so it is recommended to set all_replicas: true.
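The recommended default would then read like the following fragment (keys mirror the snippet above; the surrounding structure is assumed):

```yaml
profiler:
  enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
  # Profile every replica by default, matching the pre-change behavior;
  # users can narrow the selection after setting this to false.
  all_replicas: true
```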

Collaborator Author


Then I set all_ranks: True in actor.yaml: by default all resources are profiled, and users who want to select specific items can set "all" to false first.

@tardis-key
Collaborator

#5215 adds a check of the profiling functionality and the necessary input/output contents to the NPU CI. It is recommended to wait for that CI check to merge.
