
[perf, trtllm] feat: Add Nsight support for rollout server mode (trtllm)#5391

Open
davidmlw wants to merge 14 commits intoverl-project:mainfrom
joyang-nv:liweim/nsys

Conversation

@davidmlw
Collaborator

What does this PR do?

verl rollout has moved to server mode and needs a new Nsight profiling scheme. This PR works together with NVIDIA/TensorRT-LLM#11493.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

see verl/examples/grpo_trainer/run_qwen2-7b_math_trtllm_nsys.sh

Design & Code Changes

Route the Nsight options to the trtllm rollout server, and use start/stop profile calls to control the trtllm profiling ranges.
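The start/stop bracketing described above can be sketched as a small context manager. This is a minimal illustration, not the PR's implementation: `start_fn` and `stop_fn` are placeholders for whatever start-profile/stop-profile RPC the trtllm rollout server exposes (the actual API is added in NVIDIA/TensorRT-LLM#11493).

```python
from contextlib import contextmanager


@contextmanager
def profile_rollout(start_fn, stop_fn, enabled=True):
    """Bracket a rollout generation step with profiler start/stop calls.

    start_fn/stop_fn are hypothetical callables standing in for the
    rollout server's start-profile / stop-profile RPCs; when profiling
    is disabled the body runs without touching the profiler.
    """
    if not enabled:
        yield
        return
    start_fn()
    try:
        yield
    finally:
        # Always stop, even if generation raises, so start/stop stay balanced.
        stop_fn()
```

Keeping the stop call in a `finally` block avoids the unbalanced start/stop pairs that the automated review flags below for the REMAX code path.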

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

  • rollout ranks first ok
  • run pre-commit
  • add docs
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces Nsight profiling support for the trtllm rollout server mode, involving updates to Docker configurations, documentation, and profiling logic. My review has identified two critical issues. First, in the Dockerfile, environment variables like LD_LIBRARY_PATH are set using export within a RUN command, which means they won't persist at runtime, likely causing library loading failures. Second, the profiling logic in ray_trainer.py can lead to nested or unbalanced profiler start/stop calls when using the REMAX advantage estimator, which will cause errors. Addressing these issues is crucial for the stability and correctness of the new profiling feature.

Comment on lines 19 to 21
export NVSHMEM_DIR=/usr/local/lib/python3.12/dist-packages/nvidia/nvshmem && \
export LD_LIBRARY_PATH="${NVSHMEM_DIR}/lib:$LD_LIBRARY_PATH" && \
export PATH="${NVSHMEM_DIR}/bin:$PATH" && \
Contributor


critical

The export commands only set environment variables for the current RUN layer and will not persist in the final container image's environment. This means LD_LIBRARY_PATH will not include the nvshmem library path at runtime, which can lead to "library not found" errors when the application tries to load shared libraries from that path. You should use the ENV instruction to set environment variables that are required at runtime to ensure they are available to the container's processes.
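The suggested fix could look like the following sketch, using `ENV` so the variables persist into the image's runtime environment (paths taken from the snippet above; Docker substitutes previously set `ENV` values in later `ENV` lines):

```dockerfile
# Sketch: persist the nvshmem paths at runtime instead of exporting
# them inside a single RUN layer.
ENV NVSHMEM_DIR=/usr/local/lib/python3.12/dist-packages/nvidia/nvshmem
ENV LD_LIBRARY_PATH=${NVSHMEM_DIR}/lib:${LD_LIBRARY_PATH}
ENV PATH=${NVSHMEM_DIR}/bin:${PATH}
```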

@davidmlw davidmlw changed the title [BREAKING][perf, trtllm] feat: Add Nsight support for rollout server mode (trtllm) [perf, trtllm] feat: Add Nsight support for rollout server mode (trtllm) Feb 27, 2026
@davidmlw
Collaborator Author

@wuxibin89 @Superjomn @hchings please have a review.
Most of the CI passes; the remaining tests pass locally.
Currently it depends on trtllm updating its APIs to work (only three lines, which I have commented as TODOs).
@wuxibin89 please review the infra modifications. @Superjomn please review the trtllm API updates. @hchings please review the trtllm unit test.

if profiling_replica_ranks
else []
)
print(f"david: profiling_replicas: {self.profiling_replicas}")
Collaborator

@tardis-key Feb 27, 2026


Remove the debugging print. If the output is needed, use formatted logging instead.
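The suggestion above could look like this sketch, replacing the bare `print` with a debug-level log line (the helper name is illustrative, not from the PR):

```python
import logging

logger = logging.getLogger(__name__)


def format_profiling_replicas(profiling_replicas):
    # Sort for a stable, readable summary regardless of set ordering.
    return "profiling_replicas: %s" % sorted(profiling_replicas)


# Emit at debug level so the message disappears in production runs
# unless DEBUG logging is explicitly enabled.
logger.debug(format_profiling_replicas({2, 0, 1}))
```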

enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}

# Whether to profile all replicas.
all_replicas: false
Collaborator


With this default configuration no replicas are collected, but the original code collected all replicas by default. The default should be consistent with the pre-modification behavior, so it is recommended to set all_replicas: true.
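The recommended default would then read like the following fragment (keys mirror the snippet above; the surrounding structure is assumed):

```yaml
profiler:
  enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
  # Profile every replica by default, matching the pre-change behavior;
  # users can narrow the selection after setting this to false.
  all_replicas: true
```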

Collaborator Author


Then I set all_ranks: True in actor.yaml: by default all resources are profiled, and users who want to select specific items can set "all" to false first.

@tardis-key
Collaborator

#5215 adds a check of the profiling functionality and the necessary input/output contents to the NPU CI. It is recommended to wait for that CI check to merge.
