
[tool] feature: scheduling analysis based on profiling data for torch profiler #5367

Draft
Rhetee wants to merge 7 commits into verl-project:main from Rhetee:main

Conversation


@Rhetee Rhetee commented Feb 22, 2026

What does this PR do?

This PR depends on #5248 and will be updated once that PR is merged.
This PR adds the torch_parser for cluster analysis.
The torch_parser.py module parses and processes PyTorch Profiler data for cluster analysis. Inheriting from BaseClusterParser, it defines the overall data-processing workflow through two methods: allocate_prof_data() and parse_analysis_data().
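The two-stage workflow can be sketched roughly as follows; apart from the two method names allocate_prof_data() and parse_analysis_data(), every name and all logic here is an illustrative assumption, not the PR's actual implementation:

```python
# Hypothetical sketch of the parser workflow described above.
from pathlib import Path


class TorchClusterParser:
    """Parses PyTorch Profiler traces for cluster analysis (sketch)."""

    def __init__(self, input_path: str):
        self.input_path = Path(input_path)

    def allocate_prof_data(self) -> dict[str, list[Path]]:
        # Discover .json.gz traces and group them by their role directory.
        prof_data: dict[str, list[Path]] = {}
        for trace in sorted(self.input_path.rglob("*.json.gz")):
            prof_data.setdefault(trace.parent.name, []).append(trace)
        return prof_data

    def parse_analysis_data(self, prof_data: dict[str, list[Path]]) -> list[dict]:
        # Emit one event record per trace file (placeholder extraction logic).
        events = []
        for role, traces in prof_data.items():
            for rank_id, trace in enumerate(traces):
                events.append({"role": role, "rank": rank_id, "path": str(trace)})
        return events
```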

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: [tool] feature: scheduling analysis based on profiling data #5248
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots or evaluation results.

[Image: profile_pic4]

Due to time dilation introduced during profiling data collection, the intervals between stages in the visualization appear longer than in the actual execution. However, the duration of each individual stage remains accurate.

The following results compare the execution times of a step with profiling enabled and a step with profiling disabled during data collection.

| Role | Step with profiling (s) | Step w/o profiling (s) |
| --- | --- | --- |
| timing_s/update_actor | 12.36 | 6.99 |
| timing_s/old_log_prob | 3.24 | 1.99 |
| timing_s/gen | 37.85 | 5.38 |
| time_per_step | 62.70 | 24.03 |

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

python cluster_analysis.py --input-path ./data --output-path ./output --profiler-type torch

Input Requirements (default torch_profiler output in verl):

  • PyTorch Profiler JSON.gz files (ending with .json.gz)
  • Files organized in directories by role (e.g., update_actor)
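As a rough illustration of consuming such input, a single .json.gz trace can be loaded like this. The helper name is hypothetical; the top-level "traceEvents" key is the standard Chrome-trace layout that torch.profiler exports:

```python
# Load one PyTorch Profiler trace file in the .json.gz layout described above.
import gzip
import json


def load_trace_events(path: str) -> list[dict]:
    # torch.profiler exports Chrome-trace JSON; decompress and parse it.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        trace = json.load(f)
    # Events live in the top-level "traceEvents" list.
    return trace.get("traceEvents", [])
```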

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.
verl/tools/cluster_analysis/torch_parser.py : +169 lines (new)

  • Implements TorchClusterParser class inheriting from BaseClusterParser
  • Provides allocate_prof_data() for file discovery and organization
  • Provides parse_analysis_data() for parsing and event extraction
  • Includes helper methods for data mapping and rank path generation

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a new feature for scheduling analysis based on profiling data for NVTX, alongside existing MSTX support. It includes new parser and visualizer modules, as well as documentation for the new feature. The code is generally well-structured and includes unit tests for the new components. The documentation is clear and provides good examples. I've identified a few areas for improvement regarding error handling, logging, and consistency in the mstx_parser.py and nvtx_parser.py files.

Comment on lines +103 to +104
logger.warning(f"Rank {rank_id}: No rollout events found in json")
return events

high

The process_id variable is initialized to None and then assigned within a loop. If the loop completes without finding "Overlap Analysis", process_id remains None, which is then checked. This is fine, but the start_ids and end_ids are also initialized to None and then updated within a subsequent loop. If the process_id is not found, the second loop is skipped, and start_ids and end_ids will remain None, leading to the warning on line 147. This is correct behavior, but the warning on line 103 is redundant as the subsequent check for process_id is None on line 115 already covers this case. Consider removing the warning on line 103 to avoid duplicate logging for the same underlying issue.

Comment on lines +115 to +116
logger.warning(f"Rank {rank_id}: Overlap Analysis process not found in json")
return events

high

The warning message here is slightly misleading. If process_id is None, it means "Overlap Analysis" was not found, not necessarily that the process itself was not found. Consider rephrasing for clarity.

Suggested change:
- logger.warning(f"Rank {rank_id}: Overlap Analysis process not found in json")
  return events
+ logger.warning(f"Rank {rank_id}: 'Overlap Analysis' entry not found in json")

Comment on lines +127 to +128
if "ts" not in row or "dur" not in row:
logger.warning("Row missing required fields: ts or dur. Skipping row.")

high

The args variable is checked for isinstance(args, dict) on line 123. If it's not a dict, the loop continues. However, the args variable is not used after this check within this loop. This check seems misplaced or unnecessary if args is not used. If args is intended to be used later, it should be within the scope of the if isinstance(args, dict): block.

Comment on lines +137 to +144
if start_ids is None or start_time_ns < start_ids:
start_ids = start_time_ns
if end_ids is None or end_time_ns > end_ids:
end_ids = end_time_ns

except (ValueError, TypeError) as e:
logger.warning(f"Failed to convert time values: {e}. Row data: {row}. Skipping row.")
continue

high

The logic for updating start_ids and end_ids can be simplified using min and max functions, which are more Pythonic and often more readable. This also helps in handling the initial None values more cleanly.

Suggested change:
- if start_ids is None or start_time_ns < start_ids:
-     start_ids = start_time_ns
- if end_ids is None or end_time_ns > end_ids:
-     end_ids = end_time_ns
- except (ValueError, TypeError) as e:
-     logger.warning(f"Failed to convert time values: {e}. Row data: {row}. Skipping row.")
-     continue
+ start_time_ns = float(row["ts"])
+ duration_ns = float(row["dur"])
+ end_time_ns = start_time_ns + duration_ns
+ start_ids = min(start_ids, start_time_ns) if start_ids is not None else start_time_ns
+ end_ids = max(end_ids, end_time_ns) if end_ids is not None else end_time_ns
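A quick standalone check (not part of the PR) that the suggested min/max form keeps the same semantics as the original if-chain for accumulating time bounds over rows:

```python
# Compute the overall (earliest start, latest end) bounds across trace rows,
# using the min/max form suggested in the review.
def bounds(rows: list[dict]) -> tuple[float, float]:
    start_ids = end_ids = None
    for row in rows:
        start_time_ns = float(row["ts"])
        end_time_ns = start_time_ns + float(row["dur"])
        # None means "no rows seen yet"; otherwise fold with min/max.
        start_ids = min(start_ids, start_time_ns) if start_ids is not None else start_time_ns
        end_ids = max(end_ids, end_time_ns) if end_ids is not None else end_time_ns
    return start_ids, end_ids
```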

Comment on lines +196 to +198
if self._rank_list != "all":
logger.error("RL analysis currently only supports processing all ranks")
return []

high

The error message here states that "RL analysis currently only supports processing all ranks". However, the _rank_list attribute is already set based on the rank-list argument. If rank-list is not "all", this function will return an empty list, effectively preventing any analysis. This is a high-severity issue because it indicates a discrepancy between the intended functionality (supporting specific ranks) and the current implementation, which explicitly disallows it. If specific rank processing is not supported, the argument parsing should reflect that, or the error message should be more precise about the limitation.

Comment on lines +62 to +63
generate_rl_timeline(data, output_path)
print("in html")

high

The print("in html") statement is likely a leftover from debugging. It should be removed in production code to avoid unnecessary console output.

Comment on lines +66 to +67
@register_cluster_visualizer("chart")
def cluster_visualizer_chart(data: pd.DataFrame, output_path: str, config: dict) -> None:

high

The print("in chart") statement is likely a leftover from debugging. It should be removed in production code to avoid unnecessary console output.

Comment on lines +114 to +115
raise ValueError(f"input_data: {input_data} is None!")


high

The error message input_data: {input_data} is None! is a bit redundant. It's clear that input_data is None from the condition. A more concise message would be Input data cannot be None.

Suggested change:
- raise ValueError(f"input_data: {input_data} is None!")
+ raise ValueError("Input data cannot be None!")

Comment on lines +160 to +167
[
{
"Start": short["Start"].min(),
"Finish": short["Finish"].max(),
"Role": short.iloc[0]["Role"],
"Rank ID": short.iloc[0]["Rank ID"],
"Name": short.iloc[0]["Name"],
"Duration": short["Finish"].max() - short["Start"].min(),

high

The merged DataFrame is created with Name and Role taken from short.iloc[0]. This assumes that all short events within a group (Role, Rank ID, Name) have the same Name and Role, which is true by the groupby key. However, the Name of the merged event should ideally reflect that it's a consolidation of multiple short events, rather than just taking the name of the first one. This could lead to confusion in the visualization if the original Name is important. Consider using a generic name like "Merged Short Events" or appending a suffix to the original name.
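One possible shape for that suggestion, sketched as a hypothetical helper: the column names follow the quoted snippet, and the "(merged xN)" suffix format is an assumption, not the reviewer's exact wording:

```python
# Collapse a group of short events (same Role / Rank ID / Name) into one
# merged record whose Name signals that it is a consolidation.
import pandas as pd


def merge_short_events(short: pd.DataFrame) -> dict:
    return {
        "Start": short["Start"].min(),
        "Finish": short["Finish"].max(),
        "Role": short.iloc[0]["Role"],
        "Rank ID": short.iloc[0]["Rank ID"],
        # Suffix the name so the visualization distinguishes merged events.
        "Name": f'{short.iloc[0]["Name"]} (merged x{len(short)})',
        "Duration": short["Finish"].max() - short["Start"].min(),
    }
```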

Comment on lines +200 to +202
return int(label.split(" - Rank ")[-1])
except Exception:
return float("inf")

high

The _extract_rank function uses float("inf") for cases where the rank cannot be extracted. While this works for sorting, it might be more robust to handle such cases explicitly, perhaps by logging a warning or assigning a default rank if the format is not as expected. This is a high-severity issue because it can lead to unexpected sorting behavior if the Y_Label format deviates from "Role - Rank ID".

@Rhetee Rhetee changed the title [tool] feature: scheduling analysis based on profiling data for nvtx [tool] feature: scheduling analysis based on profiling data for torch profiler Feb 24, 2026