
[tool] feature: scheduling analysis based on profiling data for torch profiler #5367

Draft
Rhetee wants to merge 7 commits into verl-project:main from Rhetee:main

Conversation


@Rhetee Rhetee commented Feb 22, 2026

What does this PR do?

This PR depends on #5248 and will be updated once that PR is merged.
This PR adds the torch_parser for cluster analysis.
The torch_parser.py module parses and processes PyTorch Profiler data for cluster analysis. Inheriting from BaseClusterParser, it defines the overall data-processing workflow through two methods: allocate_prof_data() and parse_analysis_data().
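The two-stage workflow can be sketched roughly as follows; apart from the two method names allocate_prof_data() and parse_analysis_data(), every name and all logic here is an illustrative assumption, not the PR's actual implementation:

```python
# Hypothetical sketch of the parser workflow described above.
from pathlib import Path


class TorchClusterParser:
    """Parses PyTorch Profiler traces for cluster analysis (sketch)."""

    def __init__(self, input_path: str):
        self.input_path = Path(input_path)

    def allocate_prof_data(self) -> dict[str, list[Path]]:
        # Discover .json.gz traces and group them by their role directory.
        prof_data: dict[str, list[Path]] = {}
        for trace in sorted(self.input_path.rglob("*.json.gz")):
            prof_data.setdefault(trace.parent.name, []).append(trace)
        return prof_data

    def parse_analysis_data(self, prof_data: dict[str, list[Path]]) -> list[dict]:
        # Emit one event record per trace file (placeholder extraction logic).
        events = []
        for role, traces in prof_data.items():
            for rank_id, trace in enumerate(traces):
                events.append({"role": role, "rank": rank_id, "path": str(trace)})
        return events
```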

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: [tool] feature: scheduling analysis based on profiling data #5248
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots or evaluation results.

[Image: profile_pic4]

Due to time dilation introduced during profiling data collection, the intervals between stages in the visualization appear longer than in the actual execution. However, the duration of each individual stage remains accurate.

The following results compare the execution times of a step with profiling enabled and a step with profiling disabled during data collection.

| Role | Step with profiling (s) | Step w/o profiling (s) |
| --- | --- | --- |
| timing_s/update_actor | 12.36 | 6.99 |
| timing_s/old_log_prob | 3.24 | 1.99 |
| timing_s/gen | 37.85 | 5.38 |
| time_per_step | 62.70 | 24.03 |

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

python cluster_analysis.py --input-path ./data --output-path ./output --profiler-type torch

Input Requirements (default torch_profiler output in verl):

  • PyTorch Profiler JSON.gz files (ending with .json.gz)
  • Files organized in directories by role (e.g., update_actor)
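As a rough illustration of consuming such input, a single .json.gz trace can be loaded like this. The helper name is hypothetical; the top-level "traceEvents" key is the standard Chrome-trace layout that torch.profiler exports:

```python
# Load one PyTorch Profiler trace file in the .json.gz layout described above.
import gzip
import json


def load_trace_events(path: str) -> list[dict]:
    # torch.profiler exports Chrome-trace JSON; decompress and parse it.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        trace = json.load(f)
    # Events live in the top-level "traceEvents" list.
    return trace.get("traceEvents", [])
```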

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.
verl/tools/cluster_analysis/torch_parser.py : +169 lines (new)

  • Implements TorchClusterParser class inheriting from BaseClusterParser
  • Provides allocate_prof_data() for file discovery and organization
  • Provides parse_analysis_data() for parsing and event extraction
  • Includes helper methods for data mapping and rank path generation

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a new feature for scheduling analysis based on profiling data for NVTX, alongside existing MSTX support. It includes new parser and visualizer modules, as well as documentation for the new feature. The code is generally well-structured and includes unit tests for the new components. The documentation is clear and provides good examples. I've identified a few areas for improvement regarding error handling, logging, and consistency in the mstx_parser.py and nvtx_parser.py files.

Comment on lines +103 to +104
logger.warning(f"Rank {rank_id}: No rollout events found in json")
return events

high

The process_id variable is initialized to None and then assigned within a loop. If the loop completes without finding "Overlap Analysis", process_id remains None, which is then checked. This is fine, but the start_ids and end_ids are also initialized to None and then updated within a subsequent loop. If the process_id is not found, the second loop is skipped, and start_ids and end_ids will remain None, leading to the warning on line 147. This is correct behavior, but the warning on line 103 is redundant as the subsequent check for process_id is None on line 115 already covers this case. Consider removing the warning on line 103 to avoid duplicate logging for the same underlying issue.

Comment on lines +115 to +116
logger.warning(f"Rank {rank_id}: Overlap Analysis process not found in json")
return events

high

The warning message here is slightly misleading. If process_id is None, it means "Overlap Analysis" was not found, not necessarily that the process itself was not found. Consider rephrasing for clarity.

Suggested change:
- logger.warning(f"Rank {rank_id}: Overlap Analysis process not found in json")
  return events
+ logger.warning(f"Rank {rank_id}: 'Overlap Analysis' entry not found in json")

Comment on lines +127 to +128
if "ts" not in row or "dur" not in row:
logger.warning("Row missing required fields: ts or dur. Skipping row.")

high

The args variable is checked for isinstance(args, dict) on line 123. If it's not a dict, the loop continues. However, the args variable is not used after this check within this loop. This check seems misplaced or unnecessary if args is not used. If args is intended to be used later, it should be within the scope of the if isinstance(args, dict): block.

Comment on lines +137 to +144
if start_ids is None or start_time_ns < start_ids:
start_ids = start_time_ns
if end_ids is None or end_time_ns > end_ids:
end_ids = end_time_ns

except (ValueError, TypeError) as e:
logger.warning(f"Failed to convert time values: {e}. Row data: {row}. Skipping row.")
continue

high

The logic for updating start_ids and end_ids can be simplified using min and max functions, which are more Pythonic and often more readable. This also helps in handling the initial None values more cleanly.

Suggested change:
- if start_ids is None or start_time_ns < start_ids:
-     start_ids = start_time_ns
- if end_ids is None or end_time_ns > end_ids:
-     end_ids = end_time_ns
- except (ValueError, TypeError) as e:
-     logger.warning(f"Failed to convert time values: {e}. Row data: {row}. Skipping row.")
-     continue
+ start_time_ns = float(row["ts"])
+ duration_ns = float(row["dur"])
+ end_time_ns = start_time_ns + duration_ns
+ start_ids = min(start_ids, start_time_ns) if start_ids is not None else start_time_ns
+ end_ids = max(end_ids, end_time_ns) if end_ids is not None else end_time_ns
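A quick standalone check (not part of the PR) that the suggested min/max form keeps the same semantics as the original if-chain for accumulating time bounds over rows:

```python
# Compute the overall (earliest start, latest end) bounds across trace rows,
# using the min/max form suggested in the review.
def bounds(rows: list[dict]) -> tuple[float, float]:
    start_ids = end_ids = None
    for row in rows:
        start_time_ns = float(row["ts"])
        end_time_ns = start_time_ns + float(row["dur"])
        # None means "no rows seen yet"; otherwise fold with min/max.
        start_ids = min(start_ids, start_time_ns) if start_ids is not None else start_time_ns
        end_ids = max(end_ids, end_time_ns) if end_ids is not None else end_time_ns
    return start_ids, end_ids
```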

Comment on lines +196 to +198
if self._rank_list != "all":
logger.error("RL analysis currently only supports processing all ranks")
return []

high

The error message here states that "RL analysis currently only supports processing all ranks". However, the _rank_list attribute is already set based on the rank-list argument. If rank-list is not "all", this function will return an empty list, effectively preventing any analysis. This is a high-severity issue because it indicates a discrepancy between the intended functionality (supporting specific ranks) and the current implementation, which explicitly disallows it. If specific rank processing is not supported, the argument parsing should reflect that, or the error message should be more precise about the limitation.

Comment on lines +62 to +63
generate_rl_timeline(data, output_path)
print("in html")

high

The print("in html") statement is likely a leftover from debugging. It should be removed in production code to avoid unnecessary console output.

Comment on lines +66 to +67
@register_cluster_visualizer("chart")
def cluster_visualizer_chart(data: pd.DataFrame, output_path: str, config: dict) -> None:

high

The print("in chart") statement is likely a leftover from debugging. It should be removed in production code to avoid unnecessary console output.

Comment on lines +114 to +115
raise ValueError(f"input_data: {input_data} is None!")


high

The error message input_data: {input_data} is None! is a bit redundant. It's clear that input_data is None from the condition. A more concise message would be Input data cannot be None.

Suggested change:
- raise ValueError(f"input_data: {input_data} is None!")
+ raise ValueError("Input data cannot be None!")

Comment on lines +160 to +167
[
{
"Start": short["Start"].min(),
"Finish": short["Finish"].max(),
"Role": short.iloc[0]["Role"],
"Rank ID": short.iloc[0]["Rank ID"],
"Name": short.iloc[0]["Name"],
"Duration": short["Finish"].max() - short["Start"].min(),

high

The merged DataFrame is created with Name and Role taken from short.iloc[0]. This assumes that all short events within a group (Role, Rank ID, Name) have the same Name and Role, which is true by the groupby key. However, the Name of the merged event should ideally reflect that it's a consolidation of multiple short events, rather than just taking the name of the first one. This could lead to confusion in the visualization if the original Name is important. Consider using a generic name like "Merged Short Events" or appending a suffix to the original name.
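One possible shape for that suggestion, sketched as a hypothetical helper: the column names follow the quoted snippet, and the "(merged xN)" suffix format is an assumption, not the reviewer's exact wording:

```python
# Collapse a group of short events (same Role / Rank ID / Name) into one
# merged record whose Name signals that it is a consolidation.
import pandas as pd


def merge_short_events(short: pd.DataFrame) -> dict:
    return {
        "Start": short["Start"].min(),
        "Finish": short["Finish"].max(),
        "Role": short.iloc[0]["Role"],
        "Rank ID": short.iloc[0]["Rank ID"],
        # Suffix the name so the visualization distinguishes merged events.
        "Name": f'{short.iloc[0]["Name"]} (merged x{len(short)})',
        "Duration": short["Finish"].max() - short["Start"].min(),
    }
```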

Comment on lines +200 to +202
return int(label.split(" - Rank ")[-1])
except Exception:
return float("inf")

high

The _extract_rank function uses float("inf") for cases where the rank cannot be extracted. While this works for sorting, it might be more robust to handle such cases explicitly, perhaps by logging a warning or assigning a default rank if the format is not as expected. This is a high-severity issue because it can lead to unexpected sorting behavior if the Y_Label format deviates from "Role - Rank ID".

@Rhetee Rhetee changed the title [tool] feature: scheduling analysis based on profiling data for nvtx [tool] feature: scheduling analysis based on profiling data for torch profiler Feb 24, 2026