[tool] feature: scheduling analysis based on profiling data for torch profiler #5367
Rhetee wants to merge 7 commits into verl-project:main
Conversation
Code Review
The pull request introduces a new feature for scheduling analysis based on profiling data for NVTX, alongside existing MSTX support. It includes new parser and visualizer modules, as well as documentation for the new feature. The code is generally well-structured and includes unit tests for the new components. The documentation is clear and provides good examples. I've identified a few areas for improvement regarding error handling, logging, and consistency in the mstx_parser.py and nvtx_parser.py files.
```python
logger.warning(f"Rank {rank_id}: No rollout events found in json")
return events
```
The process_id variable is initialized to None and then assigned within a loop. If the loop completes without finding "Overlap Analysis", process_id remains None, which is then checked. This is fine, but the start_ids and end_ids are also initialized to None and then updated within a subsequent loop. If the process_id is not found, the second loop is skipped, and start_ids and end_ids will remain None, leading to the warning on line 147. This is correct behavior, but the warning on line 103 is redundant as the subsequent check for process_id is None on line 115 already covers this case. Consider removing the warning on line 103 to avoid duplicate logging for the same underlying issue.
```python
logger.warning(f"Rank {rank_id}: Overlap Analysis process not found in json")
return events
```
The warning message here is slightly misleading. If `process_id` is `None`, it means the "Overlap Analysis" entry was not found in the JSON, not necessarily that the process itself is missing. Consider rephrasing for clarity.
```diff
- logger.warning(f"Rank {rank_id}: Overlap Analysis process not found in json")
- return events
+ logger.warning(f"Rank {rank_id}: 'Overlap Analysis' entry not found in json")
+ return events
```
```python
if "ts" not in row or "dur" not in row:
    logger.warning("Row missing required fields: ts or dur. Skipping row.")
```
The args variable is checked for isinstance(args, dict) on line 123. If it's not a dict, the loop continues. However, the args variable is not used after this check within this loop. This check seems misplaced or unnecessary if args is not used. If args is intended to be used later, it should be within the scope of the if isinstance(args, dict): block.
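A minimal sketch of the restructuring this comment suggests — the row contents and field names here are hypothetical, not taken from the actual parser:

```python
# Hypothetical rows mimicking trace events; only "args" matters here.
rows = [
    {"args": {"id": 1}, "ts": 10},
    {"args": "not-a-dict", "ts": 20},
]

processed = []
for row in rows:
    args = row.get("args")
    if isinstance(args, dict):
        # Use args only inside the guarded block, so the isinstance
        # check is not dead code.
        processed.append(args.get("id"))

# processed == [1]
```

If `args` is never consumed, the cleaner fix is to drop the `isinstance` check entirely rather than scope it.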
```python
if start_ids is None or start_time_ns < start_ids:
    start_ids = start_time_ns
if end_ids is None or end_time_ns > end_ids:
    end_ids = end_time_ns
except (ValueError, TypeError) as e:
    logger.warning(f"Failed to convert time values: {e}. Row data: {row}. Skipping row.")
    continue
```
The logic for updating start_ids and end_ids can be simplified using min and max functions, which are more Pythonic and often more readable. This also helps in handling the initial None values more cleanly.
```diff
- if start_ids is None or start_time_ns < start_ids:
-     start_ids = start_time_ns
- if end_ids is None or end_time_ns > end_ids:
-     end_ids = end_time_ns
- except (ValueError, TypeError) as e:
-     logger.warning(f"Failed to convert time values: {e}. Row data: {row}. Skipping row.")
-     continue
+ start_time_ns = float(row["ts"])
+ duration_ns = float(row["dur"])
+ end_time_ns = start_time_ns + duration_ns
+ start_ids = min(start_ids, start_time_ns) if start_ids is not None else start_time_ns
+ end_ids = max(end_ids, end_time_ns) if end_ids is not None else end_time_ns
```
```python
if self._rank_list != "all":
    logger.error("RL analysis currently only supports processing all ranks")
    return []
```
The error message here states that "RL analysis currently only supports processing all ranks". However, the _rank_list attribute is already set based on the rank-list argument. If rank-list is not "all", this function will return an empty list, effectively preventing any analysis. This is a high-severity issue because it indicates a discrepancy between the intended functionality (supporting specific ranks) and the current implementation, which explicitly disallows it. If specific rank processing is not supported, the argument parsing should reflect that, or the error message should be more precise about the limitation.
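One way to make the argument parsing reflect the limitation, as the comment suggests — a sketch only; the flag name `--rank-list` and its wiring are assumptions about the CLI, not the actual verl code:

```python
import argparse

# Sketch: if only "all" is supported today, reject other values at parse
# time instead of silently returning an empty result later.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--rank-list",
    default="all",
    choices=["all"],  # widen this list once per-rank analysis is implemented
    help="Ranks to analyze; RL analysis currently supports only 'all'.",
)

args = parser.parse_args(["--rank-list", "all"])
# args.rank_list == "all"; any other value triggers an argparse error.
```

With `choices` in place, the defensive `if self._rank_list != "all"` branch becomes unreachable and the user gets an immediate, self-explanatory error.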
```python
generate_rl_timeline(data, output_path)
print("in html")
```
```python
@register_cluster_visualizer("chart")
def cluster_visualizer_chart(data: pd.DataFrame, output_path: str, config: dict) -> None:
```
```python
raise ValueError(f"input_data: {input_data} is None!")
```
The error message `input_data: {input_data} is None!` is a bit redundant: it's clear that `input_data` is `None` from the condition. A more concise message would be `Input data cannot be None`.
```diff
- raise ValueError(f"input_data: {input_data} is None!")
+ raise ValueError("Input data cannot be None!")
```
```python
[
    {
        "Start": short["Start"].min(),
        "Finish": short["Finish"].max(),
        "Role": short.iloc[0]["Role"],
        "Rank ID": short.iloc[0]["Rank ID"],
        "Name": short.iloc[0]["Name"],
        "Duration": short["Finish"].max() - short["Start"].min(),
```
The merged DataFrame is created with Name and Role taken from short.iloc[0]. This assumes that all short events within a group (Role, Rank ID, Name) have the same Name and Role, which is true by the groupby key. However, the Name of the merged event should ideally reflect that it's a consolidation of multiple short events, rather than just taking the name of the first one. This could lead to confusion in the visualization if the original Name is important. Consider using a generic name like "Merged Short Events" or appending a suffix to the original name.
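A sketch of the suggested rename — column names follow the quoted snippet, but the sample data and the exact suffix format are made up for illustration:

```python
import pandas as pd

# Two short events in one (Role, Rank ID, Name) group.
short = pd.DataFrame(
    {
        "Start": [0.0, 1.0],
        "Finish": [0.5, 1.5],
        "Role": ["actor", "actor"],
        "Rank ID": [0, 0],
        "Name": ["gen", "gen"],
    }
)

merged = {
    "Start": short["Start"].min(),
    "Finish": short["Finish"].max(),
    "Role": short.iloc[0]["Role"],
    "Rank ID": short.iloc[0]["Rank ID"],
    # Append a suffix instead of silently reusing the first event's name,
    # so the consolidated bar is distinguishable in the timeline.
    "Name": f"{short.iloc[0]['Name']} (merged x{len(short)})",
    "Duration": short["Finish"].max() - short["Start"].min(),
}
# merged["Name"] == "gen (merged x2)"
```

Keeping the original name as a prefix preserves the ability to trace the merged bar back to its source events.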
```python
    return int(label.split(" - Rank ")[-1])
except Exception:
    return float("inf")
```
The _extract_rank function uses float("inf") for cases where the rank cannot be extracted. While this works for sorting, it might be more robust to handle such cases explicitly, perhaps by logging a warning or assigning a default rank if the format is not as expected. This is a high-severity issue because it can lead to unexpected sorting behavior if the Y_Label format deviates from "Role - Rank ID".
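One stricter variant along the lines the comment suggests — a hypothetical sketch, not the actual verl implementation; the `print` stands in for the module's `logger.warning`:

```python
import re

# Match only labels of the exact expected "Role - Rank <n>" shape.
RANK_RE = re.compile(r" - Rank (\d+)$")

def extract_rank(label: str) -> float:
    match = RANK_RE.search(label)
    if match is None:
        # In the real module this would be logger.warning(...).
        print(f"Unexpected Y_Label format: {label!r}; sorting to end")
        return float("inf")  # explicit sort-to-end sentinel
    return int(match.group(1))

labels = ["actor - Rank 2", "rollout - Rank 0", "malformed label"]
ordered = sorted(labels, key=extract_rank)
# ordered == ["rollout - Rank 0", "actor - Rank 2", "malformed label"]
```

The regex replaces the blanket `except Exception`, and the warning makes a deviating `Y_Label` format visible instead of silently reordering the chart.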
What does this PR do?
This PR relies on #5248 and will be modified once it has been merged.
This PR adds the torch_parser for Cluster Analysis.
The torch_parser.py module is designed to parse and process PyTorch Profiler data for cluster analysis. By inheriting from BaseClusterParser, the torch_parser defines the overall workflow for data processing: allocate_prof_data() and parse_analysis_data().
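The described workflow can be sketched roughly as follows — a hypothetical skeleton: only the method names `allocate_prof_data()` and `parse_analysis_data()` come from this PR; the base-class shape, signatures, and sample paths are assumptions:

```python
from abc import ABC, abstractmethod

class BaseClusterParser(ABC):
    """Hypothetical base defining the two-phase processing workflow."""

    @abstractmethod
    def allocate_prof_data(self, prof_dir: str) -> dict:
        """Map each rank to its profiling artifacts."""

    @abstractmethod
    def parse_analysis_data(self, allocated: dict) -> list:
        """Turn raw profiler data into analysis records."""

    def run(self, prof_dir: str) -> list:
        # Phase 1 (allocation) feeds phase 2 (parsing).
        return self.parse_analysis_data(self.allocate_prof_data(prof_dir))

class TorchParser(BaseClusterParser):
    def allocate_prof_data(self, prof_dir: str) -> dict:
        # Real code would scan prof_dir for per-rank trace files.
        return {0: f"{prof_dir}/rank0_trace.json"}

    def parse_analysis_data(self, allocated: dict) -> list:
        # Real code would load each trace and extract timeline events.
        return [{"rank": rank, "trace": path} for rank, path in allocated.items()]

records = TorchParser().run("/tmp/prof")
```

The point of the split is that other profiler backends (e.g. the existing MSTX/NVTX parsers) can reuse the same driving workflow while overriding only the two phases.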
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI):
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If the PR introduces breaking changes, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
Due to time dilation introduced during profiling data collection, the intervals between stages in the visualization appear longer than in the actual execution. However, the duration of each individual stage remains accurate.
The following results compare the execution time of two steps with profiling enabled and disabled during data collection.
| Role | Step with profiling | Step w/o profiling |
|---|---|---|
| timing_s/update_actor | 12.36 | 6.99 |
| timing_s/old_log_prob | 3.24 | 1.99 |
| timing_s/gen | 37.85 | 5.38 |
| time_per_step | 62.70 | 24.03 |
API and Usage Example
Input Requirements (Default as torch_profiler in verl):
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` before submitting.
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.