Commit 5f7c345

[perf] fix: the overwritten of Torch_profile with multi steps. (#5395)
### What does this PR do?

> When using PyTorch Profiler for multi-step performance analysis, the profile files generated by different steps overwrite each other in discrete mode, so only the analysis data from the last step is retained. This PR fixes the bug by adding a timestamp (and the process ID) to the file name. Issue: #5387.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

Effect:

<img width="326" height="99" alt="torch_profile_pic3" src="https://github.com/user-attachments/assets/7524f1dd-82ac-45b7-82fc-50a2b987591b" />

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
from datetime import datetime, timezone

# Add a timestamp to save_file_name
current_time = datetime.now(tz=timezone.utc).astimezone()
# "%f" gives microseconds; [:-3] trims the value to milliseconds
timestamp = current_time.strftime("%Y%m%d%H%M%S%f")[:-3]
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.
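
The timestamp format in the usage example can be checked in isolation. The sketch below is illustrative only: it uses a fixed, made-up datetime (an assumption for reproducibility, not a value from this PR) to show that `%f` expands to six microsecond digits and that dropping the last three characters leaves millisecond precision.

```python
from datetime import datetime, timezone

# Hypothetical fixed instant, chosen only so the output is reproducible.
fixed_time = datetime(2025, 1, 2, 3, 4, 5, 678901, tzinfo=timezone.utc)

# "%f" yields six microsecond digits; [:-3] trims them to milliseconds.
full = fixed_time.strftime("%Y%m%d%H%M%S%f")
timestamp = full[:-3]

print(full)       # 20250102030405678901
print(timestamp)  # 20250102030405678
```

Because the fields are zero-padded and ordered from year down to milliseconds, file names built from this string sort chronologically within one time zone.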
### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [x] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.

1 parent b5979db commit 5f7c345

1 file changed: +6 -1 lines changed

verl/utils/profiler/torch_profile.py

Lines changed: 6 additions & 1 deletion
@@ -14,6 +14,7 @@
 
 import functools
 import os
+from datetime import datetime, timezone
 from typing import Callable, Optional
 
 import torch
@@ -34,7 +35,11 @@ def get_torch_profiler(
 
     os.makedirs(save_path, exist_ok=True)
 
-    save_file_name = f"prof_rank-{rank}.json.gz"
+    current_time = datetime.now(tz=timezone.utc).astimezone()
+    timestamp = current_time.strftime("%Y%m%d%H%M%S%f")[:-3]
+    pid = os.getpid()
+
+    save_file_name = f"prof_rank-{rank}_{pid}_{timestamp}.json.gz"
     if save_file_prefix:
         save_file_name = f"{save_file_prefix}_{save_file_name}"
     save_path = os.path.join(save_path, save_file_name)
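
To see why the new name avoids overwrites across steps, the naming logic from the diff can be isolated into a small function. This is a minimal sketch, not code from the repository: `build_profile_file_name`, its `now` and `pid` parameters, the `"actor"` prefix, and the sample timestamps are all hypothetical and exist only to make the result deterministic.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_profile_file_name(
    rank: int,
    save_file_prefix: Optional[str] = None,
    now: Optional[datetime] = None,
    pid: Optional[int] = None,
) -> str:
    """Hypothetical helper mirroring the naming logic in the diff."""
    # Default to the current local time and process ID, as the diff does.
    if now is None:
        now = datetime.now(tz=timezone.utc).astimezone()
    if pid is None:
        pid = os.getpid()

    # Millisecond-precision timestamp: trim the last three microsecond digits.
    timestamp = now.strftime("%Y%m%d%H%M%S%f")[:-3]

    name = f"prof_rank-{rank}_{pid}_{timestamp}.json.gz"
    if save_file_prefix:
        name = f"{save_file_prefix}_{name}"
    return name


# Two made-up profiling steps one second apart (illustrative values only).
step_1 = datetime(2025, 1, 2, 3, 4, 5, 678901, tzinfo=timezone.utc)
step_2 = step_1 + timedelta(seconds=1)

name_1 = build_profile_file_name(0, "actor", now=step_1, pid=1234)
name_2 = build_profile_file_name(0, "actor", now=step_2, pid=1234)

print(name_1)  # actor_prof_rank-0_1234_20250102030405678.json.gz
print(name_2)  # actor_prof_rank-0_1234_20250102030406678.json.gz
```

With the old `prof_rank-{rank}.json.gz` pattern, both steps would have mapped to the same file. Note that two calls from the same rank and process within the same millisecond would still produce identical names under this scheme.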
