Commit 5af5eed

Merge branch 'main' into feat/chunked_weight_update

Signed-off-by: jianjunzhong <jianjunzhong@foxmail.com>
2 parents: ec908ea + e3b187a

77 files changed: +538 additions, −2396 deletions

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions

@@ -20,6 +20,7 @@
 /verl/workers/actor/megatron_actor.py @ISEEKYAN @vermouth1992
 /verl/workers/critic/megatron_critic.py @ISEEKYAN @vermouth1992
 /verl/workers/megatron_workers.py @ISEEKYAN @vermouth1992
+/verl/experimental @wuxibin89 @ArronHZG
 
 /tests/single_controller @zw0610 @wuxibin89
 /tests/trainer @eric-haibin-lin @vermouth1992 @tongyx361 @PeterSH6

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@
 
 - [ ] Search for similar PRs. Paste at least one query link here: ...
 - [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
-  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
+  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
 - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
 - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
 - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.

.github/workflows/e2e_ascend.yml

Lines changed: 4 additions & 0 deletions

@@ -126,6 +126,10 @@ jobs:
           ray stop --force
           export PYTHONPATH=$PYTHONPATH:/Megatron-LM
           USE_DIST_CKPT=True USE_DUMMY_MODEL=True DUMMY_MODEL_CONFIG_PATH=tests/special_e2e/ppo_trainer/expert_parallel/qwen3moe_minimal.json DUMMY_MODEL_PATH=$HOME/dist_ckpt/qwen3_30b_grpo_mindspeed bash tests/special_npu/run_qwen3_30b_grpo_mindspeed.sh
+      - name: Running the E2E test with fully_async_policy algorithm (FSDP2)
+        run: |
+          ray stop --force
+          bash tests/special_npu/run_fully_async_policy.sh
 
   vlm_rl_job:
     if: github.repository_owner == 'verl-project'

.github/workflows/e2e_one_step_off_policy_ascend.yml

Lines changed: 3 additions & 3 deletions

@@ -68,7 +68,7 @@ on:
       # Entrypoints
       - ".github/workflows/e2e_one_step_off_policy_ascend.yml"
       - "examples/data_preprocess/gsm8k.py"
-      - "tests/special_e2e/run_one_step_off_policy.sh"
+      - "tests/special_npu/run_one_step_off_policy.sh"
 
 # Cancel jobs on the same ref if a new one is triggered
 concurrency:
@@ -122,7 +122,7 @@ jobs:
       - name: Running the E2E test with one_step_off_policy algorithm (FSDP2)
         run: |
           ray stop --force
-          bash tests/special_e2e/run_one_step_off_policy.sh
+          bash tests/special_npu/run_one_step_off_policy.sh
 
   # Test Megatron strategy
   e2e_one_step_off_policy_megatron_ascend:
@@ -167,4 +167,4 @@ jobs:
         run: |
           ray stop --force
           export PYTHONPATH=$PYTHONPATH:/Megatron-LM
-          bash tests/special_e2e/run_one_step_off_policy.sh
+          bash tests/special_npu/run_one_step_off_policy.sh

.gitignore

Lines changed: 2 additions & 0 deletions

@@ -8,6 +8,8 @@
 **/playground
 **/wandb
 
+/pyrightconfig.json
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]

docker/Dockerfile.stable.vllm

Lines changed: 3 additions & 0 deletions

@@ -32,6 +32,9 @@ RUN pip install torch==2.9.1 torchvision torchaudio --index-url https://download
 RUN sed -i '/nvidia-cudnn-cu12/d' /usr/local/lib/python3.12/dist-packages/torch-2.9.1+cu129.dist-info/METADATA
 RUN pip install --no-deps --force-reinstall nvidia-cudnn-cu12==9.16.0.29
 
+# NOTE: This installs the `vllm` source code in `/vllm`.
+# This might break the (based)pyright type checking. To fix it, add `/vllm` to `extraPaths` in `pyrightconfig.json`.
+# c.f. https://docs.basedpyright.com/latest/configuration/config-files/
 RUN git clone --depth 1 -b v0.12.0 https://github.com/vllm-project/vllm.git && \
     cd vllm && \
     find requirements -name "*.txt" -print0 | xargs -0 sed -i '/torch/d' && \
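The NOTE in this Dockerfile points at pyright's `extraPaths` option. A minimal `pyrightconfig.json` sketch along those lines (only `extraPaths` comes from the note; placing the file at the repository root is an assumption, per the `.gitignore` entry above):

```json
{
  "extraPaths": ["/vllm"]
}
```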

docs/advance/fully_async.md

Lines changed: 0 additions & 24 deletions

@@ -106,9 +106,6 @@ https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_rev
 | `async_training.trigger_parameter_sync_step` | Indicates how many local updates FullyAsyncTrainer performs before a parameter synchronization |
 | `async_training.staleness_threshold` | Freshness control |
 | `async_training.partial_rollout` | Whether to perform partial_rollout |
-| `async_training.checkpoint_engine.enable` | Whether to use checkpoint_engine for accelerating, default `True` |
-| `async_training.checkpoint_engine.overlap_broadcast_and_consume` | When using checkpoint_engine, whether to overlap broadcast and load_weights, default `False` |
-| `async_training.checkpoint_engine.device_buffer_size_M` | When using checkpoint_engine, the user-specified bucket size (MB), default `4096` |
 | `async_training.use_trainer_do_validate` | Whether to use the trainer node for the validation process, default `False` |
 
 **Further Explanation:**
@@ -182,27 +179,6 @@ https://github.com/ArronHZG/verl-community/blob/main/docs/fully_async_policy_rev
 mode d
 (async stream pipeline with partial rollout), our implementation approximates `Areal's Decoupled PPO`.
 
-* `async_training.checkpoint_engine.enable`
-
-  Enabling the checkpoint engine generally reduces synchronization time overhead by more than 60% compared to
-  the original per-tensor parameter synchronization method. However, assembling buckets incurs additional
-  temporary GPU memory overhead.
-
-* `async_training.checkpoint_engine.overlap_broadcast_and_consume`
-
-  Enabling the pipeline between the broadcast and load_weights phases will allocate additional GPU memory.
-  Since the main time consumption for parameter synchronization is not in the broadcast and load_weights phases,
-  but in the parameter generation phase (by Megatron or FSDP), this option is off by default.
-
-* `async_training.checkpoint_engine.device_buffer_size_M`
-
-  It controls the size of the memory buffer used for synchronization when the checkpoint engine is enabled.
-  The actual `bucket_size` = `max(device_buffer_size_M, maximum parameter tensor size)`.
-  * When `overlap_broadcast_and_consume` is enabled, the additional device memory overhead of a
-    trainer rank is `3 * bucket_size` and of a rollout rank is `2 * bucket_size`
-  * When `overlap_broadcast_and_consume` is disabled, the additional device memory overhead of a
-    trainer rank is `2 * bucket_size` and of a rollout rank is `1 * bucket_size`
 * `async_training.use_trainer_do_validate`
 
   It controls whether to use the trainer's `do_validate` method for validation.
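The memory accounting in the deleted `device_buffer_size_M` section reduces to a few lines of arithmetic. A hypothetical helper (the function name is made up; the formulas are the ones quoted in the removed lines: `bucket_size = max(device_buffer_size_M, maximum parameter tensor size)`, with 3x/2x trainer and 2x/1x rollout multipliers depending on `overlap_broadcast_and_consume`):

```python
def sync_memory_overhead_mb(device_buffer_size_m: int,
                            max_tensor_size_m: int,
                            overlap_broadcast_and_consume: bool) -> tuple[int, int]:
    """Return (trainer_overhead_mb, rollout_overhead_mb) per the removed doc's formulas."""
    # The effective bucket never shrinks below the largest single parameter tensor.
    bucket_size = max(device_buffer_size_m, max_tensor_size_m)
    if overlap_broadcast_and_consume:
        # Overlapping broadcast and load_weights needs one extra buffer per side.
        return 3 * bucket_size, 2 * bucket_size
    return 2 * bucket_size, 1 * bucket_size

# Default 4096 MB buffer, a 1024 MB largest tensor, no overlap:
print(sync_memory_overhead_mb(4096, 1024, False))  # (8192, 4096)
```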

requirements-npu.txt

Lines changed: 1 addition & 1 deletion

@@ -18,4 +18,4 @@ torchdata
 einops
 qwen_vl_utils
 hf_transfer
-triton-ascend==3.2.0rc4
+triton-ascend==3.2.0

tests/checkpoint_engine/test_correctness_on_gpu.py

Lines changed: 6 additions & 2 deletions

@@ -23,12 +23,14 @@
     split_resource_pool,
 )
 from verl.utils.device import get_device_name
+from verl.utils.ray_utils import auto_await
 from verl.workers.config import CheckpointEngineConfig, HFModelConfig, RolloutConfig
 
 
 @pytest.mark.asyncio
 @pytest.mark.parametrize("rebuild_group", [False, True])
 @pytest.mark.parametrize("num_trainer, num_rollout", [(2, 6)])
+@auto_await
 async def test_nccl_checkpoint_engine(
     rebuild_group,
     num_trainer,
@@ -65,7 +67,7 @@ async def test_nccl_checkpoint_engine(
     rollout, replicas = await create_rollout_worker_group(rollout_pool, model_config, rollout_config, check_allclose)
 
     # create checkpoint engine manager
-    checkpoint_manager = CheckpointEngineManager(backend="nccl", trainer=trainer, replicas=replicas)
+    checkpoint_manager = CheckpointEngineManager(config=checkpoint_engine_config, trainer=trainer, replicas=replicas)
     for _ in range(3):
         await checkpoint_manager.update_weights()
         rollout.check_weights()
@@ -77,6 +79,7 @@
 @pytest.mark.asyncio
 @pytest.mark.parametrize("device", ["cuda", "cpu"])
 @pytest.mark.parametrize("num_trainer, num_rollout", [(2, 6)])
+@auto_await
 async def test_nixl_checkpoint_engine(
     num_trainer,
     num_rollout,
@@ -120,7 +123,7 @@ async def test_nixl_checkpoint_engine(
     rollout, replicas = await create_rollout_worker_group(rollout_pool, model_config, rollout_config, check_allclose)
 
     # create checkpoint engine manager
-    checkpoint_manager = CheckpointEngineManager(backend="nixl", trainer=trainer, replicas=replicas)
+    checkpoint_manager = CheckpointEngineManager(config=checkpoint_engine_config, trainer=trainer, replicas=replicas)
     for _ in range(3):
         await checkpoint_manager.update_weights()
         rollout.check_weights()
@@ -132,6 +135,7 @@
 @pytest.mark.asyncio
 @pytest.mark.parametrize("rebuild_group", [False])
 @pytest.mark.parametrize("num_trainer, num_rollout", [(2, 6)])
+@auto_await
 async def test_kimi_checkpoint_engine(
     rebuild_group,
     num_trainer,
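These tests gain an `@auto_await` decorator from `verl.utils.ray_utils`. Its actual semantics live in verl; as a rough illustration of the general pattern only (not verl's implementation), a decorator like this lets a coroutine function run to completion when called from synchronous code, while handing the coroutine back unchanged when an event loop is already running:

```python
import asyncio
import functools

def auto_await_sketch(fn):
    """Illustrative only: run a coroutine function synchronously when no loop is active."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        coro = fn(*args, **kwargs)
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            # No running event loop: drive the coroutine to completion ourselves.
            return asyncio.run(coro)
        # Inside a loop (e.g. under pytest-asyncio): return the coroutine to be awaited.
        return coro
    return wrapper

@auto_await_sketch
async def double(x):
    return 2 * x

print(double(21))  # 42 when called with no event loop running
```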

tests/checkpoint_engine/test_correctness_on_npu.py

Lines changed: 3 additions & 1 deletion

@@ -23,12 +23,14 @@
     split_resource_pool,
 )
 from verl.utils.device import get_device_name
+from verl.utils.ray_utils import auto_await
 from verl.workers.config import CheckpointEngineConfig, HFModelConfig, RolloutConfig
 
 
 @pytest.mark.asyncio
 @pytest.mark.parametrize("rebuild_group", [False])
 @pytest.mark.parametrize("num_trainer, num_rollout", [(2, 6)])
+@auto_await
 async def test_hccl_checkpoint_engine(
     rebuild_group,
     num_trainer,
@@ -66,7 +68,7 @@ async def test_hccl_checkpoint_engine(
     rollout, replicas = await create_rollout_worker_group(rollout_pool, model_config, rollout_config, check_allclose)
 
     # create checkpoint engine manager
-    checkpoint_manager = CheckpointEngineManager(backend="hccl", trainer=trainer, replicas=replicas)
+    checkpoint_manager = CheckpointEngineManager(config=checkpoint_engine_config, trainer=trainer, replicas=replicas)
     for _ in range(3):
         await checkpoint_manager.update_weights()
         rollout.check_weights()
