
Commit 8e8deed

chore: tune vLLM rollout memory for single-node
- raise vLLM gpu_memory_utilization to 0.30 for KV cache
- lower rollout.n and cap max batched tokens for stability
- apply settings to both Megatron and FSDP single-node scripts
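A back-of-the-envelope sketch of why the bump from 0.25 to 0.30 matters for the KV cache: vLLM caps its total allocation at roughly `gpu_memory * gpu_memory_utilization`, and what remains after model weights and activations becomes KV-cache space. The 80 GiB GPU and 16 GiB model footprint below are hypothetical illustration numbers, and the formula is a simplification of vLLM's actual memory profiling.

```python
def kv_cache_budget_gib(gpu_mem_gib: float,
                        gpu_memory_utilization: float,
                        model_footprint_gib: float) -> float:
    """Simplified KV-cache budget under vLLM's memory cap:
    total allowed allocation minus the (assumed fixed) model footprint."""
    return gpu_mem_gib * gpu_memory_utilization - model_footprint_gib

# Hypothetical 80 GiB GPU with a 16 GiB weights+activations footprint.
old_budget = kv_cache_budget_gib(80.0, 0.25, 16.0)
new_budget = kv_cache_budget_gib(80.0, 0.30, 16.0)
print(f"KV cache at 0.25: {old_budget:.1f} GiB, at 0.30: {new_budget:.1f} GiB")
```

Because the footprint is fixed, a small increase in the utilization cap can enlarge the KV-cache budget disproportionately; lowering rollout.n to 2 and capping max_num_batched_tokens at 4096 then bounds how much of that cache a single step can demand.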
1 parent 56ba579 commit 8e8deed

File tree

2 files changed: +6, -4 lines


recipes_custom/RLVR_ABCDE_dense/run_grpo_fsdp_single_node.sh

Lines changed: 3 additions & 2 deletions
```diff
@@ -58,8 +58,9 @@ python3 $ENTRYPOINT --config-path=/llm-align/liuchonghan/verl_lao/verl/trainer/c
     actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
     actor_rollout_ref.rollout.name=vllm \
     actor_rollout_ref.rollout.mode=$rollout_mode \
-    actor_rollout_ref.rollout.gpu_memory_utilization=0.25 \
-    actor_rollout_ref.rollout.n=4 \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.30 \
+    actor_rollout_ref.rollout.n=2 \
+    actor_rollout_ref.rollout.max_num_batched_tokens=4096 \
     actor_rollout_ref.ref.fsdp_config.fsdp_size=$FSDP_SIZE \
     actor_rollout_ref.ref.fsdp_config.param_offload=$REF_OFFLOAD \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
```

recipes_custom/RLVR_ABCDE_dense/run_grpo_megatron_single_node.sh

Lines changed: 3 additions & 2 deletions
```diff
@@ -53,8 +53,9 @@ python3 $ENTRYPOINT --config-path=/llm-align/liuchonghan/verl_lao/verl/trainer/c
     actor_rollout_ref.rollout.tensor_model_parallel_size=$TP_SIZE \
     actor_rollout_ref.rollout.name=vllm \
     actor_rollout_ref.rollout.mode=$rollout_mode \
-    actor_rollout_ref.rollout.gpu_memory_utilization=0.25 \
-    actor_rollout_ref.rollout.n=4 \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.30 \
+    actor_rollout_ref.rollout.n=2 \
+    actor_rollout_ref.rollout.max_num_batched_tokens=4096 \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=$PP_SIZE \
     actor_rollout_ref.ref.megatron.tensor_model_parallel_size=$TP_SIZE \
```
