71 commits
32ff542
chore: point recipe submodule to fork
khazic Feb 2, 2026
27e354b
feat: add custom Qwen3-30BA3B translate recipe
khazic Feb 2, 2026
3b2a456
Add RLVR_ABCDE_dense scripts
khazic Feb 3, 2026
d48f175
Merge branch 'verl-project:main' into main
khazic Feb 3, 2026
d0db924
Merge branch 'verl-project:main' into main
khazic Feb 4, 2026
8ef1cb8
chore: adjust GRPO launch scripts and trainer defaults
khazic Feb 5, 2026
3c3288c
feat: add single-node Megatron GRPO launcher
khazic Feb 5, 2026
c79bebe
chore: run single-node GRPO in W&B offline mode
khazic Feb 5, 2026
56ba579
chore: lower single-node GRPO memory footprint
khazic Feb 5, 2026
8e8deed
chore: tune vLLM rollout memory for single-node
khazic Feb 5, 2026
cfafe22
Update GRPO scripts for 4-node Ray
khazic Feb 6, 2026
4b005cb
Use Ray address for existing cluster
khazic Feb 6, 2026
787a9eb
Add Ray runtime_env for code import
khazic Feb 6, 2026
4f360e1
Set socket IFNAME and increase batch size
khazic Feb 6, 2026
5acfc87
Propagate env to Ray workers and adjust batch
khazic Feb 6, 2026
d41a157
Quote MASTER_PORT in Ray runtime env
khazic Feb 6, 2026
6fa835f
Add WANDB proxy env vars to RLVR scripts
khazic Feb 6, 2026
069746f
Remove WANDB proxy envs and keep offline mode
khazic Feb 6, 2026
afa0f41
Enable WANDB logging via proxy in RLVR scripts
khazic Feb 6, 2026
042c4c8
Increase max prompt length to 2048
khazic Feb 6, 2026
10df219
Merge branch 'verl-project:main' into main
khazic Feb 6, 2026
232e77a
Tune RLVR GRPO configs for LR decay and larger rollout batches
khazic Feb 6, 2026
2e10ab5
Align FSDP and Megatron rollout settings
khazic Feb 6, 2026
070f9f6
Lower vLLM GPU memory utilization to 0.35
khazic Feb 6, 2026
a9d0407
Reduce rollout memory pressure while keeping n=16
khazic Feb 6, 2026
8660497
Merge branch 'verl-project:main' into main
khazic Feb 9, 2026
06bf174
Align FSDP GRPO config and add Qwen3 recipes
khazic Feb 10, 2026
888ece5
Fix FSDP min_lr override
khazic Feb 10, 2026
470995b
Fix FSDP lr_decay_style override
khazic Feb 10, 2026
3de3493
Align FSDP Ray settings with Megatron
khazic Feb 10, 2026
2315104
Unset Ray socket env vars before launch
khazic Feb 10, 2026
ed816dc
RLVR_ABCDE_dense: align FSDP/Megatron ablation configs and multi-node checkpoint paths
khazic Feb 10, 2026
e562bfb
RLVR: point ray.init _temp_dir at RAY_TMPDIR to avoid /tmp disk quota exhaustion
khazic Feb 10, 2026
21c9dd8
Fix ray address and master port in launch scripts
khazic Feb 10, 2026
7ab6ed6
Fix Ray env var types for master port
khazic Feb 10, 2026
fd018b6
Update RLVR launch scripts
khazic Feb 10, 2026
52fc39a
Set explicit Ray address for FSDP launch
khazic Feb 10, 2026
9b26af5
Update FSDP Ray head address
khazic Feb 10, 2026
11a2cd1
Shorten Ray temp and working dir paths
khazic Feb 10, 2026
34ad838
Avoid Ray working_dir packaging to shorten IPC paths
khazic Feb 10, 2026
b25034e
Use user-owned short paths for Ray temp and work dirs
khazic Feb 10, 2026
fff6f09
Move Ray temp and TMPDIR to /dev/shm
khazic Feb 10, 2026
568690f
Pass TMPDIR to Ray runtime env
khazic Feb 10, 2026
d736fa3
Set WANDB_DIR to shared path for Ray workers
khazic Feb 10, 2026
a7d7460
Disable Gloo IPv6 in RLVR launch scripts
khazic Feb 10, 2026
1e9e40b
Ensure GLOO_IPV6 is passed as string
khazic Feb 10, 2026
fb012a9
Quote GLOO_IPV6 for Ray runtime env
khazic Feb 10, 2026
23098c0
Fix FSDP optimizer overrides
khazic Feb 10, 2026
603824d
Fix Hydra overrides for FSDP optimizer
khazic Feb 10, 2026
b8fba05
Pass WANDB_API_KEY to Ray runtime env
khazic Feb 10, 2026
253fe3f
Add JSON-to-parquet converter for VERL SFT
khazic Feb 10, 2026
4b96b3b
Tune FSDP rollout weight-sync bucket
khazic Feb 10, 2026
2be47a4
Propagate proxy and tmp dirs to Ray env
khazic Feb 10, 2026
093ed14
Fix SFT Megatron lr scheduler steps
khazic Feb 10, 2026
cba9e5e
Add NO_PROXY for internal traffic
khazic Feb 10, 2026
3cee17d
Quote NO_PROXY for Hydra overrides
khazic Feb 10, 2026
70616b2
Force proxy env vars for Ray workers
khazic Feb 11, 2026
2cc92e9
recipes: drop ALL_PROXY from GRPO scripts
khazic Feb 11, 2026
2105fb4
debug
khazic Feb 11, 2026
7e5931c
Merge branch 'verl-project:main' into main
khazic Feb 11, 2026
86c5529
recipes: disable proxy and use wandb offline for GRPO
khazic Feb 11, 2026
8908e25
k
khazic Feb 11, 2026
435467f
recipes: set FSDP MASTER_ADDR default
khazic Feb 11, 2026
179feec
Merge branch 'verl-project:main' into main
khazic Feb 11, 2026
f73636c
Merge branch 'verl-project:main' into main
khazic Feb 24, 2026
b325cc9
Merge branch 'verl-project:main' into main
khazic Feb 26, 2026
6e4ce3d
chore: update custom training recipes
khazic Feb 26, 2026
c3890f8
chore: update grpo single node script
khazic Feb 26, 2026
0b29fd7
Merge branch 'verl-project:main' into main
khazic Feb 26, 2026
529e576
chore: clean formatting in qwen2.5 72b sft run script
khazic Feb 27, 2026
ae13dde
Merge branch 'verl-project:main' into main
khazic Feb 27, 2026
2 changes: 1 addition & 1 deletion .gitmodules
@@ -1,3 +1,3 @@
[submodule "recipe"]
path = recipe
url = https://github.com/verl-project/verl-recipe.git
url = https://github.com/khazic/verl-recipe_lao.git
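Editing the URL in `.gitmodules` does not by itself repoint clones that already have the `recipe` submodule checked out; a sync step is needed. A minimal sketch of the follow-up commands (assuming an existing clone of the superproject):

```shell
# Copy the new URL from .gitmodules into .git/config, then fetch the
# fork and check out the commit pinned by the superproject.
git submodule sync recipe
git submodule update --init recipe
```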
2 changes: 1 addition & 1 deletion recipe
Submodule recipe updated 129 files
110 changes: 110 additions & 0 deletions recipes_custom/Qwen2.5-72B-sft/run_sft_qwen2.5_72b_megatron_dlc.sh
@@ -0,0 +1,110 @@
#!/usr/bin/env bash
set -xeuo pipefail

ENTRYPOINT=${ENTRYPOINT:-"-m verl.trainer.sft_trainer"}
TRAIN_FILES=${TRAIN_FILES:-/mnt/data/liuchonghan/235b_dataset/merged_sft_with_messages.parquet}
TRAIN_BATCH_SIZE=${TRAIN_BATCH_SIZE:-256}
backend=${BACKEND:-megatron}
project_name=verl_sft_qwen2.5_72b
RESUME_MODE=disable # auto
MODEL_ID=${MODEL_ID:-/mnt/data/liuchonghan/Qwen2.5-72B-A064}
TOTAL_EPOCHS=${TOTAL_EPOCHS:-2}

SP_SIZE=${SP_SIZE:-1}
FSDP_SIZE=${FSDP_SIZE:-64}
FSDP_STRATEGY=${FSDP_STRATEGY:-"fsdp2"}

TP_SIZE=${TP_SIZE:-8}
PP_SIZE=${PP_SIZE:-1}
CP_SIZE=${CP_SIZE:-1}

PAD_MODE=${PAD_MODE:-no_padding}
USE_REMOVE_PADDING=${USE_REMOVE_PADDING:-True}

FSDP_ENGINE_CONFIG="
engine=${backend} \
optim=${backend} \
optim.lr=5e-6 \
optim.lr_warmup_steps_ratio=0.05 \
optim.weight_decay=0.1 \
optim.betas="[0.9,0.95]" \
optim.clip_grad=1.0 \
optim.min_lr_ratio=0.1 \
optim.warmup_style=cosine \
engine.ulysses_sequence_parallel_size=${SP_SIZE} \
engine.strategy=${FSDP_STRATEGY} \
engine.fsdp_size=${FSDP_SIZE}"

MEGATRON_ENGINE_CONFIG="
engine=${backend} \
optim=${backend} \
optim.lr=6e-6 \
optim.lr_warmup_steps_ratio=0.05 \
optim.weight_decay=0.1 \
optim.betas="[0.9,0.95]" \
optim.clip_grad=1.0 \
optim.lr_warmup_init=0 \
optim.lr_decay_style=cosine \
optim.min_lr=6e-7 \
engine.tensor_model_parallel_size=${TP_SIZE} \
engine.pipeline_model_parallel_size=${PP_SIZE} \
engine.context_parallel_size=${CP_SIZE}"

if [ "$backend" = "fsdp" ]; then
    ENGINE_CONFIG="$FSDP_ENGINE_CONFIG"
    echo "Using fsdp engine"
    exp_name=qwen2.5-72b-dense-${backend}-${FSDP_STRATEGY}-sp${SP_SIZE}
else
    ENGINE_CONFIG="$MEGATRON_ENGINE_CONFIG"
    echo "Using megatron engine"
    exp_name=qwen2.5-72b-dense-${backend}-tp${TP_SIZE}-pp${PP_SIZE}-cp${CP_SIZE}
fi

CKPT_HOME=${CKPT_HOME:-/mnt/data/liuchonghan/ckpt_verl/sft/${project_name}/${exp_name}}
NNODES=${WORLD_SIZE:-16}
NODE_RANK=${RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
MASTER_PORT=${MASTER_PORT:-23457}

echo ">>> Node info: RANK $NODE_RANK / WORLD_SIZE $NNODES"
echo ">>> Comm info: MASTER $MASTER_ADDR : $MASTER_PORT"

if [ "$NODE_RANK" -eq 0 ]; then
    mkdir -p "${CKPT_HOME}"
fi

export WANDB_MODE=offline
export NCCL_DEBUG=WARN
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export PYTHONPATH=${PYTHONPATH:-}:/mnt/data/liuchonghan/verl_lao

torchrun \
--nnodes=${NNODES} \
--node_rank=${NODE_RANK} \
--master_addr=${MASTER_ADDR} \
--master_port=${MASTER_PORT} \
--nproc-per-node=8 \
${ENTRYPOINT} \
data.train_files="${TRAIN_FILES}" \
data.train_batch_size=${TRAIN_BATCH_SIZE} \
data.max_length=2048 \
data.pad_mode=${PAD_MODE} \
data.truncation=right \
data.use_dynamic_bsz=True \
data.max_token_len_per_gpu=4096 \
data.messages_key=messages \
data.ignore_input_ids_mismatch=True \
model.path=$MODEL_ID \
model.use_remove_padding=${USE_REMOVE_PADDING} \
model.enable_gradient_checkpointing=True \
${ENGINE_CONFIG} \
trainer.test_freq=-1 \
trainer.save_freq=2000 \
'trainer.logger=[console]' \
trainer.project_name="${project_name}" \
trainer.experiment_name="${exp_name}" \
trainer.total_epochs=${TOTAL_EPOCHS} \
trainer.default_local_dir="${CKPT_HOME}" \
trainer.resume_mode=${RESUME_MODE} \
trainer.max_ckpt_to_keep=2 \
'checkpoint.save_contents=[model,optimizer,extra,hf_model]'
@@ -0,0 +1,115 @@
#!/usr/bin/env bash
set -xeuo pipefail

ENTRYPOINT=${ENTRYPOINT:-"-m verl.trainer.sft_trainer"}
TRAIN_FILES=${TRAIN_FILES:-/mnt/data/liuchonghan/235b_dataset/merged_sft_with_messages.parquet}
TRAIN_BATCH_SIZE=${TRAIN_BATCH_SIZE:-256}
backend=${BACKEND:-megatron}
project_name=verl_sft_235ba22b_2507
RESUME_MODE=disable
MODEL_ID=${MODEL_ID:-/mnt/data/liuchonghan/Qwen3-235B-A22B-Instruct-2507}
TOTAL_EPOCHS=${TOTAL_EPOCHS:-2}

SP_SIZE=${SP_SIZE:-1}
FSDP_SIZE=${FSDP_SIZE:-64}
FSDP_STRATEGY=${FSDP_STRATEGY:-"fsdp2"}

TP_SIZE=${TP_SIZE:-4}
PP_SIZE=${PP_SIZE:-1}
EP_SIZE=${EP_SIZE:-8}
VPP_SIZE=${VPP_SIZE:-null}
CP_SIZE=${CP_SIZE:-1}

PAD_MODE=${PAD_MODE:-no_padding}
USE_REMOVE_PADDING=${USE_REMOVE_PADDING:-True}

FSDP_ENGINE_CONFIG="
engine=${backend} \
optim=${backend} \
optim.lr=5e-6 \
optim.lr_warmup_steps_ratio=0.05 \
optim.weight_decay=0.1 \
optim.betas="[0.9,0.95]" \
optim.clip_grad=1.0 \
optim.min_lr_ratio=0.1 \
optim.warmup_style=cosine \
engine.ulysses_sequence_parallel_size=${SP_SIZE} \
engine.strategy=${FSDP_STRATEGY} \
engine.fsdp_size=${FSDP_SIZE}"

MEGATRON_ENGINE_CONFIG="
engine=${backend} \
optim=${backend} \
optim.lr=6e-6 \
optim.lr_warmup_steps_ratio=0.05 \
optim.weight_decay=0.1 \
optim.betas="[0.9,0.95]" \
optim.clip_grad=1.0 \
optim.lr_warmup_init=0 \
optim.lr_decay_style=cosine \
optim.min_lr=6e-7 \
engine.tensor_model_parallel_size=${TP_SIZE} \
engine.pipeline_model_parallel_size=${PP_SIZE} \
engine.expert_model_parallel_size=${EP_SIZE} \
engine.context_parallel_size=${CP_SIZE} \
engine.use_mbridge=True"

if [ "$backend" = "fsdp" ]; then
    ENGINE_CONFIG="$FSDP_ENGINE_CONFIG"
    echo "Using fsdp engine"
    exp_name=nvidia-qwen3-235b-a22b-moe-${backend}-${FSDP_STRATEGY}-sp${SP_SIZE}
else
    ENGINE_CONFIG="$MEGATRON_ENGINE_CONFIG"
    echo "Using megatron engine"
    exp_name=nvidia-qwen3-235b-a22b-moe-${backend}-tp${TP_SIZE}-pp${PP_SIZE}-ep${EP_SIZE}-vpp${VPP_SIZE}-cp${CP_SIZE}
fi

CKPT_HOME=${CKPT_HOME:-/mnt/data/liuchonghan/ckpt_verl/sft/${project_name}/${exp_name}}
NNODES=${WORLD_SIZE:-16}
NODE_RANK=${RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
MASTER_PORT=${MASTER_PORT:-23457}

echo ">>> Node info: RANK $NODE_RANK / WORLD_SIZE $NNODES"
echo ">>> Comm info: MASTER $MASTER_ADDR : $MASTER_PORT"

if [ "$NODE_RANK" -eq 0 ]; then
    mkdir -p "${CKPT_HOME}"
fi

export WANDB_MODE=offline
export NCCL_DEBUG=WARN
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export PYTHONPATH=${PYTHONPATH:-}:/mnt/data/liuchonghan/verl_lao

torchrun \
--nnodes=${NNODES} \
--node_rank=${NODE_RANK} \
--master_addr=${MASTER_ADDR} \
--master_port=${MASTER_PORT} \
--nproc-per-node=8 \
${ENTRYPOINT} \
data.train_files="${TRAIN_FILES}" \
data.train_batch_size=${TRAIN_BATCH_SIZE} \
data.max_length=1024 \
data.pad_mode=${PAD_MODE} \
data.truncation=right \
data.use_dynamic_bsz=True \
data.max_token_len_per_gpu=10240 \
data.messages_key=messages \
data.ignore_input_ids_mismatch=True \
model.path=$MODEL_ID \
model.use_remove_padding=${USE_REMOVE_PADDING} \
+model.override_config.router_dtype="float16" \
model.enable_gradient_checkpointing=True \
${ENGINE_CONFIG} \
trainer.test_freq=-1 \
trainer.save_freq=2000 \
'trainer.logger=[console]' \
trainer.project_name="${project_name}" \
trainer.experiment_name="${exp_name}" \
trainer.total_epochs=${TOTAL_EPOCHS} \
trainer.default_local_dir="${CKPT_HOME}" \
trainer.resume_mode=${RESUME_MODE} \
trainer.max_ckpt_to_keep=2 \
'checkpoint.save_contents=[model,optimizer,extra,hf_model]'
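This launcher pairs `data.max_length=1024` with `data.use_dynamic_bsz=True` and `data.max_token_len_per_gpu=10240`, i.e. at most ten worst-case sequences per GPU micro-batch. A toy greedy sketch of the idea behind token-budget batching (illustrative only; verl's actual packing logic is not shown in this diff and is more sophisticated):

```python
def pack_by_token_budget(lengths: list[int], max_tokens_per_gpu: int) -> list[list[int]]:
    """Greedily group sequence indices into micro-batches so that the
    total token count of each batch stays within the per-GPU budget."""
    batches: list[list[int]] = []
    cur: list[int] = []
    cur_tokens = 0
    for i, n in enumerate(lengths):
        # Flush the current batch if adding this sequence would overflow it.
        if cur and cur_tokens + n > max_tokens_per_gpu:
            batches.append(cur)
            cur, cur_tokens = [], 0
        cur.append(i)
        cur_tokens += n
    if cur:
        batches.append(cur)
    return batches
```

With 25 sequences of the maximum length 1024 and a 10240-token budget, this packs ten sequences per batch; shorter real-world sequences pack more densely, which is the point of `use_dynamic_bsz` over a fixed micro-batch size.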
@@ -0,0 +1,115 @@
#!/usr/bin/env bash
set -xeuo pipefail

ENTRYPOINT=${ENTRYPOINT:-"-m verl.trainer.sft_trainer"}
TRAIN_FILES=${TRAIN_FILES:-/mnt/data/liuchonghan/translate_parquet/train_data.parquet}
backend=${BACKEND:-megatron}
project_name=verl_sft_translate_0109_aux
RESUME_MODE=disable
MODEL_ID=${MODEL_ID:-/mnt/data/liuchonghan/Qwen3-30B-A3B-Instruct-2507}

SP_SIZE=${SP_SIZE:-1}
FSDP_SIZE=${FSDP_SIZE:-64}
FSDP_STRATEGY=${FSDP_STRATEGY:-"fsdp2"}

TP_SIZE=${TP_SIZE:-4}
PP_SIZE=${PP_SIZE:-1}
EP_SIZE=${EP_SIZE:-8}
VPP_SIZE=${VPP_SIZE:-null}
CP_SIZE=${CP_SIZE:-1}

PAD_MODE=${PAD_MODE:-no_padding}
USE_REMOVE_PADDING=${USE_REMOVE_PADDING:-True}

FSDP_ENGINE_CONFIG="
engine=${backend} \
optim=${backend} \
optim.lr=5e-6 \
optim.lr_warmup_steps_ratio=0.05 \
optim.weight_decay=0.1 \
optim.betas="[0.9,0.95]" \
optim.clip_grad=1.0 \
optim.min_lr_ratio=0.1 \
optim.warmup_style=cosine \
engine.ulysses_sequence_parallel_size=${SP_SIZE} \
engine.strategy=${FSDP_STRATEGY} \
engine.fsdp_size=${FSDP_SIZE}"

MEGATRON_ENGINE_CONFIG="
engine=${backend} \
optim=${backend} \
optim.lr=5e-6 \
optim.lr_warmup_steps_ratio=0.05 \
optim.weight_decay=0.1 \
optim.betas="[0.9,0.95]" \
optim.clip_grad=1.0 \
optim.lr_warmup_init=0 \
optim.lr_decay_style=cosine \
optim.min_lr=5e-7 \
engine.tensor_model_parallel_size=${TP_SIZE} \
engine.pipeline_model_parallel_size=${PP_SIZE} \
engine.expert_model_parallel_size=${EP_SIZE} \
engine.context_parallel_size=${CP_SIZE} \
engine.use_mbridge=True \
+engine.override_transformer_config.moe_aux_loss_coeff=0.01 \
+engine.override_transformer_config.moe_z_loss_coeff=0.001 \
+engine.override_transformer_config.moe_router_load_balancing_type=aux_loss"

if [ "$backend" = "fsdp" ]; then
    ENGINE_CONFIG="$FSDP_ENGINE_CONFIG"
    echo "Using fsdp engine"
    exp_name=nvidia-qwen3-30b-moe-${backend}-${FSDP_STRATEGY}-sp${SP_SIZE}
else
    ENGINE_CONFIG="$MEGATRON_ENGINE_CONFIG"
    echo "Using megatron engine"
    exp_name=nvidia-qwen3-30b-moe-${backend}-tp${TP_SIZE}-pp${PP_SIZE}-ep${EP_SIZE}-vpp${VPP_SIZE}-cp${CP_SIZE}
fi

CKPT_HOME=${CKPT_HOME:-/mnt/data/liuchonghan/ckpt_verl/sft/${project_name}/${exp_name}}
NNODES=${WORLD_SIZE:-8}
NODE_RANK=${RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
MASTER_PORT=${MASTER_PORT:-23457}

echo ">>> Node info: RANK $NODE_RANK / WORLD_SIZE $NNODES"
echo ">>> Comm info: MASTER $MASTER_ADDR : $MASTER_PORT"

if [ "$NODE_RANK" -eq 0 ]; then
    mkdir -p "${CKPT_HOME}"
fi

export WANDB_MODE=offline
export NCCL_DEBUG=WARN
export PYTHONPATH=${PYTHONPATH:-}:/mnt/data/liuchonghan/verl

torchrun \
--nnodes=${NNODES} \
--node_rank=${NODE_RANK} \
--master_addr=${MASTER_ADDR} \
--master_port=${MASTER_PORT} \
--nproc-per-node=8 \
${ENTRYPOINT} \
data.train_files="${TRAIN_FILES}" \
data.train_batch_size=512 \
data.max_length=8192 \
data.pad_mode=${PAD_MODE} \
data.truncation=right \
data.use_dynamic_bsz=True \
data.max_token_len_per_gpu=49152 \
data.messages_key=messages \
model.path=$MODEL_ID \
model.use_remove_padding=${USE_REMOVE_PADDING} \
+model.override_config.output_router_logits=True \
+model.override_config.router_dtype="float32" \
model.enable_gradient_checkpointing=True \
${ENGINE_CONFIG} \
trainer.test_freq=-1 \
trainer.save_freq=5000 \
'trainer.logger=[console]' \
trainer.project_name="${project_name}" \
trainer.experiment_name="${exp_name}" \
trainer.total_epochs=2 \
trainer.default_local_dir="${CKPT_HOME}" \
trainer.resume_mode=${RESUME_MODE} \
trainer.max_ckpt_to_keep=3 \
'checkpoint.save_contents=[model,optimizer,extra]'
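All three launchers share the same `${VAR:-default}` convention: every knob reads an environment variable and falls back to a default, so a run is tuned from the scheduler (which exports `WORLD_SIZE`, `RANK`, `MASTER_ADDR`, `MASTER_PORT`) without editing the script. A minimal sketch of how the backend branch resolves, using a hypothetical `resolve_backend` helper:

```shell
# Hypothetical helper mirroring the scripts' BACKEND selection:
# defaults apply unless the caller exports an override.
resolve_backend() {
    local backend=${BACKEND:-megatron}
    if [ "$backend" = "fsdp" ]; then
        echo "fsdp sp=${SP_SIZE:-1}"
    else
        echo "megatron tp=${TP_SIZE:-4} pp=${PP_SIZE:-1}"
    fi
}
```

For example, `BACKEND=fsdp SP_SIZE=2 bash <script>` would select the FSDP engine config while leaving every Megatron knob at its default.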