
[BUG] verl Qwen3-235B-A22B save error #71

@zyfzjsc988

Description

About ten minutes after `Saving HF model checkpoint to {ckpt_path} with bridge` is printed, the error below is raised.

The checkpoint path is on a distributed filesystem, and `NCCL_TIMEOUT=1200` is set in the environment. I am using the qwen3vl_cp branch.

I also checked the result: all weight files were saved and stopped updating at 23:13, yet the error was not raised until 23:19 (a header check like the sketch below can confirm the shards are intact).
[screenshot: listing of the saved checkpoint files and their timestamps]
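
Since the shards appear complete, here is a minimal sketch for checking their headers on the distributed filesystem; `safe_open` parses the safetensors header up front, so a file truncated by the crash should fail fast. The glob pattern and the flat shard layout under `actor` are assumptions on my part:

```python
import glob

from safetensors import safe_open

# Hypothetical integrity check, not part of verl or mbridge: open each
# shard's header and count its tensors; a truncated file raises here.
ckpt = "/secspace/share/ckpt/338411_f0bb9c8b98/338411_88086da222e84d85bb59/checkpoint/global_step_10/actor"
for shard in sorted(glob.glob(f"{ckpt}/*.safetensors")):
    try:
        with safe_open(shard, framework="pt") as f:
            n = len(f.keys())
        print(f"OK  {shard} ({n} tensors)")
    except Exception as e:
        print(f"BAD {shard}: {e}")
```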

Full error message:

local_global_step_folder: /secspace/share/ckpt/338411_f0bb9c8b98/338411_88086da222e84d85bb59/checkpoint/global_step_10
INFO:2026-01-23 23:10:45,654:[Rank 169] Saving HF model checkpoint to /secspace/share/ckpt/338411_f0bb9c8b98/338411_88086da222e84d85bb59/checkpoint/global_step_10/actor with bridge
/opt/conda/lib/python3.12/site-packages/torch/autograd/graph.py:829: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /job_3594441/source/pytorch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:62.) [repeated 15x across cluster]
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass [repeated 15x across cluster]
*** SIGBUS received at time=1769181593 on cpu 48 ***
PC: @     0x7fc78e59bb43  (unknown)  __memcpy_avx_unaligned_erms
    @     0x7fc78e986100  (unknown)  (unknown)
    @     0x560224616ec0  (unknown)  (unknown)
[2026-01-23 23:19:53,577 E 1423 4428] logging.cc:501: *** SIGBUS received at time=1769181593 on cpu 48 ***
[2026-01-23 23:19:53,577 E 1423 4428] logging.cc:501: PC: @     0x7fc78e59bb43  (unknown)  __memcpy_avx_unaligned_erms
[2026-01-23 23:19:53,587 E 1423 4428] logging.cc:501:     @     0x7fc78e986100  (unknown)  (unknown)
[2026-01-23 23:19:53,596 E 1423 4428] logging.cc:501:     @     0x560224616ec0  (unknown)  (unknown)
Fatal Python error: Bus error

Thread 0x00007f96168fe640 (most recent call first):
  <no Python frame>

Thread 0x00007f8d7f24f640 (most recent call first):
  <no Python frame>

Thread 0x00007f92126fd640 (most recent call first):
  File "/opt/conda/lib/python3.12/threading.py", line 359 in wait
  File "/opt/conda/lib/python3.12/threading.py", line 655 in wait
  File "/opt/conda/lib/python3.12/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/conda/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/opt/conda/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007f9614bff640 (most recent call first):
  File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 61 in _recv_msg
  File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 195 in _read_thread
  File "/opt/conda/lib/python3.12/threading.py", line 1012 in run
  File "/opt/conda/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/opt/conda/lib/python3.12/threading.py", line 1032 in _bootstrap

Current thread 0x00007f96170ff640 (most recent call first):
  File "/opt/conda/lib/python3.12/site-packages/safetensors/torch.py", line 545 in _tobytes
  File "/opt/conda/lib/python3.12/site-packages/safetensors/torch.py", line 589 in _flatten
  File "/opt/conda/lib/python3.12/site-packages/safetensors/torch.py", line 352 in save_file
  File "/opt/conda/lib/python3.12/site-packages/mbridge/core/safetensor_io.py", line 207 in save_hf_weight_merge
  File "/opt/conda/lib/python3.12/site-packages/mbridge/core/bridge.py", line 282 in _save_weights_fast
  File "/opt/conda/lib/python3.12/site-packages/mbridge/core/bridge.py", line 319 in save_weights
  File "/root/verl/verl/utils/checkpoint/megatron_checkpoint_manager.py", line 505 in save_checkpoint
  File "/root/verl/verl/workers/engine/megatron/transformer_impl.py", line 442 in save_checkpoint
  File "/root/verl/verl/workers/engine_workers.py", line 343 in save_checkpoint
  File "/root/verl/verl/utils/transferqueue_utils.py", line 314 in dummy_inner
  File "/root/verl/verl/single_controller/base/decorator.py", line 456 in inner
  File "/root/verl/verl/workers/engine_workers.py", line 541 in save_checkpoint
  File "/root/verl/verl/utils/transferqueue_utils.py", line 314 in dummy_inner
  File "/root/verl/verl/single_controller/base/decorator.py", line 456 in inner
  File "/root/verl/verl/single_controller/ray/base.py", line 844 in func
  File "/opt/conda/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 461 in _resume_span
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/function_manager.py", line 689 in actor_method_executor
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/async_compat.py", line 50 in wrapper
  File "/opt/conda/lib/python3.12/threading.py", line 1012 in run
  File "/opt/conda/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/opt/conda/lib/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007fc78e81b740 (most recent call first):
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 974 in main_loop
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/workers/default_worker.py", line 321 in <module>

Error executing job with overrides: ['custom_reward_function.path=/workspace/bin/rl_reward.py', 'reward_model.reward_manager=dapo', '+reward_model.reward_kwargs.overlong_buffer_cfg.enable=True', '+reward_model.reward_kwargs.overlong_buffer_cfg.len=1024', '+reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=1', '+reward_model.reward_kwargs.overlong_buffer_cfg.log=False', '+reward_model.reward_kwargs.max_resp_len=2048', 'algorithm.adv_estimator=grpo', 'algorithm.use_kl_in_reward=False', 'algorithm.kl_ctrl.kl_coef=0.001', 'algorithm.gamma=1.0', 'algorithm.lam=0.95', 'data.train_files=[/secspace/share/data/338411/data/verl-team/math/train.parquet]', 'data.val_files=[/secspace/share/data/338411/data/verl-team/math/test.parquet]', 'data.train_batch_size=256', 'data.max_prompt_length=6144', 'data.max_response_length=2048', 'data.filter_overlong_prompts=True', 'data.filter_overlong_prompts_workers=64', 'data.truncation=error', 'data.trust_remote_code=True', 'actor_rollout_ref.model.path=/secspace/share/model/Qwen3-235B-A22B', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.model.trust_remote_code=True', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.optim.lr=1e-06', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.01', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.actor.entropy_coeff=0', 'actor_rollout_ref.actor.use_dynamic_bsz=True', 'actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24576', 'actor_rollout_ref.actor.checkpoint.save_contents=[model]', 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.actor.megatron.context_parallel_size=1', 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=8', 'actor_rollout_ref.actor.megatron.expert_model_parallel_size=8', 'actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=1', 'actor_rollout_ref.actor.megatron.param_offload=True', 'actor_rollout_ref.actor.megatron.grad_offload=True', 'actor_rollout_ref.actor.megatron.optimizer_offload=True', 'actor_rollout_ref.actor.megatron.use_mbridge=True', '+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform', '+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full', '+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1', '+actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True', '+actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True', 'actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.ref.megatron.context_parallel_size=1', 'actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=8', 'actor_rollout_ref.ref.megatron.expert_model_parallel_size=8', 'actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=1', 'actor_rollout_ref.ref.megatron.param_offload=True', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.mode=async', 'actor_rollout_ref.rollout.tensor_model_parallel_size=8', 'actor_rollout_ref.rollout.data_parallel_size=1', 'actor_rollout_ref.rollout.expert_parallel_size=1', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.8', 'actor_rollout_ref.rollout.n=8', 'trainer.default_local_dir=/secspace/share/ckpt/338411_f0bb9c8b98/338411_88086da222e84d85bb59/checkpoint', 'trainer.use_legacy_worker_impl=disable', 'trainer.critic_warmup=0', 'trainer.logger=[console,wandb,tensorboard]', 'trainer.n_gpus_per_node=16', 
'trainer.nnodes=16', 'trainer.val_before_train=True', 'trainer.project_name=Qwen3-235B-A22B_338411_88086da222e84d85bb59', 'trainer.experiment_name=338411_88086da222e84d85bb59', 'trainer.save_freq=10', 'trainer.test_freq=10', 'trainer.total_epochs=1', 'trainer.max_actor_ckpt_to_keep=3', 'trainer.max_critic_ckpt_to_keep=3', 'actor_rollout_ref.rollout.val_kwargs.top_p=0.95', 'actor_rollout_ref.rollout.val_kwargs.temperature=0.1', 'actor_rollout_ref.rollout.val_kwargs.top_k=50', '+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=5', '+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=5']
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/verl/verl/trainer/main_ppo.py", line 443, in <module>
    main()
  File "/opt/conda/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/verl/verl/trainer/main_ppo.py", line 45, in main
    run_ppo(config)
  File "/root/verl/verl/trainer/main_ppo.py", line 99, in run_ppo
    ray.get(runner.run.remote(config))
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 2858, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 958, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=16296, ip=33.212.70.155, actor_id=3fe72ba8e3fed2d0bff5686202000000, repr=<main_ppo.TaskRunner object at 0x7f41a326a900>)
  File "/root/verl/verl/trainer/main_ppo.py", line 362, in run
    trainer.fit()
  File "/root/verl/verl/trainer/ppo/ray_trainer.py", line 1654, in fit
    self._save_checkpoint()
  File "/root/verl/verl/trainer/ppo/ray_trainer.py", line 982, in _save_checkpoint
    self.actor_rollout_wg.save_checkpoint(
  File "/root/verl/verl/single_controller/ray/base.py", line 54, in __call__
    output = ray.get(output)
             ^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_ref_save_checkpoint() (pid=1454, ip=33.212.68.103, actor_id=a334a21532b849c8ac6fe2eb02000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7ecaf4fe9f10>)
  File "/opt/conda/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/root/verl/verl/single_controller/ray/base.py", line 844, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/verl/verl/single_controller/base/decorator.py", line 456, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/verl/verl/utils/transferqueue_utils.py", line 314, in dummy_inner
    output = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/root/verl/verl/workers/engine_workers.py", line 541, in save_checkpoint
    self.actor.save_checkpoint(local_path, hdfs_path, global_step, max_ckpt_to_keep)
  File "/root/verl/verl/single_controller/base/decorator.py", line 456, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/verl/verl/utils/transferqueue_utils.py", line 314, in dummy_inner
    output = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/root/verl/verl/workers/engine_workers.py", line 343, in save_checkpoint
    return self.engine.save_checkpoint(local_path, hdfs_path, global_step, max_ckpt_to_keep)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/verl/verl/workers/engine/megatron/transformer_impl.py", line 442, in save_checkpoint
    self.checkpoint_mananager.save_checkpoint(
  File "/root/verl/verl/utils/checkpoint/megatron_checkpoint_manager.py", line 505, in save_checkpoint
    self.bridge.save_weights(
  File "/opt/conda/lib/python3.12/site-packages/mbridge/core/bridge.py", line 319, in save_weights
    return self._save_weights_fast(per_tensor_generator, weights_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/mbridge/core/bridge.py", line 291, in _save_weights_fast
    torch.distributed.barrier()
  File "/opt/conda/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4818, in barrier
    work.wait()
RuntimeError: [/job_3594441/source/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [33.212.68.49]:56214
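
My reading of the failure chain, for what it's worth: the final Gloo error is `Connection closed by peer`, not a timeout, so `NCCL_TIMEOUT` is probably not the relevant knob. The rank at 33.212.68.49 seems to have died first, with the SIGBUS inside `safetensors` `_tobytes` (the `__memcpy_avx_unaligned_erms` frame). A SIGBUS mid-memcpy usually means a mapped page became inaccessible during the copy, e.g. `/dev/shm` or the distributed filesystem running out of space or invalidating the mapping while the shard was being written. The surviving ranks then hit the barrier in `_save_weights_fast` and saw the dead peer's connection close.

If that is right, a possible workaround is to keep the large safetensors write off the network filesystem. A hedged sketch; `save_file_via_local` is a hypothetical helper, not an mbridge API:

```python
import os
import shutil
import tempfile

from safetensors.torch import save_file


def save_file_via_local(tensors, dest_path, metadata=None):
    """Write the .safetensors file to node-local disk first, then copy
    it to the distributed filesystem, so the big memcpy inside
    safetensors never touches a network-backed mapping."""
    fd, local_path = tempfile.mkstemp(suffix=".safetensors")
    os.close(fd)
    try:
        save_file(tensors, local_path, metadata=metadata)
        shutil.copyfile(local_path, dest_path)  # plain buffered copy to the share
    finally:
        os.remove(local_path)
```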
