About 10 minutes after `Saving HF model checkpoint to {ckpt_path} with bridge` is printed, an error is raised.
The checkpoint path is on a distributed filesystem, and `NCCL_TIMEOUT=1200` is set in the environment. I am using the `qwen3vl_cp` branch.
I also checked the result: all weight files were saved and were not modified after 23:13, but the error was raised at 23:19.
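A SIGBUS inside `__memcpy_avx_unaligned_erms` (see the log below) is the classic symptom of touching an mmap-backed page whose backing file on a network filesystem was truncated or became unreachable. As a diagnostic, it may help to rule the distributed filesystem in or out with a small probe that mmaps a scratch file on the target path and faults in every page. This is a hypothetical sketch, not part of verl/mbridge; the function name and probe size are made up here:

```python
import mmap
import os

def probe_mmap_on_fs(directory, size=1 << 20):
    """Hedged diagnostic sketch: mmap a scratch file on the target filesystem
    (e.g. the /secspace checkpoint path) and read every page. If this raises
    SIGBUS, the filesystem's mmap behavior is suspect, independent of
    verl/mbridge/safetensors."""
    path = os.path.join(directory, ".mmap_probe")
    with open(path, "wb") as f:
        f.write(b"\0" * size)  # all-zero scratch file
    try:
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as m:
                # Touch one byte per page so every page is actually faulted in.
                checksum = sum(m[i] for i in range(0, size, mmap.PAGESIZE))
    finally:
        os.remove(path)
    return checksum  # 0 for an all-zero file
```

Running this on the local disk and on the distributed checkpoint path from the same node would show whether the shared filesystem alone can trigger the bus error.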

full error message:
local_global_step_folder: /secspace/share/ckpt/338411_f0bb9c8b98/338411_88086da222e84d85bb59/checkpoint/global_step_10
INFO:2026-01-23 23:10:45,654:[Rank 169] Saving HF model checkpoint to /secspace/share/ckpt/338411_f0bb9c8b98/338411_88086da222e84d85bb59/checkpoint/global_step_10/actor with bridge
/opt/conda/lib/python3.12/site-packages/torch/autograd/graph.py:829: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /job_3594441/source/pytorch/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:62.) [repeated 15x across cluster]
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [repeated 15x across cluster]
*** SIGBUS received at time=1769181593 on cpu 48 ***
PC: @ 0x7fc78e59bb43 (unknown) __memcpy_avx_unaligned_erms
@ 0x7fc78e986100 (unknown) (unknown)
@ 0x560224616ec0 (unknown) (unknown)
[2026-01-23 23:19:53,577 E 1423 4428] logging.cc:501: *** SIGBUS received at time=1769181593 on cpu 48 ***
[2026-01-23 23:19:53,577 E 1423 4428] logging.cc:501: PC: @ 0x7fc78e59bb43 (unknown) __memcpy_avx_unaligned_erms
[2026-01-23 23:19:53,587 E 1423 4428] logging.cc:501: @ 0x7fc78e986100 (unknown) (unknown)
[2026-01-23 23:19:53,596 E 1423 4428] logging.cc:501: @ 0x560224616ec0 (unknown) (unknown)
Fatal Python error: Bus error
Thread 0x00007f96168fe640 (most recent call first):
<no Python frame>
Thread 0x00007f8d7f24f640 (most recent call first):
<no Python frame>
Thread 0x00007f92126fd640 (most recent call first):
File "/opt/conda/lib/python3.12/threading.py", line 359 in wait
File "/opt/conda/lib/python3.12/threading.py", line 655 in wait
File "/opt/conda/lib/python3.12/site-packages/tqdm/_monitor.py", line 60 in run
File "/opt/conda/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
File "/opt/conda/lib/python3.12/threading.py", line 1032 in _bootstrap
Thread 0x00007f9614bff640 (most recent call first):
File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 61 in _recv_msg
File "/opt/conda/lib/python3.12/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 195 in _read_thread
File "/opt/conda/lib/python3.12/threading.py", line 1012 in run
File "/opt/conda/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
File "/opt/conda/lib/python3.12/threading.py", line 1032 in _bootstrap
Current thread 0x00007f96170ff640 (most recent call first):
File "/opt/conda/lib/python3.12/site-packages/safetensors/torch.py", line 545 in _tobytes
File "/opt/conda/lib/python3.12/site-packages/safetensors/torch.py", line 589 in _flatten
File "/opt/conda/lib/python3.12/site-packages/safetensors/torch.py", line 352 in save_file
File "/opt/conda/lib/python3.12/site-packages/mbridge/core/safetensor_io.py", line 207 in save_hf_weight_merge
File "/opt/conda/lib/python3.12/site-packages/mbridge/core/bridge.py", line 282 in _save_weights_fast
File "/opt/conda/lib/python3.12/site-packages/mbridge/core/bridge.py", line 319 in save_weights
File "/root/verl/verl/utils/checkpoint/megatron_checkpoint_manager.py", line 505 in save_checkpoint
File "/root/verl/verl/workers/engine/megatron/transformer_impl.py", line 442 in save_checkpoint
File "/root/verl/verl/workers/engine_workers.py", line 343 in save_checkpoint
File "/root/verl/verl/utils/transferqueue_utils.py", line 314 in dummy_inner
File "/root/verl/verl/single_controller/base/decorator.py", line 456 in inner
File "/root/verl/verl/workers/engine_workers.py", line 541 in save_checkpoint
File "/root/verl/verl/utils/transferqueue_utils.py", line 314 in dummy_inner
File "/root/verl/verl/single_controller/base/decorator.py", line 456 in inner
File "/root/verl/verl/single_controller/ray/base.py", line 844 in func
File "/opt/conda/lib/python3.12/site-packages/ray/util/tracing/tracing_helper.py", line 461 in _resume_span
File "/opt/conda/lib/python3.12/site-packages/ray/_private/function_manager.py", line 689 in actor_method_executor
File "/opt/conda/lib/python3.12/site-packages/ray/_private/async_compat.py", line 50 in wrapper
File "/opt/conda/lib/python3.12/threading.py", line 1012 in run
File "/opt/conda/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
File "/opt/conda/lib/python3.12/threading.py", line 1032 in _bootstrap
Thread 0x00007fc78e81b740 (most recent call first):
File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 974 in main_loop
File "/opt/conda/lib/python3.12/site-packages/ray/_private/workers/default_worker.py", line 321 in <module>
Error executing job with overrides: ['custom_reward_function.path=/workspace/bin/rl_reward.py', 'reward_model.reward_manager=dapo', '+reward_model.reward_kwargs.overlong_buffer_cfg.enable=True', '+reward_model.reward_kwargs.overlong_buffer_cfg.len=1024', '+reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=1', '+reward_model.reward_kwargs.overlong_buffer_cfg.log=False', '+reward_model.reward_kwargs.max_resp_len=2048', 'algorithm.adv_estimator=grpo', 'algorithm.use_kl_in_reward=False', 'algorithm.kl_ctrl.kl_coef=0.001', 'algorithm.gamma=1.0', 'algorithm.lam=0.95', 'data.train_files=[/secspace/share/data/338411/data/verl-team/math/train.parquet]', 'data.val_files=[/secspace/share/data/338411/data/verl-team/math/test.parquet]', 'data.train_batch_size=256', 'data.max_prompt_length=6144', 'data.max_response_length=2048', 'data.filter_overlong_prompts=True', 'data.filter_overlong_prompts_workers=64', 'data.truncation=error', 'data.trust_remote_code=True', 'actor_rollout_ref.model.path=/secspace/share/model/Qwen3-235B-A22B', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.model.trust_remote_code=True', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.optim.lr=1e-06', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.01', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.actor.entropy_coeff=0', 'actor_rollout_ref.actor.use_dynamic_bsz=True', 'actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24576', 'actor_rollout_ref.actor.checkpoint.save_contents=[model]', 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.actor.megatron.context_parallel_size=1', 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=8', 'actor_rollout_ref.actor.megatron.expert_model_parallel_size=8', 'actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=1', 'actor_rollout_ref.actor.megatron.param_offload=True', 
'actor_rollout_ref.actor.megatron.grad_offload=True', 'actor_rollout_ref.actor.megatron.optimizer_offload=True', 'actor_rollout_ref.actor.megatron.use_mbridge=True', '+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform', '+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full', '+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1', '+actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=True', '+actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True', 'actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.ref.megatron.context_parallel_size=1', 'actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=8', 'actor_rollout_ref.ref.megatron.expert_model_parallel_size=8', 'actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=1', 'actor_rollout_ref.ref.megatron.param_offload=True', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.mode=async', 'actor_rollout_ref.rollout.tensor_model_parallel_size=8', 'actor_rollout_ref.rollout.data_parallel_size=1', 'actor_rollout_ref.rollout.expert_parallel_size=1', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.8', 'actor_rollout_ref.rollout.n=8', 'trainer.default_local_dir=/secspace/share/ckpt/338411_f0bb9c8b98/338411_88086da222e84d85bb59/checkpoint', 'trainer.use_legacy_worker_impl=disable', 'trainer.critic_warmup=0', 'trainer.logger=[console,wandb,tensorboard]', 'trainer.n_gpus_per_node=16', 'trainer.nnodes=16', 'trainer.val_before_train=True', 'trainer.project_name=Qwen3-235B-A22B_338411_88086da222e84d85bb59', 'trainer.experiment_name=338411_88086da222e84d85bb59', 'trainer.save_freq=10', 'trainer.test_freq=10', 'trainer.total_epochs=1', 'trainer.max_actor_ckpt_to_keep=3', 'trainer.max_critic_ckpt_to_keep=3', 'actor_rollout_ref.rollout.val_kwargs.top_p=0.95', 'actor_rollout_ref.rollout.val_kwargs.temperature=0.1', 
'actor_rollout_ref.rollout.val_kwargs.top_k=50', '+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=5', '+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=5']
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/root/verl/verl/trainer/main_ppo.py", line 443, in <module>
main()
File "/opt/conda/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
^^^^^^
File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/opt/conda/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/trainer/main_ppo.py", line 45, in main
run_ppo(config)
File "/root/verl/verl/trainer/main_ppo.py", line 99, in run_ppo
ray.get(runner.run.remote(config))
File "/opt/conda/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 2858, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/ray/_private/worker.py", line 958, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=16296, ip=33.212.70.155, actor_id=3fe72ba8e3fed2d0bff5686202000000, repr=<main_ppo.TaskRunner object at 0x7f41a326a900>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/trainer/main_ppo.py", line 362, in run
trainer.fit()
File "/root/verl/verl/trainer/ppo/ray_trainer.py", line 1654, in fit
self._save_checkpoint()
File "/root/verl/verl/trainer/ppo/ray_trainer.py", line 982, in _save_checkpoint
self.actor_rollout_wg.save_checkpoint(
File "/root/verl/verl/single_controller/ray/base.py", line 54, in __call__
output = ray.get(output)
^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_ref_save_checkpoint() (pid=1454, ip=33.212.68.103, actor_id=a334a21532b849c8ac6fe2eb02000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7ecaf4fe9f10>)
File "/opt/conda/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/single_controller/ray/base.py", line 844, in func
return getattr(self.worker_dict[key], name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/single_controller/base/decorator.py", line 456, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/utils/transferqueue_utils.py", line 314, in dummy_inner
output = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/workers/engine_workers.py", line 541, in save_checkpoint
self.actor.save_checkpoint(local_path, hdfs_path, global_step, max_ckpt_to_keep)
File "/root/verl/verl/single_controller/base/decorator.py", line 456, in inner
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/utils/transferqueue_utils.py", line 314, in dummy_inner
output = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/workers/engine_workers.py", line 343, in save_checkpoint
return self.engine.save_checkpoint(local_path, hdfs_path, global_step, max_ckpt_to_keep)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/workers/engine/megatron/transformer_impl.py", line 442, in save_checkpoint
self.checkpoint_mananager.save_checkpoint(
File "/root/verl/verl/utils/checkpoint/megatron_checkpoint_manager.py", line 505, in save_checkpoint
self.bridge.save_weights(
File "/opt/conda/lib/python3.12/site-packages/mbridge/core/bridge.py", line 319, in save_weights
return self._save_weights_fast(per_tensor_generator, weights_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/mbridge/core/bridge.py", line 291, in _save_weights_fast
torch.distributed.barrier()
File "/opt/conda/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 4818, in barrier
work.wait()
RuntimeError: [/job_3594441/source/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:544] Connection closed by peer [33.212.68.49]:56214
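Reading the traceback, the root failure looks like the SIGBUS on Rank 169 inside `safetensors.torch._tobytes` during `save_file`; the gloo `Connection closed by peer` at the barrier in `_save_weights_fast` on the other ranks is then just the downstream symptom of that rank dying mid-save. If the distributed filesystem turns out to be the trigger, one possible workaround (a hypothetical sketch, not verl/mbridge API; `save_checkpoint_local_then_copy`, `save_fn`, and the file name are made up here) is to serialize to node-local disk first and then do a plain sequential copy to the shared path:

```python
import os
import shutil
import tempfile

def save_checkpoint_local_then_copy(state_dict, final_dir, save_fn):
    """Hedged workaround sketch: write the safetensors file to node-local
    scratch space first, then copy it to the distributed filesystem with a
    plain sequential write. This keeps safetensors' serialization (the step
    that hit SIGBUS in _tobytes) entirely on local storage."""
    os.makedirs(final_dir, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        local_path = os.path.join(tmp, "model.safetensors")
        save_fn(state_dict, local_path)      # e.g. safetensors.torch.save_file
        shutil.copy2(local_path, final_dir)  # sequential copy to the shared FS
    return os.path.join(final_dir, "model.safetensors")
```

Whether this is viable depends on the per-node scratch capacity for a model of this size, so it is only a sketch of the direction, not a drop-in fix.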