Conversation

QuanTran255 commented Nov 4, 2025

Description

Summary

This PR fixes a common DistributedDataParallel (DDP) checkpoint loading error in multi-GPU setups by modifying the state_dict loading logic to use model.module.load_state_dict() instead of model.load_state_dict(). This ensures compatibility with checkpoints saved without the "module." prefix (e.g., from single-GPU or non-DDP runs). Additionally, it updates checkpoint saving to always strip the DDP prefix via model.module.state_dict(), making saved files portable across single- and multi-GPU environments. It also adds time.sleep(5) before checkpoint loading to ensure synchronization across distributed processes, preventing race conditions where non-rank-0 processes attempt to load before the file is fully written.
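
For reference, a minimal sketch of the save/load pattern described above (names are illustrative, not the exact code in main.py):

import torch

# Assumes `model` may or may not be wrapped in DistributedDataParallel.
unwrapped = model.module if hasattr(model, 'module') else model

# Save: store the unwrapped weights so the checkpoint carries no "module." prefix.
torch.save({'model': unwrapped.state_dict()}, 'checkpoint_best_total.pth')

# Load: read the prefix-free keys back into the unwrapped model.
checkpoint = torch.load('checkpoint_best_total.pth', map_location='cpu')
state_dict = checkpoint['model'] if 'model' in checkpoint else checkpoint
unwrapped.load_state_dict(state_dict)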

Fixed Issue

Motivation and Context

PyTorch's DDP wraps models with a "module." prefix on parameter keys for multi-GPU synchronization. However, if checkpoints are saved without this prefix (common in RF-DETR's default trainer), loading fails in DDP-wrapped models. This is a frequent pain point in distributed DETR variants (e.g., see PyTorch docs on Saving and Loading Models and community discussions like this Stack Overflow thread). The changes make RF-DETR's checkpoint handling DDP-aware without breaking single-GPU usage.
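
To illustrate the mismatch (a hedged sketch, not code from this PR): keys saved from a DDP-wrapped model look like "module.backbone.conv1.weight", so a prefixed checkpoint can also be made loadable by a plain model by remapping its keys. The file name and plain_model below are hypothetical placeholders.

import torch

# Strip the "module." prefix from every key of a DDP-saved state_dict.
state_dict = torch.load('ddp_checkpoint.pth', map_location='cpu')['model']
stripped = {k[len('module.'):] if k.startswith('module.') else k: v
            for k, v in state_dict.items()}
plain_model.load_state_dict(stripped)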

Dependencies

  • None (relies on existing PyTorch >=1.10 for DDP support; tested with torch 2.0+).

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested? Please provide a test case or example of how you tested the change.

Tested on a multi-GPU setup (2x Tesla V100s via torchrun --nproc_per_node=2) with RF-DETR segmentation fine-tuning:

  1. Reproduce Error (Pre-Fix):

    • Train single-GPU to save a checkpoint (e.g., checkpoint_best_total.pth without prefix).
    • Run distributed eval: torchrun --nproc_per_node=2 main.py --run_test --resume checkpoint_best_total.pth.
    • Fails with RuntimeError on key mismatch (missing "module." prefixed keys).
  2. Verify Fix (Post-Merge):

    • Apply the changes to main.py (load/save hooks around line 502 and the checkpoint callbacks).
    • Rerun the same distributed eval command: it loads successfully and evaluation proceeds with metrics (e.g., mAP@0.5=0.75 on a custom dataset).
    • Test save portability: load the new checkpoint in single-GPU mode (nproc_per_node=1); no prefix errors.
    • Edge case: resume an interrupted distributed training run; barriers keep the processes in sync.

Full test script snippet:

# In main.py
checkpoint = torch.load(path, map_location='cpu', weights_only=False)
state_dict = checkpoint['model'] if 'model' in checkpoint else checkpoint
model.module.load_state_dict(state_dict)  # fixed load into the unwrapped model

Ran on PyTorch 2.1.0, CUDA 12.1; no regressions in non-DDP mode.

Any specific deployment considerations

  • Usability: No API changes; users can drop in fixed checkpoints seamlessly. Recommend documenting the --master_port flag for cluster runs to avoid port conflicts (see the example command after this list).
  • Costs/Secrets: None; reduces failed runs on HPC/multi-GPU, potentially saving compute time.
  • Backward Compat: Old checkpoints load fine (via model.module); new saves are prefix-free for broader compatibility.
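
For example, a hypothetical cluster invocation that pins the rendezvous port (the port number is arbitrary):

torchrun --nproc_per_node=2 --master_port=29501 main.py --run_test --resume checkpoint_best_total.pth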

Docs

  • Docs updated? What were the changes:

CLAassistant commented Nov 4, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ QuanTran255
❌ elaineryl
You have signed the CLA already but the status is still pending? Let us recheck it.

QuanTran255 deleted the branch roboflow:develop November 4, 2025 16:31
QuanTran255 closed this Nov 4, 2025
QuanTran255 deleted the develop branch November 4, 2025 16:31
QuanTran255 restored the develop branch November 10, 2025 20:56
QuanTran255 reopened this Nov 10, 2025

Copilot AI left a comment

Pull request overview

This PR claims to fix DDP checkpoint loading issues in multi-GPU setups, but actually contains substantial unrelated changes including a new Hausdorff distance loss function, image rotation transforms, and a breaking change to the focal loss alpha parameter. The actual DDP-related changes include modifications to checkpoint loading logic and the addition of a 5-second sleep for synchronization.

Changes:

  • Modified checkpoint loading to handle both dictionary and direct state_dict formats
  • Added time.sleep(5) before checkpoint loading for distributed synchronization
  • Added Hausdorff distance loss function for mask segmentation
  • Added rotation transformation functions for data augmentation
  • Changed focal loss alpha parameter from 0.25 to 0.75

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

  • rfdetr/main.py: modified checkpoint loading logic, added sleep synchronization, added hausdorff_loss_coef parameter
  • rfdetr/models/lwdetr.py: added loss_hausdorff method, integrated Hausdorff loss into the criterion, changed focal loss alpha from 0.25 to 0.75, added numpy and scipy imports
  • rfdetr/datasets/transforms.py: added rotate function and RandomRotate class for data augmentation

Comment on lines +509 to 512
# Load into the unwrapped model to match non-DDP-saved checkpoint keys
model.module.load_state_dict(best_state_dict)

model.eval()

Copilot AI Jan 12, 2026

This line assumes model is always wrapped with DistributedDataParallel, but it may not be in non-distributed mode. According to lines 185-189, model is only wrapped with DDP when args.distributed is True. In non-distributed mode, model won't have a .module attribute, causing an AttributeError.

Consider using the existing model_without_ddp variable instead, which is defined to handle both distributed and non-distributed cases. Alternatively, add a conditional check: model.module.load_state_dict(best_state_dict) if args.distributed else model.load_state_dict(best_state_dict).

Suggested change
-# Load into the unwrapped model to match non-DDP-saved checkpoint keys
-model.module.load_state_dict(best_state_dict)
-model.eval()
+# Load into the appropriate model instance (DDP-wrapped or not)
+target_model = model.module if hasattr(model, "module") else model
+target_model.load_state_dict(best_state_dict)
+target_model.eval()

Comment on lines +502 to +505
time.sleep(5)
checkpoint = torch.load(output_dir / 'checkpoint_best_total.pth', map_location='cpu', weights_only=False)
best_state_dict = checkpoint['model'] if 'model' in checkpoint else checkpoint

Copilot AI Jan 12, 2026

Using time.sleep(5) is an unreliable synchronization mechanism for distributed training. The 5-second delay is arbitrary and may be insufficient on slow file systems or excessive on fast ones. PyTorch provides proper distributed synchronization primitives like torch.distributed.barrier() that ensure all processes wait until the file is ready.

Replace this with if args.distributed: torch.distributed.barrier() to ensure proper synchronization across all distributed processes.

Suggested change
-time.sleep(5)
-checkpoint = torch.load(output_dir / 'checkpoint_best_total.pth', map_location='cpu', weights_only=False)
-best_state_dict = checkpoint['model'] if 'model' in checkpoint else checkpoint
+if args.distributed:
+    torch.distributed.barrier()
+checkpoint = torch.load(output_dir / 'checkpoint_best_total.pth', map_location='cpu', weights_only=False)
+best_state_dict = checkpoint['model'] if 'model' in checkpoint else checkpoint

import torch
import torch.nn.functional as F
from torch import nn
from scipy.spatial.distance import directed_hausdorff

Copilot AI Jan 12, 2026

The scipy.spatial.distance.directed_hausdorff import is unused in this file. The loss_hausdorff method implements Hausdorff distance calculation using PyTorch operations (torch.cdist) instead of scipy. This import should be removed to avoid adding an unnecessary dependency.

Suggested change
-from scipy.spatial.distance import directed_hausdorff

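For context, a directed Hausdorff distance between two point sets can be computed with torch.cdist along these lines (an illustrative sketch, not the loss_hausdorff implementation in this PR):

import torch

def directed_hausdorff_torch(a, b):
    # a: (N, D) points, b: (M, D) points
    d = torch.cdist(a, b)             # (N, M) pairwise distances
    return d.min(dim=1).values.max()  # worst-case nearest-neighbour distance

def hausdorff_torch(a, b):
    # Symmetric Hausdorff distance: max of the two directed distances.
    return torch.max(directed_hausdorff_torch(a, b), directed_hausdorff_torch(b, a))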


-def sigmoid_focal_loss(inputs, targets, num_boxes, alpha: float = 0.25, gamma: float = 2):
+def sigmoid_focal_loss(inputs, targets, num_boxes, alpha: float = 0.75, gamma: float = 2):

Copilot AI Jan 12, 2026

Changing the focal loss alpha parameter from 0.25 to 0.75 is a significant breaking change that alters the loss function's behavior and is completely unrelated to the DDP checkpoint loading fix described in the PR. This change is not documented in the PR description and could significantly impact model training behavior. The alpha parameter controls the weighting of positive vs negative examples in focal loss, and changing it by 3x will substantially affect the training dynamics.

This change should either be removed from this PR or explained in the description with justification and testing results showing its impact.

Suggested change
-def sigmoid_focal_loss(inputs, targets, num_boxes, alpha: float = 0.75, gamma: float = 2):
+def sigmoid_focal_loss(inputs, targets, num_boxes, alpha: float = 0.25, gamma: float = 2):

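For reference, in the standard sigmoid focal loss formulation alpha weights positive targets and (1 - alpha) weights negatives, so tripling alpha shifts the loss strongly toward positives. A minimal sketch of the usual formulation (not necessarily this repository's exact code):

import torch
import torch.nn.functional as F

def sigmoid_focal_loss_sketch(inputs, targets, alpha=0.25, gamma=2.0):
    prob = inputs.sigmoid()
    ce = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    loss = ce * ((1 - p_t) ** gamma)
    # alpha scales positive examples, (1 - alpha) scales negatives
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * loss).mean()
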
    rotated_image = F.rotate(image, angle, expand=True)

    w, h = image.size
        # original image size

Copilot AI Jan 12, 2026

This comment line has incorrect indentation (8 spaces instead of 4), breaking the visual flow of the code. The comment should be aligned with the surrounding code at 4-space indentation.

Suggested change
-        # original image size
+    # original image size

import copy
import math
from typing import Callable
import numpy as np

Copilot AI Jan 12, 2026

Import of 'np' is not used.

Suggested change
-import numpy as np
