
[trainer] feat: add support for the GDPO algorithm #5409

Open
yue-zeng-yue wants to merge 3 commits into verl-project:main from yue-zeng-yue:feat-gdpo

Conversation


@yue-zeng-yue commented Feb 26, 2026

Description

This PR introduces support for the GDPO algorithm into the training framework. It enables the system to handle multiple reward functions effectively, which is a key requirement for multi-objective reinforcement learning tasks.

Reference Paper & Experimental Results

GDPO Paper: https://arxiv.org/abs/2601.05242

Performance Benchmarks:

Tool Calling: GDPO increases the correct format ratio from 76.33% to 80.66% and improves overall accuracy from 30.18% to 32.81% (BFCL-v3).

Math Reasoning: On AIME (DeepSeek-R1-7B), GDPO improves accuracy from 50.2% to 53.1% and reduces length-exceeding violations from 2.1% to 0.2%.

Coding Reasoning: In CodeContests (3-objective optimization), GDPO reduces the bug ratio significantly from 13.2% to 3.9% compared to GRPO.

Changes Made

Core Algorithm: Updated verl/trainer/ppo/core_algos.py to implement GDPO advantage computation logic.

Ray Trainer: Modified verl/trainer/ppo/ray_trainer.py to support multi-reward data preparation and metrics logging.

Configuration: Added necessary parameters in verl/trainer/config/algorithm.py to enable GDPO-specific settings.
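For illustration, the GDPO-specific settings could look roughly like the sketch below. The field names gdpo_reward_keys and gdpo_reward_weights are taken from this PR's discussion; the dataclass layout, defaults, and the adv_estimator value are assumptions, not the actual contents of verl/trainer/config/algorithm.py.

```python
from dataclasses import dataclass, field

@dataclass
class AlgorithmConfig:
    # Hypothetical sketch; the real verl/trainer/config/algorithm.py may differ.
    adv_estimator: str = "gdpo"
    # Keys under which per-dimension rewards are stored (e.g. in non_tensor_batch).
    gdpo_reward_keys: list = field(default_factory=lambda: ["format", "correctness", "length"])
    # Weight applied to each reward dimension during aggregation; must align with the keys.
    gdpo_reward_weights: list = field(default_factory=lambda: [0.5, 1.0, 0.2])

cfg = AlgorithmConfig()
```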

Motivation

The current framework's limitations made it difficult to natively implement algorithms like GDPO that require complex or multiple reward function handling. This update modifies the core PPO trainer and Ray trainer to fully support GDPO.


CLAassistant commented Feb 26, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the GDPO algorithm. The changes span configuration, the core algorithm implementation, and integration into the Ray trainer, including data preparation and metrics logging. While the implementation is largely correct, I've identified a significant performance issue in the core GDPO advantage computation logic that should be addressed. My review includes a suggestion to vectorize a performance-critical loop.

Comment on lines +413 to +427
normalized = torch.zeros_like(reward_components)

id2indices = defaultdict(list)
for i in range(bs):
    id2indices[index[i]].append(i)

for group_id, indices in id2indices.items():
    idx_tensor = torch.tensor(indices, device=reward_components.device)
    if len(indices) == 1:
        normalized[indices[0]] = 0.0
    else:
        group_rewards = reward_components[idx_tensor]  # (group_size, n_rewards)
        group_mean = group_rewards.mean(dim=0)
        group_std = group_rewards.std(dim=0)
        normalized[idx_tensor] = (group_rewards - group_mean) / (group_std + epsilon)
Contributor

Severity: high

The current implementation for group-wise normalization iterates over each group using a Python loop. This can be a performance bottleneck, especially with a large number of groups or when running on GPU. This part of the computation should be vectorized to improve performance, similar to how compute_grpo_vectorized_outcome_advantage is implemented. A vectorized approach using torch.unique and scatter_add_ would be much more efficient.

Suggested change
normalized = torch.zeros_like(reward_components)
id2indices = defaultdict(list)
for i in range(bs):
    id2indices[index[i]].append(i)
for group_id, indices in id2indices.items():
    idx_tensor = torch.tensor(indices, device=reward_components.device)
    if len(indices) == 1:
        normalized[indices[0]] = 0.0
    else:
        group_rewards = reward_components[idx_tensor]  # (group_size, n_rewards)
        group_mean = group_rewards.mean(dim=0)
        group_std = group_rewards.std(dim=0)
        normalized[idx_tensor] = (group_rewards - group_mean) / (group_std + epsilon)
g = as_torch_index(index, device=reward_components.device)
unique_groups, group_indices, group_counts = torch.unique(g, return_inverse=True, return_counts=True)
group_sum = torch.zeros(
    (unique_groups.shape[0], n_rewards), device=g.device, dtype=reward_components.dtype
).scatter_add_(0, group_indices.unsqueeze(1).expand(-1, n_rewards), reward_components)
group_means_per_group = group_sum / group_counts.unsqueeze(1).clamp(min=1)
group_means = group_means_per_group[group_indices]
group_sum_sq = torch.zeros(
    (unique_groups.shape[0], n_rewards), device=g.device, dtype=reward_components.dtype
).scatter_add_(0, group_indices.unsqueeze(1).expand(-1, n_rewards), reward_components.pow(2))
group_means_sq_per_group = group_sum_sq / group_counts.unsqueeze(1).clamp(min=1)
group_vars_per_group = (group_means_sq_per_group - group_means_per_group.pow(2)).clamp(min=0)
group_stds_per_group = torch.sqrt(group_vars_per_group)
group_stds = group_stds_per_group[group_indices]
normalized = (reward_components - group_means) / (group_stds + epsilon)
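The suggested change computes each group's variance via the identity Var(x) = E[x²] − (E[x])², which is what the two scatter_add_ calls implement. Note one subtle semantic difference: the original loop uses torch.std's default unbiased (N−1) estimator, while the identity yields the population (N) standard deviation, so small groups will normalize slightly differently. A minimal pure-Python sketch of the same group-wise normalization (hypothetical helper, torch-free, for illustration only):

```python
from collections import defaultdict
import math

def groupwise_normalize(rewards, index, epsilon=1e-6):
    """Normalize each reward within its prompt group: (r - mean) / (std + eps).

    rewards: one reward dimension as a list of floats; index: group id per sample.
    Uses the population-variance identity E[x^2] - E[x]^2, matching the
    scatter_add_-based suggestion (not torch.std's unbiased estimator).
    """
    sums, sums_sq, counts = defaultdict(float), defaultdict(float), defaultdict(int)
    for r, g in zip(rewards, index):
        sums[g] += r
        sums_sq[g] += r * r
        counts[g] += 1
    out = []
    for r, g in zip(rewards, index):
        mean = sums[g] / counts[g]
        var = max(sums_sq[g] / counts[g] - mean * mean, 0.0)  # clamp fp noise at 0
        out.append((r - mean) / (math.sqrt(var) + epsilon))
    return out

# Two groups: "a" has rewards [1, 3], "b" has identical rewards [2, 2].
normed = groupwise_normalize([1.0, 3.0, 2.0, 2.0], ["a", "a", "b", "b"])
```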

@tardis-key
Collaborator

There are multiple scores in GDPO rewards, and they will be used in the advantage estimator later. The current implementation does not seem to align with this algorithm.

checkpoint_engine_config = omega_conf_to_dataclass(self.config.actor_rollout_ref.rollout.checkpoint_engine)
self.checkpoint_manager = CheckpointEngineManager(
    config=checkpoint_engine_config,
    backend=self.config.actor_rollout_ref.rollout.checkpoint_engine.backend,
Collaborator
Please rebase onto main first; this should not be changed.

Author

OK, got it.

Updated core_algos.py and ray_trainer.py to implement and integrate the GDPO algorithm into the training framework.
@yue-zeng-yue
Author

There are multiple scores in GDPO rewards, and they will be used in the advantage estimator later. The current implementation does not seem to align with this algorithm.

Thank you for the review! I’ve double-checked the code, and it seems to align with the underlying GDPO calculation logic. Could you please point out specifically which part doesn't match the algorithm? I’d appreciate more details so I can fix it properly.

@tongyx361 self-assigned this Feb 27, 2026
@tardis-key self-requested a review February 28, 2026 01:26
@tardis-key
Collaborator

In verl/trainer/ppo/core_algos.py: line 395:
reward_components: (bs, N_rewards) – per-sample scores for each reward dimension.

These sample-level rewards do not take prompt_mask or attention_mask into account, resulting in significant differences from the official GDPO implementation; see GDPO/verl-GDPO/verl/trainer/main_ppo.py, line 85:
score, fomrat_score, correctness_score, length_score = compute_score_fn(solution_str=sequences_str, ground_truth=ground_truth, step=step)

You can refer to #5422, which adapts reward_score.

@yue-zeng-yue
Author

In verl/trainer/ppo/core_algos.py: line 395: reward_components: (bs, N_rewards) – per-sample scores for each reward dimension.

These sample-level rewards do not take into account prompt_mask or attention_mask, resulting in significant differences compared to the official GDPO implementation, in GDPO/verl-GDPO/verl/trainer/main_ppo.py: line 85 score, fomrat_score, correctness_score, length_score = compute_score_fn(solution_str=sequences_str, ground_truth=ground_truth, step=step)

you can refer to #5422, which adapted reward_score

Thanks for the review and for referencing #5422.

To clarify, my compute_gdpo_outcome_advantage in core_algos.py is the advantage computation step — it receives pre-computed per-dimension rewards as input and implements the three GDPO steps (group-wise decoupled normalization → weighted aggregation → batch-level normalization), which is mathematically consistent with the paper.

I agree that my PR currently lacks the multi-reward scoring pipeline (reward manager + compute_score function) to produce those per-dimension rewards. My PR assumes they are already available in non_tensor_batch, and the GDPO-related fields in algorithm.py (i.e., gdpo_reward_keys and gdpo_reward_weights) only define configuration parameters without the actual reward computation logic.
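For reference, the three steps named above (group-wise decoupled normalization → weighted aggregation → batch-level normalization) can be sketched in plain Python. This is an illustrative, torch-free toy, not the actual compute_gdpo_outcome_advantage implementation; function and variable names are hypothetical.

```python
import math

def gdpo_advantages(reward_components, index, weights, epsilon=1e-6):
    """Sketch of the three GDPO steps (names illustrative):
    1. group-wise decoupled normalization of each reward dimension,
    2. weighted aggregation across dimensions,
    3. batch-level normalization of the aggregate.
    reward_components: list of per-sample [n_rewards] lists; index: group id per sample.
    """
    bs, n_rewards = len(reward_components), len(weights)
    # Step 1: normalize each dimension independently within its prompt group.
    normalized = [[0.0] * n_rewards for _ in range(bs)]
    groups = {}
    for i, g in enumerate(index):
        groups.setdefault(g, []).append(i)
    for d in range(n_rewards):
        for members in groups.values():
            vals = [reward_components[i][d] for i in members]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            for i in members:
                normalized[i][d] = (reward_components[i][d] - mean) / (math.sqrt(var) + epsilon)
    # Step 2: weighted sum over reward dimensions.
    agg = [sum(w * normalized[i][d] for d, w in enumerate(weights)) for i in range(bs)]
    # Step 3: batch-level normalization of the aggregated score.
    mean = sum(agg) / bs
    var = sum((a - mean) ** 2 for a in agg) / bs
    return [(a - mean) / (math.sqrt(var) + epsilon) for a in agg]

# Toy batch: two prompt groups ("a", "b"), two reward dimensions.
adv = gdpo_advantages([[1, 0], [3, 1], [2, 5], [2, 7]], ["a", "a", "b", "b"], [1.0, 0.5])
```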

@yue-zeng-yue
Author

In verl/trainer/ppo/core_algos.py: line 395: reward_components: (bs, N_rewards) – per-sample scores for each reward dimension.

These sample-level rewards do not take into account prompt_mask or attention_mask, resulting in significant differences compared to the official GDPO implementation, in GDPO/verl-GDPO/verl/trainer/main_ppo.py: line 85 score, fomrat_score, correctness_score, length_score = compute_score_fn(solution_str=sequences_str, ground_truth=ground_truth, step=step)

you can refer to #5422, which adapted reward_score

My design intention was to keep the advantage computation decoupled from any specific reward function, so that users can plug in their own compute_score that returns a dict with custom reward keys.
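As a concrete illustration of that plug-in design, a user-defined compute_score could return a dict keyed by custom reward names matching gdpo_reward_keys in the config. This is a hypothetical sketch, not an interface this PR ships:

```python
def compute_score(solution_str: str, ground_truth: str) -> dict:
    """Hypothetical multi-objective scorer returning one value per reward key.
    The keys would need to match gdpo_reward_keys in the algorithm config."""
    format_ok = solution_str.strip().endswith("</answer>")
    correct = ground_truth in solution_str
    return {
        "format": 1.0 if format_ok else 0.0,
        "correctness": 1.0 if correct else 0.0,
        "length": -len(solution_str) / 1000.0,  # shorter responses score higher
    }

scores = compute_score("The answer is 42 </answer>", "42")
```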

