Commit d6bc629
[fsdp] feat: integrate PrefixGrouper for GRPO training acceleration (#4368)
### What does this PR do?

Integrate [PrefixGrouper](https://github.com/johncaged/PrefixGrouper) into verl's FSDP worker to accelerate GRPO training by reducing redundant prefix computation. In GRPO training, each prompt is copied `G` times (`rollout.n`), leading to redundant self-attention computation on shared prefixes. PrefixGrouper decomposes this into **prefix self-attention + suffix concat-attention**, significantly reducing computation and memory usage.

**Key changes:**

- Add `use_prefix_grouper` config option in `ActorConfig`
- Implement the PG forward path in `DataParallelPPOActor._forward_micro_batch`
- Add utility functions in `verl/trainer/ppo/prefix_grouper_utils.py`
- Add example scripts and documentation in `examples/prefix_grouper_examples/`

### Test

**Benchmark Results** (Qwen3-4B, 4×H800, `rollout.n=4`):

| Context Length | Metric | PG | No PG | Speedup |
|----------------|--------|-----|-------|---------|
| **4K** | `old_log_prob` | 1.31s | 1.70s | **1.30x** |
| | `update_actor` | 4.80s | 6.07s | **1.26x** |
| | `step` | 17.08s | 19.40s | **1.14x** |
| **8K** | `old_log_prob` | 1.69s | 2.63s | **1.56x** |
| | `update_actor` | 5.98s | 10.18s | **1.70x** |
| | `step` | 19.48s | 24.71s | **1.27x** |

<img width="2234" height="1475" alt="timing_comparison_combined" src="https://github.com/user-attachments/assets/3ef5dc69-1b3a-46d7-9d60-608a3fdc56f5" />

As context length increases, the speedup becomes more pronounced.

### API and Usage Example

```python
# Enable PrefixGrouper in training config
actor_rollout_ref.actor.use_prefix_grouper=True
trainer.balance_batch=False  # Required: PG is incompatible with balance_batch
actor_rollout_ref.model.use_remove_padding=False  # Required: PG is incompatible with remove_padding
```

```bash
# Run example script
bash examples/prefix_grouper_examples/run_qwen3_pg.sh
```

### Design & Code Changes

**High-level design:** PrefixGrouper optimizes GRPO training by avoiding redundant computation on shared prefixes. When `rollout.n > 1`, multiple responses share the same prompt, but standard attention computes the prefix `n` times. PrefixGrouper decomposes this into:

1. **Prefix self-attention**: computed once per unique prompt
2. **Suffix concat-attention**: each response attends to the shared prefix output

**Code changes:**

| File | Change |
|------|--------|
| `verl/workers/config/actor.py` | Add `use_prefix_grouper: bool = False` config option |
| `verl/trainer/config/actor/actor.yaml` | Add `use_prefix_grouper: false` default config |
| `verl/workers/actor/dp_actor.py` | (1) Add `self.use_prefix_grouper` and `self.use_dynamic_bsz` attributes in `__init__`; (2) add the PG forward path in `_forward_micro_batch` with lazy import and incompatibility checks; (3) select extra keys (`prompts`, `response_mask`, `uid`) for PG in `compute_log_prob`; (4) select extra keys (`prompts`, `uid`) for PG in `update_policy` |
| `verl/trainer/ppo/prefix_grouper_utils.py` | New file with `build_position_ids_for_prefix_grouper()` for position encoding, `build_pg_from_micro_batch()` to construct a PrefixGrouper from a micro batch, and `pg_forward()` to execute the PG-optimized forward pass |
| `verl/workers/fsdp_workers.py` | Sync the `use_prefix_grouper` config from actor to ref policy in `init_model` so both use the same forward path |
| `verl/trainer/ppo/ray_trainer.py` | Add a `ValueError` check for the `use_prefix_grouper` + `balance_batch` incompatibility at initialization |
| `examples/prefix_grouper_examples/` | New directory with `README.md` documentation, `run_qwen3_prefix_grouper.sh` example script, and `qwen3/modeling_qwen3.py`, a modified model supporting PrefixGrouper |

### Limitations

- **FSDP worker only**: the Megatron worker is not supported yet
- **Incompatible configurations:**
  - `use_dynamic_bsz=True`
  - `use_remove_padding=True` (Flash Attention V2 variable length)
  - `use_fused_kernels=True`
  - `use_ulysses_sp=True` (Ulysses sequence parallelism)
- **Model modification required**: the model must accept a `prefix_grouper` argument in its `forward` method

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / update [the documentation](https://github.com/volcengine/verl/tree/main/docs). (Added `examples/prefix_grouper_examples/README.md`)
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: PrefixGrouper requires modified model files and a specific hardware setup; tested manually with the benchmark results above.
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1).
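The computational saving from the shared-prefix decomposition can be estimated with a back-of-the-envelope attention-cost model. The sketch below is illustrative only (it is not verl or PrefixGrouper code, and it counts only attention-score work, ignoring MLPs and constant factors):

```python
# Rough attention-work model for grouped-prefix GRPO training.
# All formulas are simplifying assumptions, not measurements.

def attention_cost_standard(prefix_len: int, suffix_len: int, group_size: int) -> int:
    """Each of the G prompt copies runs full self-attention over prefix + suffix."""
    total = prefix_len + suffix_len
    return group_size * total * total

def attention_cost_prefix_grouper(prefix_len: int, suffix_len: int, group_size: int) -> int:
    """Prefix self-attention once, then each suffix attends over prefix + suffix."""
    return prefix_len * prefix_len + group_size * suffix_len * (prefix_len + suffix_len)

if __name__ == "__main__":
    P, S, G = 4096, 1024, 4  # long shared prompt, shorter responses, rollout.n = 4
    std = attention_cost_standard(P, S, G)
    pg = attention_cost_prefix_grouper(P, S, G)
    print(f"attention-score work ratio: {std / pg:.2f}x")
```

Under this model the ratio grows as the prefix dominates the sequence, which is consistent with the benchmark trend of larger speedups at longer context lengths.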
1 parent 718523c commit d6bc629

File tree

13 files changed: +656 −11 lines changed

examples/prefix_grouper/README.md

Lines changed: 85 additions & 0 deletions

# PrefixGrouper Examples

This directory contains examples for using **PrefixGrouper**, an optimization technique that groups samples by shared prompts to reduce redundant computation in GRPO.

## Introduction

> Official Repository: [https://github.com/johncaged/PrefixGrouper](https://github.com/johncaged/PrefixGrouper)

``PrefixGrouper`` is a plug-and-play efficient GRPO training tool that requires minimal modifications to existing codebases to achieve reduced computation, lower device memory consumption, and accelerated training.

In current mainstream GRPO training pipelines, policy model training copies each prefix (typically questions, multimodal inputs, etc.) `G` times. Consequently, when training-data prefixes are sufficiently long (e.g., long-context reasoning, image/long-video inference), the redundant computation during training becomes non-negligible.

**PrefixGrouper** decomposes the original redundant self-attention operation into prefix self-attention + suffix concat-attention.

<h3 align="center">
<img src="https://raw.githubusercontent.com/johncaged/PrefixGrouper/main/assets/images/method.jpg">
</h3>

## Installation

```bash
pip install prefix_grouper
```

## Limitations

- Currently only supports the FSDP worker (the Megatron worker is not supported yet).
- Incompatible with `use_dynamic_bsz=True`.
- Incompatible with `use_remove_padding=True` (Flash Attention V2 variable length).
- Incompatible with `use_fused_kernels=True`.
- Incompatible with Ulysses sequence parallelism (`use_ulysses_sp=True`) and ring attention.

Note: `balance_batch=True` is now supported with group-level balancing, which keeps samples with the same uid together on the same rank. However, this requires `batch_size % (world_size * rollout.n) == 0`. For example, with `world_size=8` and `rollout.n=4`, `batch_size` must be a multiple of 32.

## How to Use

### 1. Enable PrefixGrouper in Config

Simply set `use_prefix_grouper=True` in your training config:

```yaml
actor_rollout_ref:
  actor:
    use_prefix_grouper: True
  model:
    use_remove_padding: False
```

Optionally enable `balance_batch` for better load distribution:

```yaml
trainer:
  balance_batch: True  # Now supported with group-level balancing
```

### 2. Run Training

Use the provided script `run_qwen3_prefix_grouper.sh` as an example:

```bash
bash examples/prefix_grouper/run_qwen3_prefix_grouper.sh
```

## How It Works

When `use_prefix_grouper=True`, verl automatically patches the attention functions in `transformers.modeling_utils.ALL_ATTENTION_FUNCTIONS` to support the `prefix_grouper` parameter. No model code modifications are needed.

The patch wraps each attention function to:

1. Extract `prefix_grouper` from kwargs
2. If `prefix_grouper` is None, call the original attention function
3. If `prefix_grouper` is provided, use PrefixGrouper's optimized attention computation

## Performance

**Benchmark Results** (Qwen3-4B, 4×H800, `rollout.n=4`):

| Context Length | Metric | PG | No PG | Speedup |
|----------------|--------|-----|-------|---------|
| **4K** | `old_log_prob` | 1.31s | 1.70s | **1.30x** |
| | `update_actor` | 4.80s | 6.07s | **1.26x** |
| | `step` | 17.08s | 19.40s | **1.14x** |
| **8K** | `old_log_prob` | 1.69s | 2.63s | **1.56x** |
| | `update_actor` | 5.98s | 10.18s | **1.70x** |
| | `step` | 19.48s | 24.71s | **1.27x** |

As context length increases, the speedup becomes more pronounced.
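One detail the PR body mentions is position encoding for the grouped layout (`build_position_ids_for_prefix_grouper`). The sketch below is an assumption about how such position ids could be laid out for one shared prefix followed by packed suffixes; the real helper's signature and packing scheme may differ:

```python
# Illustrative position-id construction for a grouped forward pass.
# Assumption: the prefix appears once, and each suffix continues from
# position P as if it immediately followed the prefix.

def grouped_position_ids(prefix_len: int, suffix_lens: list[int]) -> list[int]:
    """Positions 0..P-1 for the shared prefix, then P..P+S_i-1 per suffix."""
    ids = list(range(prefix_len))
    for s in suffix_lens:
        ids.extend(range(prefix_len, prefix_len + s))
    return ids

# A 3-token prefix shared by two responses of lengths 2 and 3:
print(grouped_position_ids(3, [2, 3]))  # [0, 1, 2, 3, 4, 3, 4, 5]
```

This keeps rotary/absolute position embeddings for each response identical to what they would be in the standard per-copy layout, which is what makes the decomposition a drop-in replacement.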
examples/prefix_grouper/run_qwen3_prefix_grouper.sh

Lines changed: 43 additions & 0 deletions
```bash
set -x

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=1024 \
    data.max_prompt_length=512 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen3-8B \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=False \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.use_prefix_grouper=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger='["console","wandb"]' \
    trainer.project_name='verl_grpo_example_gsm8k' \
    trainer.experiment_name='qwen3_function_rm_pg' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=5 \
    trainer.balance_batch=True \
    trainer.total_epochs=15 $@
```

tests/utils/test_seqlen_balancing.py

Lines changed: 76 additions & 0 deletions

New tests appended after `test_seqlen_balancing_distributed_params`:

```python
def test_group_balanced_partitions():
    """Test group-level balancing keeps same-uid samples together."""
    from verl.utils.seqlen_balancing import get_group_balanced_partitions

    # Create test data: 4 groups with different sizes
    # Group 0 (uid=0): indices 0,1,2,3 with seqlens [100, 100, 100, 100]
    # Group 1 (uid=1): indices 4,5,6,7 with seqlens [200, 200, 200, 200]
    # Group 2 (uid=2): indices 8,9,10,11 with seqlens [150, 150, 150, 150]
    # Group 3 (uid=3): indices 12,13,14,15 with seqlens [50, 50, 50, 50]
    seqlen_list = [100] * 4 + [200] * 4 + [150] * 4 + [50] * 4
    uid_list = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4

    # Partition into 2 groups
    partitions = get_group_balanced_partitions(seqlen_list, uid_list, k_partitions=2)

    assert len(partitions) == 2

    # Verify all indices are covered
    all_indices = set()
    for partition in partitions:
        all_indices.update(partition)
    assert all_indices == set(range(16))

    # Verify same-uid samples stay together
    for partition in partitions:
        uids_in_partition = set(uid_list[i] for i in partition)
        for uid in uids_in_partition:
            # All samples with this uid should be in this partition
            uid_indices = [i for i, u in enumerate(uid_list) if u == uid]
            assert all(i in partition for i in uid_indices), f"uid {uid} samples split across partitions"


def test_group_balanced_partitions_single_sample_groups():
    """Test group balancing with single-sample groups (n=1)."""
    from verl.utils.seqlen_balancing import get_group_balanced_partitions

    # Each sample is its own group
    seqlen_list = [100, 200, 150, 50, 300, 250]
    uid_list = [0, 1, 2, 3, 4, 5]

    partitions = get_group_balanced_partitions(seqlen_list, uid_list, k_partitions=2)

    assert len(partitions) == 2
    all_indices = set()
    for partition in partitions:
        all_indices.update(partition)
    assert all_indices == set(range(6))


def test_group_balanced_partitions_equal_size():
    """Test group balancing with equal_size constraint simulation."""
    from verl.utils.seqlen_balancing import get_group_balanced_partitions

    # 8 groups of 2 samples each, partitioned into 4 (simulating world_size=4)
    seqlen_list = [100, 100, 200, 200, 150, 150, 50, 50, 300, 300, 250, 250, 180, 180, 120, 120]
    uid_list = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7]

    partitions = get_group_balanced_partitions(seqlen_list, uid_list, k_partitions=4)

    assert len(partitions) == 4

    # Verify all indices are covered
    all_indices = set()
    for partition in partitions:
        all_indices.update(partition)
    assert all_indices == set(range(16))

    # Verify same-uid samples stay together
    for partition in partitions:
        uids_in_partition = set(uid_list[i] for i in partition)
        for uid in uids_in_partition:
            uid_indices = [i for i, u in enumerate(uid_list) if u == uid]
            assert all(i in partition for i in uid_indices)
```
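The behavior these tests pin down — whole uid-groups assigned to partitions with roughly balanced total sequence length — can be satisfied by a simple greedy scheme. The sketch below is an assumed minimal implementation, not verl's actual `get_group_balanced_partitions`:

```python
# Greedy longest-processing-time partitioning at group granularity:
# group samples by uid, then assign whole groups (heaviest first) to the
# currently lightest partition, so same-uid samples never split.
from collections import defaultdict

def group_balanced_partitions(seqlen_list, uid_list, k_partitions):
    groups = defaultdict(list)
    for idx, uid in enumerate(uid_list):
        groups[uid].append(idx)
    # Heaviest groups first gives better balance under the greedy rule
    order = sorted(groups.values(),
                   key=lambda idxs: sum(seqlen_list[i] for i in idxs),
                   reverse=True)
    partitions = [[] for _ in range(k_partitions)]
    loads = [0] * k_partitions
    for idxs in order:
        target = loads.index(min(loads))  # lightest partition so far
        partitions[target].extend(idxs)
        loads[target] += sum(seqlen_list[i] for i in idxs)
    return partitions
```

Because assignment happens at group granularity, the `batch_size % (world_size * rollout.n) == 0` requirement from the README falls out naturally: only then can every rank receive the same number of whole groups.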

verl/models/transformers/monkey_patch.py

Lines changed: 43 additions & 1 deletion

New module-level patch machinery:

```python
_PREFIX_GROUPER_PATCHED = False
_PREFIX_GROUPER_SUPPORTED_ATTENTIONS = {"flash_attention_2", "flash_attention_3", "sdpa", "flex_attention", "eager"}


def _create_prefix_grouper_wrapper(original_fn):
    """Wrap attention function to support prefix_grouper in kwargs."""

    def wrapped(module, query, key, value, attention_mask, *args, **kwargs):
        prefix_grouper = kwargs.pop("prefix_grouper", None)
        if prefix_grouper is None:
            return original_fn(module, query, key, value, attention_mask, *args, **kwargs)

        def attn_func(q, k, v, attn_mask, *inner_args, **inner_kwargs):
            out, _ = original_fn(module, q, k, v, attn_mask, *inner_args, **inner_kwargs)
            return out

        return prefix_grouper.forward(attn_func, query, key, value, *args, **kwargs), None

    return wrapped


def apply_prefix_grouper_patch():
    """Patch ALL_ATTENTION_FUNCTIONS to support the prefix_grouper parameter."""
    global _PREFIX_GROUPER_PATCHED
    if _PREFIX_GROUPER_PATCHED:
        return

    from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

    patched = []
    for name in list(ALL_ATTENTION_FUNCTIONS.keys()):
        if name in _PREFIX_GROUPER_SUPPORTED_ATTENTIONS:
            ALL_ATTENTION_FUNCTIONS[name] = _create_prefix_grouper_wrapper(ALL_ATTENTION_FUNCTIONS[name])
            patched.append(name)

    _PREFIX_GROUPER_PATCHED = True
    print(f"[PrefixGrouper] Patched: {patched}")
```

`apply_monkey_patch` gains a `use_prefix_grouper: bool = False` parameter, and its docstring now reads "Apply monkey patch to the models for ulysses sequence parallel, fused kernel, tiled MLP and prefix grouper." Inside the function, after the `apply_tiled_mlp_monkey_patch(...)` call:

```python
    # Apply PrefixGrouper patch if enabled
    if use_prefix_grouper:
        apply_prefix_grouper_patch()
```
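The dispatch pattern above — wrapping every callable in a registry so an optional kwarg reroutes the call — can be demonstrated without torch or transformers. The snippet below is a toy stand-in (fake registry, fake grouper), not verl code:

```python
# Toy demonstration of the registry-wrapping dispatch used by the patch:
# a popped `prefix_grouper` kwarg decides between the original kernel and
# the grouped path, which itself delegates back to the wrapped kernel.

calls = []

def eager_attention(module, q, k, v, mask, **kwargs):
    calls.append("eager")
    return ("attn_out", "attn_weights")  # (output, weights), like HF kernels

class FakePrefixGrouper:
    def forward(self, attn_func, q, k, v, **kwargs):
        calls.append("grouped")
        return attn_func(q, k, v, None)  # delegate to the wrapped kernel

def make_wrapper(original_fn):
    def wrapped(module, q, k, v, mask, **kwargs):
        pg = kwargs.pop("prefix_grouper", None)
        if pg is None:
            return original_fn(module, q, k, v, mask, **kwargs)

        def attn_func(q, k, v, mask, **inner):
            out, _ = original_fn(module, q, k, v, mask, **inner)
            return out

        return pg.forward(attn_func, q, k, v, **kwargs), None
    return wrapped

registry = {"eager": eager_attention}
registry["eager"] = make_wrapper(registry["eager"])

registry["eager"](None, 1, 2, 3, None)                                       # normal path
registry["eager"](None, 1, 2, 3, None, prefix_grouper=FakePrefixGrouper())   # grouped path
print(calls)  # ['eager', 'grouped', 'eager']
```

Because the wrapper preserves the `(output, weights)` return shape and pops its extra kwarg, callers that never pass `prefix_grouper` are unaffected — which is why the patch can be applied unconditionally to every supported attention backend.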

verl/trainer/config/_generated_ppo_megatron_trainer.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -85,6 +85,7 @@ actor_rollout_ref:
     entropy_coeff: 0
     calculate_entropy: false
     use_kl_loss: false
+    use_prefix_grouper: false
     use_torch_compile: true
     kl_loss_coef: 0.001
     kl_loss_type: low_var_kl
```

verl/trainer/config/_generated_ppo_trainer.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -72,6 +72,7 @@ actor_rollout_ref:
     entropy_coeff: 0
     calculate_entropy: false
     use_kl_loss: false
+    use_prefix_grouper: false
    use_torch_compile: true
     kl_loss_coef: 0.001
     kl_loss_type: low_var_kl
```

verl/trainer/config/actor/actor.yaml

Lines changed: 3 additions & 0 deletions

```diff
@@ -94,6 +94,9 @@ calculate_entropy: false
 # Whether to use KL loss instead of KL reward penalty. True for GRPO
 use_kl_loss: false
 
+# Whether to enable PrefixGrouper shared-prefix forward
+use_prefix_grouper: false
+
 # Whether to use torch.compile()
 # oc.select: the default val for ref.use_torch_compile
 use_torch_compile: true
```
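The PR's Limitations section lists the option combinations that must be rejected when `use_prefix_grouper` is enabled. A standalone sketch of such a sanity check is below; it is hypothetical (flat dict instead of verl's Hydra config objects, and the real checks live in `dp_actor.py` and `ray_trainer.py`):

```python
# Hypothetical standalone version of the incompatibility checks the PR
# describes; key names mirror the config options, the function is illustrative.

def validate_prefix_grouper_config(cfg: dict) -> None:
    """Raise ValueError if use_prefix_grouper is combined with an option
    listed as incompatible in the PR's Limitations section."""
    if not cfg.get("use_prefix_grouper", False):
        return
    incompatible = ("use_dynamic_bsz", "use_remove_padding",
                    "use_fused_kernels", "use_ulysses_sp")
    for name in incompatible:
        if cfg.get(name, False):
            raise ValueError(f"use_prefix_grouper is incompatible with {name}=True")

# A valid PG config passes silently:
validate_prefix_grouper_config({"use_prefix_grouper": True, "use_remove_padding": False})
```

Failing fast at initialization, as the PR does with its `ValueError` in `ray_trainer.py`, surfaces misconfiguration before any GPU work starts.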
