Bug Report: filter_overlong_prompts Fails for Multimodal Data #5004
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Bug Report:
filter_overlong_promptsFails for Multimodal Data🐛 Problem Summary
When using
filter_overlong_prompts=Truewith multimodal datasets (containing images/videos), the filtering mechanism completely fails to correctly calculate prompt lengths, causing:RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 2048 but got size 2597🔍 Root Cause Analysis
Two-Level Bug in
verl/utils/dataset/rl_dataset.pyLevel 1: Logic Error with
pop()Side EffectIn
_build_messages(), the code usesexample.pop()to extract images/videos:Level 2: Broken Conditional Check
In
maybe_filter_out_long_prompts(), the code checks for images after they've been deleted:Result: Severe Length Underestimation
📊 Impact Assessment
Affected Scenarios
filter_overlong_prompts=TrueSeverity
Example Numerical Impact
For a multimodal sample with:
🔧 Fix Implementation
Solution: Unified Vision Processing
Replace
process_image()fromverl.utils.dataset.vision_utilswithprocess_vision_info()fromqwen_vl_utilsto match Agent Loop's behavior.Key Changes in
maybe_filter_out_long_prompts()Why This Fix Works
process_vision_info(messages)extracts images/videos frommessages.content, not fromdoc[image_key]pop()issue because images are already embedded in messagesimage_patch_size,video_metadatas,return_tensors,do_sample_frames🧪 Verification
Test Setup
filter_overlong_prompts=True,max_prompt_length=2048Results
RuntimeError: Expected size 2048 but got size 2597Log Evidence
Before Fix:
After Fix:
🎯 Benefits of This Fix
filter_overlong_promptsworks as intended📝 Additional Notes
image_patch_sizecalculation with the processor's actual configurationRelated Files:
verl/utils/dataset/rl_dataset.py(fixed)verl/experimental/agent_loop/agent_loop.py(reference implementation)