[WIP][algo] Migrate and implement the GDPO algorithm into the existing framework.#5422
Rhetee wants to merge 2 commits into verl-project:main
Conversation
Code Review
This pull request introduces the GDPO algorithm and its associated reward calculation logic. My review focuses on improving code correctness, robustness, and maintainability. I've identified a critical bug in the reward manager that would lead to a crash, a typo in a class name, and several areas for improvement in the new reward scoring script, including fragile parsing logic, use of print statements, and a variable name typo. Addressing these points will make the new implementation more robust and easier to maintain.
```python
if isinstance(result, dict):
    score = result["score"]
    for key, value in result.items():
        reward_extra_info[key] = value
else:
    score = result
    reward_extra_info["acc"] = score
```
The current logic for processing the result from compute_score is incorrect when the function returns a tuple, as verl.utils.reward_score.rlla.compute_score does. This will assign a tuple to the score variable, which will cause a runtime error later when it's expected to be a float (e.g., in torch.tensor(scores)). You need to handle the tuple case explicitly to correctly extract the score and other reward components.
Suggested change:

```diff
-if isinstance(result, dict):
-    score = result["score"]
-    for key, value in result.items():
-        reward_extra_info[key] = value
-else:
-    score = result
-    reward_extra_info["acc"] = score
+if isinstance(result, dict):
+    score = result["score"]
+    reward_extra_info.update(result)
+elif isinstance(result, tuple) and len(result) == 4:
+    score, format_score, correctness_score, length_score = result
+    reward_extra_info["score"] = score
+    reward_extra_info["format_score"] = format_score
+    reward_extra_info["correctness_score"] = correctness_score
+    reward_extra_info["length_score"] = length_score
+else:
+    score = result
+    reward_extra_info["acc"] = score
```
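To make this normalization easy to unit-test, it could also be factored into a small helper. This is a sketch only; the name `normalize_reward_result` is illustrative and not part of the PR:

```python
def normalize_reward_result(result):
    """Normalize a compute_score result (dict, 4-tuple, or scalar)
    into (score, extra_info) so downstream code always gets a float.

    Sketch only: mirrors the suggested branching above.
    """
    extra_info = {}
    if isinstance(result, dict):
        score = result["score"]
        extra_info.update(result)
    elif isinstance(result, tuple) and len(result) == 4:
        # Matches the 4-tuple returned by compute_score-style functions.
        score, format_score, correctness_score, length_score = result
        extra_info.update(
            score=score,
            format_score=format_score,
            correctness_score=correctness_score,
            length_score=length_score,
        )
    else:
        score = result
        extra_info["acc"] = score
    return float(score), extra_info
```

A helper like this keeps the reward manager loop free of type-dispatch logic and makes the tuple contract explicit in one place.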
```python
@register("gdpo")
class GDPOdRewardManager(RewardManagerBase):
```
There appears to be a typo in the class name GDPOdRewardManager. It should likely be GDPORewardManager to align with the algorithm name ("gdpo") it's registered for. This will improve clarity and prevent confusion.
Suggested change:

```diff
-class GDPOdRewardManager(RewardManagerBase):
+class GDPORewardManager(RewardManagerBase):
```
```python
    return 1.0


if os.getenv("REFINEDREWARD", 0) == "1":
    print("REFINEDREWARD is set to 1, so strict match is used")
```
This file contains many print statements, likely for debugging. In a library, this is considered bad practice as it pollutes stdout and makes output difficult to control. Please replace these with a proper logging framework, such as Python's built-in logging module. This will allow for configurable log levels (e.g., DEBUG, INFO) and integrate better with the application's overall logging strategy.
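A minimal sketch of what this could look like with the `logging` module. The logger name, the env-var-driven level, and the function name `use_strict_match` are illustrative assumptions, not from the PR:

```python
import logging
import os

# Module-level logger in place of print(); the env-var-controlled level
# is an illustrative choice, not an existing verl convention asserted here.
logger = logging.getLogger(__name__)
logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARNING"))


def use_strict_match() -> bool:
    """Sketch: the REFINEDREWARD check with a debug log instead of print."""
    if os.getenv("REFINEDREWARD", "0") == "1":
        logger.debug("REFINEDREWARD is set to 1, so strict match is used")
        return True
    return False
```

With this pattern the messages disappear from stdout by default and can be re-enabled per-module via the standard logging configuration.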
```python
exp_name = str(os.getenv("EXPERIMENT_NAME", ""))
if "llama" in exp_name:
    predict_str = (
        solution_str.split("<|start_header_id|>assistant<|end_header_id|>")[-1].split("<|eot_id|>")[0].strip()
    )
elif "qwen" in exp_name:
    predict_str = solution_str.split("<|im_start|>assistant")[-1].split("<|im_end|>")[0].strip()
else:
    raise NotImplementedError(f"Unknown model name: {exp_name}")
```
The logic for parsing the solution_str relies on hardcoded model name checks within an environment variable (EXPERIMENT_NAME). This approach is fragile and not scalable. It will break if the model name in the environment variable changes, or for new models not explicitly handled here. A more robust solution would be to either pass the tokenizer to this function to properly decode the response, or to perform the parsing in the calling context (which has the tokenizer) and pass the clean response string. Relying on string splitting with model-specific tokens is not a maintainable practice.
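If raw-string parsing must stay as a fallback, one intermediate improvement is to key the chat markers on an explicit model family argument instead of substring checks against `EXPERIMENT_NAME`. This is a sketch under that assumption; `CHAT_MARKERS` and `extract_assistant_reply` are hypothetical names, and decoding the response token ids in the caller remains the preferred fix:

```python
# Explicit marker registry per model family (illustrative, not from the PR).
CHAT_MARKERS = {
    "llama": ("<|start_header_id|>assistant<|end_header_id|>", "<|eot_id|>"),
    "qwen": ("<|im_start|>assistant", "<|im_end|>"),
}


def extract_assistant_reply(solution_str: str, model_family: str) -> str:
    """Fallback string-based extraction of the assistant turn.

    Prefer decoding only the response token ids in the caller; this
    helper merely makes the marker lookup explicit and extensible.
    """
    try:
        start, end = CHAT_MARKERS[model_family]
    except KeyError:
        raise NotImplementedError(f"Unknown model family: {model_family}")
    return solution_str.split(start)[-1].split(end)[0].strip()
```

Adding support for a new model then means one registry entry rather than another `elif` branch guessing at experiment-name substrings.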
```python
completions = [[{"role": "assistant", "content": predict_str}]]
answer = [ground_truth]

fomrat_score = customize_format_reward_func(completions, answer, step, format_max_possible, format_min_possible)[0]
```
There is a typo in the variable name fomrat_score. It should be format_score. This typo is used consistently within the compute_score function, which harms readability and maintainability. Please correct it here and in its other usages within this function.
Suggested change:

```diff
-fomrat_score = customize_format_reward_func(completions, answer, step, format_max_possible, format_min_possible)[0]
+format_score = customize_format_reward_func(completions, answer, step, format_max_possible, format_min_possible)[0]
```
What does this PR do?
This PR implements the GDPO algorithm from the original paper and adds the necessary logic, such as multi-reward scoring, to reproduce the paper's results within the existing verl framework.
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`, like `[megatron, fsdp, doc]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - If this PR breaks any API, prepend `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
WIP
API and Usage Example
```python
# Add code snippet or script demonstrating how to use this
```

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.