Skip to content

Conversation

@jianjunzhong
Copy link
Contributor

@jianjunzhong jianjunzhong commented Jan 21, 2026

What does this PR do?

If use_gpu is True, then use get_device name() for automatic device detection in RayWorkerGroup and ResourcePool instead of passing the device_name parameter.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

…ead of by parameter passing

Signed-off-by: jianjunzhong <jianjunzhong@foxmail.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The code changes involve removing the device_name parameter from the get_placement_groups method in RayResourcePool and updating the call sites in the tests and RayWorkerGroup to reflect this change. The code now uses get_resource_name() and get_device_name() to dynamically determine the device name based on available hardware. The device_name parameter in the __call__ method of RayClassWithInitArgs is replaced with **kwargs. Review comments emphasize the importance of verifying that placement groups are correctly retrieved without explicitly passing the device name, ensuring get_resource_name() and get_device_name() consistently return the correct resource and device names, and preventing unexpected keyword arguments from being passed to the __call__ method.

I am having trouble creating individual review comments. Click here to see my feedback.

verl/single_controller/ray/base.py (115)

critical

The device_name parameter is removed from the get_placement_groups method. Ensure that all call sites are updated to reflect this change. It's crucial to verify that the device name is correctly being retrieved and used within the function, especially when use_gpu is True.

def get_placement_groups(self, strategy="STRICT_PACK", name=None):

verl/single_controller/ray/base.py (249-252)

critical

The device_name parameter is removed from the get_placement_groups method call. Verify that the placement groups are still correctly retrieved without explicitly passing the device name. This is critical for ensuring that the correct resources are allocated.

    merged = type(rp1)(new_store, rp1.use_gpu, f"{rp1.name_prefix}_{rp2.name_prefix}")
    merged.pgs = rp1.get_placement_groups() + rp2.get_placement_groups()

    return merged

verl/single_controller/ray/base.py (469)

critical

The device_name parameter is removed from the get_placement_groups method call. Verify that the placement groups are still correctly retrieved without explicitly passing the device name. This is critical for ensuring that the correct resources are allocated.

        pgs = resource_pool.get_placement_groups(strategy=strategy)

verl/single_controller/ray/base.py (505)

critical

The device_name parameter is removed from the get_placement_groups method call. Verify that the placement groups are still correctly retrieved without explicitly passing the device name. This is critical for ensuring that the correct resources are allocated.

        pgs = resource_pool.get_placement_groups(strategy=strategy)

verl/single_controller/ray/base.py (123-128)

high

The code now uses get_resource_name() to determine the device name. It's important to ensure that get_resource_name() consistently returns the correct resource name ('GPU' or 'NPU') based on the available hardware. Incorrect resource names could lead to resource allocation failures or incorrect device usage.

verl/single_controller/ray/base.py (288)

high

The device_name parameter is removed and replaced with **kwargs. It's important to ensure that no unexpected keyword arguments are passed to the function, which could lead to errors or unexpected behavior. Consider adding input validation or filtering to prevent this.

verl/single_controller/ray/base.py (316-321)

high

The code now uses get_device_name() to determine the device name within the __call__ method. It's crucial to ensure that get_device_name() consistently returns the correct device name ('cuda' or 'npu') based on the available hardware. Incorrect device names could lead to resource allocation failures or incorrect device usage.

Signed-off-by: jianjunzhong <jianjunzhong@foxmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants