Enable RTX Pro 6000 Blackwell runners for CI/CD #944
kevalmorabia97 wants to merge 2 commits into main from
Conversation
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 Walkthrough

The pull request refactors two GitHub Actions workflow files to introduce shared YAML anchors for GPU test strategies and consolidate runner configurations. Updates include adopting reusable matrix definitions, changing runner images from L4/H100 variants to RTX Pro 6000 variants, and adding container environment variables to GPU test jobs.
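The shared-anchor pattern the walkthrough describes could look roughly like the sketch below. This is hypothetical: the job names, container image, and env var are illustrative, and only the runner labels and matrix values come from this review thread.

```yaml
# Hypothetical sketch (not the actual workflow): a test matrix defined once
# via a YAML anchor at its first use, then reused with an alias.
jobs:
  gpu-tests-pr:
    runs-on: linux-amd64-gpu-rtxpro6000-latest-1   # RTX Pro 6000 label quoted in this review
    strategy: &gpu-test-strategy                   # anchor: shared GPU test strategy
      fail-fast: false
      matrix:
        test: [cuda13-gpu, cuda13-gpu-megatron, cuda13-gpu-trtllm]
    container:
      image: some-registry/ci-image:latest         # placeholder image
      env:
        EXAMPLE_VAR: "1"                           # illustrative container env var
  gpu-tests-nightly:
    runs-on: linux-amd64-gpu-rtxpro6000-latest-2
    strategy: *gpu-test-strategy                   # alias: same matrix reused
```

Defining the strategy once and aliasing it keeps the two runner variants from drifting apart when the matrix changes.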
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 3 passed
🧹 Nitpick comments (1)
.github/workflows/gpu_tests.yml (1)
70-70: Custom runner labels are valid for self-hosted runners — consider adding actionlint config.

The static analysis warnings about unknown runner labels (`linux-amd64-gpu-rtxpro6000-latest-1`, `linux-amd64-gpu-rtxpro6000-latest-2`) are false positives. These are custom labels for NVIDIA's self-hosted GPU runners.

To suppress these warnings in future CI runs, consider adding an `.github/actionlint.yaml` config file:

```yaml
self-hosted-runner:
  labels:
    - linux-amd64-gpu-rtxpro6000-latest-1
    - linux-amd64-gpu-rtxpro6000-latest-2
    - linux-amd64-gpu-h100-latest-1
    - linux-amd64-gpu-l4-latest-1
```

Also applies to: 89-89
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/workflows/gpu_tests.yml:
- Line 70: Add an actionlint configuration file that declares the custom
self-hosted runner labels used by the workflow (the unknown labels are:
linux-amd64-gpu-rtxpro6000-latest-1, linux-amd64-gpu-rtxpro6000-latest-2,
linux-amd64-gpu-h100-latest-1, linux-amd64-gpu-l4-latest-1) so actionlint stops
flagging them as invalid; create or update the actionlint config named
actionlint.yaml with a top-level self-hosted-runner.labels array containing
those label strings (or merge into the existing actionlint config if present) to
suppress the false positive warnings for the gpu_tests.yml workflow.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main     #944      +/-   ##
==========================================
+ Coverage   72.15%   72.16%   +0.01%
==========================================
  Files         210      210
  Lines       23515    23515
==========================================
+ Hits        16967    16970       +3
+ Misses       6548     6545       -3
==========================================
```

☔ View full report in Codecov by Sentry.
55956af to 6a09835
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/gpu/torch/quantization/test_hadamard.py`:
- Around line 23-28: The module-level probe currently catches all exceptions;
change it to only skip for CUDA-unavailability errors by catching explicit
exception types (e.g., RuntimeError and torch.cuda.CudaError if available)
around the fast_hadamard_transform.hadamard_transform(torch.randn(1, 2,
device="cuda")) call, call pytest.skip(...) only for those exceptions, and
re-raise any other exceptions so real failures surface; reference the
fast_hadamard_transform.hadamard_transform call and pytest.skip usage when
making this change.
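The narrowing this comment asks for can be sketched as follows. This is a minimal, GPU-free illustration of the pattern only: `probe_or_skip`, `Skipped`, and the two probe functions are stand-ins, not code from the repository. In the real test file, the probe would be the `fast_hadamard_transform.hadamard_transform(torch.randn(1, 2, device="cuda"))` call and `Skipped` would be the exception `pytest.skip()` raises.

```python
# Sketch of the suggested fix: skip only on CUDA-unavailability errors,
# and let every other exception propagate so real failures surface.

class Skipped(Exception):
    """Stand-in for the exception pytest.skip() raises internally."""

def probe_or_skip(probe_fn):
    try:
        probe_fn()
    except RuntimeError as exc:
        # torch surfaces CUDA unavailability as a RuntimeError
        # (torch.cuda.CudaError is itself a RuntimeError subclass),
        # so only this branch converts the failure into a skip.
        raise Skipped(f"CUDA probe failed: {exc}") from exc
    # Any other exception type is deliberately not caught here.

def cuda_missing():
    # Simulates the environment-level failure the skip is meant for.
    raise RuntimeError("No CUDA GPUs are available")

def genuine_bug():
    # Simulates a real defect that must NOT be silently skipped.
    raise TypeError("hadamard_transform() got an unexpected argument")
```

With this shape, `probe_or_skip(cuda_missing)` raises `Skipped`, while `probe_or_skip(genuine_bug)` lets the `TypeError` escape instead of hiding it behind a skip.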
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- .github/workflows/_example_tests_runner.yml
- .github/workflows/example_tests.yml
- .github/workflows/gpu_tests.yml
- tests/_test_utils/import_helper.py
- tests/_test_utils/torch/megatron/models.py
- tests/_test_utils/torch/megatron/utils.py
- tests/gpu/torch/quantization/test_hadamard.py
- tests/gpu_megatron/_extensions
- tests/gpu_megatron/_extensions/test_torch_extensions.py
- tests/gpu_megatron/torch/nas/plugins/test_megatron_mamba_dynamic_modules.py
- tests/gpu_megatron/torch/prune/plugins/test_mcore_mamba_minitron_pruning.py
- tests/gpu_trtllm/_extensions/test_torch_extensions.py
- tests/gpu_trtllm/torch/quantization/backends/test_fp8_per_tensor_gemm.py
- tests/gpu_trtllm/torch/quantization/backends/test_gemm_common.py
- tests/gpu_trtllm/torch/quantization/backends/test_gemm_registry.py
- tests/gpu_trtllm/torch/quantization/backends/test_nvfp4_gemm.py
- tox.ini
💤 Files with no reviewable changes (2)
- tests/gpu_megatron/_extensions
- tests/_test_utils/torch/megatron/utils.py
✅ Files skipped from review due to trivial changes (1)
- tests/gpu_trtllm/_extensions/test_torch_extensions.py
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
c5b4c21 to 0fb53c9
What does this PR do?
Type of change: CI/CD Improvement
Updated CI/CD test matrix (new `cuda13-gpu-trtllm` dedicated job for GPU tests on the trtllm container):

| Before | After |
|---|---|
| cuda13-gpu, cuda13-gpu-megatron, cuda13-gpu-trtllm | cuda13-gpu, cuda13-gpu-megatron, cuda13-gpu-trtllm |
| llm_distill, llm_qat, llm_sparsity, speculative_decoding | llm_distill, llm_qat, llm_sparsity, speculative_decoding |
| llm_ptq, vlm_ptq | llm_autodeploy, llm_eval, llm_ptq, vlm_ptq |
| diffusers, torch_onnx | diffusers, torch_onnx |

Testing
Summary by CodeRabbit