Enable RTX Pro 6000 Blackwell runners for CI/CD#944

Open
kevalmorabia97 wants to merge 2 commits into main from kmorabia/rtxpro-cicd
Conversation


@kevalmorabia97 kevalmorabia97 commented Feb 27, 2026

What does this PR do?

Type of change: CI/CD Improvement

Updated the CI/CD test matrix (added a dedicated cuda13-gpu-trtllm job so GPU tests can run in the TensorRT-LLM container)

| Workflow | Trigger | Test Matrix | GPU Runner |
| --- | --- | --- | --- |
| GPU tests | PR | cuda13-gpu, cuda13-gpu-megatron, cuda13-gpu-trtllm | 1x RTX Pro 6000 |
| GPU tests | Nightly | cuda13-gpu, cuda13-gpu-megatron, cuda13-gpu-trtllm | 2x RTX Pro 6000 |
| Example tests (torch) | PR | llm_distill, llm_qat, llm_sparsity, speculative_decoding | 1x H100 |
| Example tests (torch) | Nightly | llm_distill, llm_qat, llm_sparsity, speculative_decoding | 2x RTX Pro 6000 |
| Example tests (trtllm) | PR | llm_ptq, vlm_ptq | 1x RTX Pro 6000 |
| Example tests (trtllm) | Nightly | llm_autodeploy, llm_eval, llm_ptq, vlm_ptq | 2x RTX Pro 6000 |
| Example tests (onnx) | PR | diffusers, torch_onnx | 1x L4 |
| Example tests (onnx) | Nightly | diffusers, torch_onnx | 2x RTX Pro 6000 |

Testing

Summary by CodeRabbit

  • Chores
    • Unified and simplified CI test matrices and reduced duplicate workflow configuration.
    • Updated GPU runner targets and container images for test jobs to newer GPU types.
  • Tests
    • Added new GPU test variants and several new test modules.
    • Simplified test gating: megatron auto-skip removed; tests now use a single dependency check for the mamba provider.
  • Tooling
    • Added a new tox environment for an additional GPU test suite.
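For reference, the new tox environment mentioned above might look roughly like the sketch below. The environment name, description, and test path are assumptions for illustration; see the tox.ini change in this PR for the actual definition.

```ini
# Hypothetical sketch of a dedicated GPU/TensorRT-LLM tox environment;
# the real env name and commands in tox.ini may differ.
[testenv:cuda13-gpu-trtllm]
description = GPU tests that must run inside the TensorRT-LLM container
commands =
    pytest tests/gpu_trtllm -v {posargs}
```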

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner February 27, 2026 10:38
coderabbitai bot commented Feb 27, 2026

📝 Walkthrough

Walkthrough

The pull request refactors two GitHub Actions workflow files to introduce shared YAML anchors for GPU test strategies and consolidate runner configurations. Updates include adopting reusable matrix definitions, changing runner images from L4/H100 variants to RTX Pro 6000 variants, and adding container environment variables to GPU test jobs.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Example Tests Workflow (`.github/workflows/example_tests.yml`) | Introduces shared anchors (`&torch_strategy`, `&onnx_strategy`) for GPU test matrices to reduce duplication. Updates runner images: L4 → H100 for torch jobs; H100 → RTX Pro 6000 for ONNX and TensorRT-LLM jobs. TensorRT-LLM pr job matrix expanded to include vlm_ptq alongside llm_ptq. Multiple jobs refactored to reuse anchors via `*anchor` syntax. |
| GPU Tests Workflow (`.github/workflows/gpu_tests.yml`) | Introduces a shared `&gpu_strategy` anchor applied to both gpu-tests-pr and gpu-tests-non-pr jobs. Runner updated from L4 to RTX Pro 6000 for the pr job, and from the H100 variant to RTX Pro 6000 for non-pr. Adds container environment variables (GIT_DEPTH, PIP_CONSTRAINT, HF_TOKEN) to the pr job. Matrix definitions consolidated under the shared anchor. |
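As an illustration, the shared-anchor pattern described above might be laid out as in the sketch below. The job names, runner labels, and matrix entries are taken from this PR's tables, but the matrix key name (`test-env`) and the exact placement of the anchor are assumptions, not the actual file contents.

```yaml
# Sketch only; the real workflow files may differ. GitHub Actions now
# supports YAML anchors, so one strategy block can be shared across jobs.
jobs:
  gpu-tests-pr:
    runs-on: linux-amd64-gpu-rtxpro6000-latest-1
    strategy: &gpu_strategy
      fail-fast: false
      matrix:
        test-env: [cuda13-gpu, cuda13-gpu-megatron, cuda13-gpu-trtllm]
  gpu-tests-non-pr:
    if: github.event_name == 'schedule'
    runs-on: linux-amd64-gpu-rtxpro6000-latest-2
    strategy: *gpu_strategy  # reuse the matrix defined in the pr job
```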

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: enabling RTX Pro 6000 Blackwell runners for CI/CD, which aligns with the primary modifications shown in the changeset across GPU and example test workflows. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
.github/workflows/gpu_tests.yml (1)

70-70: Custom runner labels are valid for self-hosted runners — consider adding actionlint config.

The static analysis warnings about unknown runner labels (linux-amd64-gpu-rtxpro6000-latest-1, linux-amd64-gpu-rtxpro6000-latest-2) are false positives. These are custom labels for NVIDIA's self-hosted GPU runners.

To suppress these warnings in future CI runs, consider adding a .github/actionlint.yaml config file:

self-hosted-runner:
  labels:
    - linux-amd64-gpu-rtxpro6000-latest-1
    - linux-amd64-gpu-rtxpro6000-latest-2
    - linux-amd64-gpu-h100-latest-1
    - linux-amd64-gpu-l4-latest-1

Also applies to: 89-89

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/gpu_tests.yml at line 70, Add an actionlint configuration
file that declares the custom self-hosted runner labels used by the workflow
(the unknown labels are: linux-amd64-gpu-rtxpro6000-latest-1,
linux-amd64-gpu-rtxpro6000-latest-2, linux-amd64-gpu-h100-latest-1,
linux-amd64-gpu-l4-latest-1) so actionlint stops flagging them as invalid;
create or update the actionlint config named actionlint.yaml with a top-level
self-hosted-runner.labels array containing those label strings (or merge into
the existing actionlint config if present) to suppress the false positive
warnings for the gpu_tests.yml workflow.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35e6099 and c9ebe8c.

📒 Files selected for processing (2)
  • .github/workflows/example_tests.yml
  • .github/workflows/gpu_tests.yml


codecov bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.16%. Comparing base (35e6099) to head (0fb53c9).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #944      +/-   ##
==========================================
+ Coverage   72.15%   72.16%   +0.01%     
==========================================
  Files         210      210              
  Lines       23515    23515              
==========================================
+ Hits        16967    16970       +3     
+ Misses       6548     6545       -3     

☔ View full report in Codecov by Sentry.

@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/rtxpro-cicd branch 3 times, most recently from 55956af to 6a09835 Compare February 27, 2026 13:05
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/gpu/torch/quantization/test_hadamard.py`:
- Around line 23-28: The module-level probe currently catches all exceptions;
change it to only skip for CUDA-unavailability errors by catching explicit
exception types (e.g., RuntimeError and torch.cuda.CudaError if available)
around the fast_hadamard_transform.hadamard_transform(torch.randn(1, 2,
device="cuda")) call, call pytest.skip(...) only for those exceptions, and
re-raise any other exceptions so real failures surface; reference the
fast_hadamard_transform.hadamard_transform call and pytest.skip usage when
making this change.
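The narrowing the bot suggests can be sketched generically as below. The helper name `skip_if_cuda_unavailable` and the substring check on the error message are illustrative assumptions; in the actual test the probed callable would wrap `fast_hadamard_transform.hadamard_transform(torch.randn(1, 2, device="cuda"))`.

```python
import pytest


def skip_if_cuda_unavailable(probe):
    """Run a CUDA probe callable and skip only on CUDA-availability errors.

    Any other exception is re-raised so genuine bugs in the probed
    extension still surface instead of being silently skipped.
    """
    try:
        probe()
    except RuntimeError as e:
        # Narrow match: treat only CUDA-unavailability as a skip reason.
        if "CUDA" in str(e) or "cuda" in str(e):
            pytest.skip(f"CUDA unavailable: {e}", allow_module_level=True)
        raise  # any other RuntimeError is a real failure
```

Catching `RuntimeError` (and, where the attribute exists, `torch.cuda.CudaError`) instead of a bare `except Exception` keeps import-time failures and shape bugs visible while still skipping cleanly on GPU-less machines.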


📥 Commits

Reviewing files that changed from the base of the PR and between c9ebe8c and 6a09835.

📒 Files selected for processing (17)
  • .github/workflows/_example_tests_runner.yml
  • .github/workflows/example_tests.yml
  • .github/workflows/gpu_tests.yml
  • tests/_test_utils/import_helper.py
  • tests/_test_utils/torch/megatron/models.py
  • tests/_test_utils/torch/megatron/utils.py
  • tests/gpu/torch/quantization/test_hadamard.py
  • tests/gpu_megatron/_extensions
  • tests/gpu_megatron/_extensions/test_torch_extensions.py
  • tests/gpu_megatron/torch/nas/plugins/test_megatron_mamba_dynamic_modules.py
  • tests/gpu_megatron/torch/prune/plugins/test_mcore_mamba_minitron_pruning.py
  • tests/gpu_trtllm/_extensions/test_torch_extensions.py
  • tests/gpu_trtllm/torch/quantization/backends/test_fp8_per_tensor_gemm.py
  • tests/gpu_trtllm/torch/quantization/backends/test_gemm_common.py
  • tests/gpu_trtllm/torch/quantization/backends/test_gemm_registry.py
  • tests/gpu_trtllm/torch/quantization/backends/test_nvfp4_gemm.py
  • tox.ini
💤 Files with no reviewable changes (2)
  • tests/gpu_megatron/_extensions
  • tests/_test_utils/torch/megatron/utils.py
✅ Files skipped from review due to trivial changes (1)
  • tests/gpu_trtllm/_extensions/test_torch_extensions.py

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>