
skip generate option for large models and mxfp8#942

Open
arendu wants to merge 4 commits into NVIDIA:main from arendu:adithyare/layerwise_mxfp8

Conversation


@arendu arendu commented Feb 26, 2026

What does this PR do?

Type of change: New feature

Overview: Adds a --skip_generate flag to hf_ptq.py that skips the pre/post-quantization generation preview calls. These calls invoke model.generate(), which crashes for very large models (500B+) that are split across GPU and CPU via device_map="auto" (e.g., models with Mamba/Triton kernels that cannot handle CPU-offloaded tensors).

Usage

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path /path/to/model \
    --export_path /path/to/output \
    --qformat mxfp8 \
    --trust_remote_code \
    --export_fmt hf \
    --batch_size 1 \
    --skip_generate \
    --kv_cache_qformat none

Testing

Tested with a 500B parameter NemotronH hybrid Mamba/attention model on 4x GB200 GPUs. Without --skip_generate, the script crashes at model.generate() due to Mamba Triton kernels failing on CPU-offloaded tensors. With --skip_generate, the generation preview is skipped and quantization proceeds normally.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

The --skip_generate flag sets generated_ids_before_ptq = None early, which also causes the post-quantization generate to be skipped via the existing if generated_ids_before_ptq is None: pass guard. Combined with --batch_size 1 (to skip the get_max_batch_size forward-pass probe), this eliminates all forward passes that can crash for device-map-split models.
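A minimal sketch of this control flow (the function and variable names below are assumed from the PR description; the actual structure of hf_ptq.py may differ):

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical fragment of hf_ptq.py's argument parser, for illustration.
    parser.add_argument(
        "--skip_generate",
        action="store_true",
        help="Skip the pre/post-quantization model.generate() preview calls.",
    )
    return parser.parse_args(argv)


def preview_before_ptq(args, full_model=None, preview_input_ids=None):
    # Returning None here also disables the post-quantization preview,
    # via the existing `if generated_ids_before_ptq is None` guard that
    # the PR description mentions.
    if args.skip_generate:
        return None
    return full_model.generate(preview_input_ids, max_new_tokens=100)


if __name__ == "__main__":
    args = parse_args(["--skip_generate"])
    print(preview_before_ptq(args))  # prints: None
```

With the flag set, no model forward pass is ever reached in this path, which is why it avoids the Mamba/Triton CPU-offload crash described above.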

Summary by CodeRabbit

  • New Features
    • Introduced --skip_generate CLI option to skip pre-quantization text and image generation, reducing processing time for very large models. Useful when generation previews are computationally expensive.

@arendu arendu requested a review from a team as a code owner February 26, 2026 23:36
@arendu arendu requested a review from meenchen February 26, 2026 23:36

copy-pr-bot bot commented Feb 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.



coderabbitai bot commented Feb 26, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

A new CLI flag --skip_generate is added to the PTQ script, allowing users to skip pre-quantization text and image generation. When enabled, the generation step is bypassed by setting generated_ids_before_ptq to None.

Changes

  • CLI and Generation Control — examples/llm_ptq/hf_ptq.py: Added the --skip_generate CLI option to parse_args. Modified the pre_quantize function with conditional logic to bypass generation when the flag is set, while maintaining existing behavior for the DeepSeek and Nemotron VL preview paths.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ — Coverage is 50.00%, below the required 80.00% threshold. Write docstrings for the functions missing them to satisfy the coverage check.

✅ Passed checks (2 passed)
  • Description Check ✅ — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ — The title 'skip generate option for large models and mxfp8' directly and clearly summarizes the main change: adding a --skip_generate CLI option to support quantization of large models, particularly with mxfp8 quantization.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2


Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 1083-1092: The help text for the --skip_generate argparse flag
overstates its effect; update the parser.add_argument call for "--skip_generate"
so the description accurately says it only skips the pre/post-quantization
model.generate() preview calls (not forward passes, calibration, or the
batch-size probe path used when --batch_size 0). Edit the string passed to
parser.add_argument in hf_ptq.py to replace “cannot run forward passes” with
wording like “cannot run model.generate() previews” and explicitly note that
calibration and batch-size probing are unaffected.
- Around line 689-691: The preview input extraction (preview_input_ids) is still
executed even when args.skip_generate is true, causing an unnecessary dataloader
fetch that can fail; update the logic around args.skip_generate and
generated_ids_before_ptq so that when args.skip_generate is set you
short-circuit before any preview_input_ids or dataloader reads (set
generated_ids_before_ptq and preview_input_ids to None or skip their
assignment), i.e., move or guard the preview_input_ids extraction behind the `if
not args.skip_generate` path (the branch that currently tests model_type ==
"deepseek" and subsequent generation logic) so no preview/dataloader work runs
when generation is disabled.


📥 Commits

Reviewing files that changed from the base of the PR and between a6cbcba and b67592f.

📒 Files selected for processing (1)
  • examples/llm_ptq/hf_ptq.py

Comment on lines +689 to +691
if args.skip_generate:
generated_ids_before_ptq = None
elif model_type == "deepseek":
⚠️ Potential issue | 🟡 Minor

Short-circuit preview input extraction when --skip_generate is set.

Even with --skip_generate, the code still fetches preview_input_ids at Line 684 before this branch. That extra batch fetch is unnecessary and can still fail on edge dataloader schemas while generation is intentionally disabled.

Proposed fix
 def pre_quantize(
@@
-    # Only run single sample for preview
-    preview_input_ids = next(iter(calib_dataloader))[
-        "input_features" if model_type == "whisper" else "input_ids"
-    ][0:1]
-
-    # Generate preview before quantization
-    if args.skip_generate:
-        generated_ids_before_ptq = None
+    preview_input_ids = None
+    # Generate preview before quantization
+    if args.skip_generate:
+        generated_ids_before_ptq = None
     elif model_type == "deepseek":
+        preview_input_ids = next(iter(calib_dataloader))[
+            "input_features" if model_type == "whisper" else "input_ids"
+        ][0:1]
         # DeepSeek generation may go OOM, so we skip it
         generated_ids_before_ptq = None
     elif is_nemotron_vl_model and tokenizer is not None:
+        preview_input_ids = next(iter(calib_dataloader))[
+            "input_features" if model_type == "whisper" else "input_ids"
+        ][0:1]
         generated_ids_before_ptq = run_nemotron_vl_preview(
@@
     else:
+        preview_input_ids = next(iter(calib_dataloader))[
+            "input_features" if model_type == "whisper" else "input_ids"
+        ][0:1]
         generated_ids_before_ptq = full_model.generate(preview_input_ids, max_new_tokens=100)


@cjluo-nv cjluo-nv left a comment


Could you update the PR title?


codecov bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.12%. Comparing base (35e6099) to head (76c8a62).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #942      +/-   ##
==========================================
- Coverage   72.15%   72.12%   -0.03%     
==========================================
  Files         210      210              
  Lines       23515    23515              
==========================================
- Hits        16967    16961       -6     
- Misses       6548     6554       +6     


@arendu arendu force-pushed the adithyare/layerwise_mxfp8 branch from c53b00d to 172f485 Compare February 27, 2026 22:03
@arendu arendu changed the title from "Adithyare/layerwise mxfp8" to "skip generate option for large models and mxfp8" Feb 27, 2026
@cjluo-nv
Collaborator

/ok to test 172f485

arendu and others added 4 commits February 27, 2026 14:08
Signed-off-by: adithyare <adithyare@nvidia.com>
Signed-off-by: adithyare <adithyare@nvidia.com>
Signed-off-by: adithyare <adithyare@nvidia.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Adi Renduchintala <adithya.r@gmail.com>
@cjluo-nv cjluo-nv enabled auto-merge (squash) February 27, 2026 22:08
auto-merge was automatically disabled February 27, 2026 22:09

Head branch was pushed to by a user without write access

@arendu arendu force-pushed the adithyare/layerwise_mxfp8 branch from 172f485 to 76c8a62 Compare February 27, 2026 22:09
@cjluo-nv
Collaborator

/ok to test 76c8a62
