
skip generate option for large models and mxfp8#942

Open
arendu wants to merge 4 commits into NVIDIA:main from arendu:adithyare/layerwise_mxfp8

Conversation


@arendu arendu commented Feb 26, 2026

What does this PR do?

Type of change: New feature

Overview: Adds a --skip_generate flag to hf_ptq.py that skips the pre/post-quantization generation preview calls. These calls invoke model.generate(), which crashes for very large models (500B+) that are split across GPU and CPU via device_map="auto" (e.g., models with Mamba/Triton kernels that cannot handle CPU-offloaded tensors).

Usage

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path /path/to/model \
    --export_path /path/to/output \
    --qformat mxfp8 \
    --trust_remote_code \
    --export_fmt hf \
    --batch_size 1 \
    --skip_generate \
    --kv_cache_qformat none

Testing

Tested with a 500B parameter NemotronH hybrid Mamba/attention model on 4x GB200 GPUs. Without --skip_generate, the script crashes at model.generate() due to Mamba Triton kernels failing on CPU-offloaded tensors. With --skip_generate, the generation preview is skipped and quantization proceeds normally.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

The --skip_generate flag sets generated_ids_before_ptq = None early, which also causes the post-quantization generate to be skipped via the existing if generated_ids_before_ptq is None: pass guard. Combined with --batch_size 1 (to skip the get_max_batch_size forward-pass probe), this eliminates all forward passes that can crash for device-map-split models.
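A minimal sketch of this control flow (the function and variable names below are assumed from the PR description; the actual structure of hf_ptq.py may differ):

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical fragment of hf_ptq.py's argument parser, for illustration.
    parser.add_argument(
        "--skip_generate",
        action="store_true",
        help="Skip the pre/post-quantization model.generate() preview calls.",
    )
    return parser.parse_args(argv)


def preview_before_ptq(args, full_model=None, preview_input_ids=None):
    # Returning None here also disables the post-quantization preview,
    # via the existing `if generated_ids_before_ptq is None` guard that
    # the PR description mentions.
    if args.skip_generate:
        return None
    return full_model.generate(preview_input_ids, max_new_tokens=100)


if __name__ == "__main__":
    args = parse_args(["--skip_generate"])
    print(preview_before_ptq(args))  # prints: None
```

With the flag set, no model forward pass is ever reached in this path, which is why it avoids the Mamba/Triton CPU-offload crash described above.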

Summary by CodeRabbit

  • New Features
    • Introduced --skip_generate CLI option to skip pre-quantization text and image generation, reducing processing time for very large models. Useful when generation previews are computationally expensive.

@arendu arendu requested a review from a team as a code owner February 26, 2026 23:36
@arendu arendu requested a review from meenchen February 26, 2026 23:36

copy-pr-bot bot commented Feb 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.



coderabbitai bot commented Feb 26, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

A new CLI flag --skip_generate is added to the PTQ script, allowing users to skip pre-quantization text and image generation. When enabled, the generation step is bypassed by setting generated_ids_before_ptq to None.

Changes

  • CLI and Generation Control — examples/llm_ptq/hf_ptq.py: Added the --skip_generate CLI option to parse_args. Modified the pre_quantize function with conditional logic to bypass generation when the flag is set, while maintaining existing behavior for the DeepSeek and Nemotron VL preview paths.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ — Coverage is 50.00%, below the required 80.00% threshold. Write docstrings for the functions missing them to satisfy the coverage check.

✅ Passed checks (2 passed)
  • Description Check ✅ — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ — The title 'skip generate option for large models and mxfp8' directly and clearly summarizes the main change: adding a --skip_generate CLI option to support quantization of large models, particularly with mxfp8 quantization.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2


Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 1083-1092: The help text for the --skip_generate argparse flag
overstates its effect; update the parser.add_argument call for "--skip_generate"
so the description accurately says it only skips the pre/post-quantization
model.generate() preview calls (not forward passes, calibration, or the
batch-size probe path used when --batch_size 0). Edit the string passed to
parser.add_argument in hf_ptq.py to replace “cannot run forward passes” with
wording like “cannot run model.generate() previews” and explicitly note that
calibration and batch-size probing are unaffected.
- Around line 689-691: The preview input extraction (preview_input_ids) is still
executed even when args.skip_generate is true, causing an unnecessary dataloader
fetch that can fail; update the logic around args.skip_generate and
generated_ids_before_ptq so that when args.skip_generate is set you
short-circuit before any preview_input_ids or dataloader reads (set
generated_ids_before_ptq and preview_input_ids to None or skip their
assignment), i.e., move or guard the preview_input_ids extraction behind the `if
not args.skip_generate` path (the branch that currently tests model_type ==
"deepseek" and subsequent generation logic) so no preview/dataloader work runs
when generation is disabled.


📥 Commits

Reviewing files that changed from the base of the PR and between a6cbcba and b67592f.

📒 Files selected for processing (1)
  • examples/llm_ptq/hf_ptq.py

Comment on lines +689 to +691
if args.skip_generate:
generated_ids_before_ptq = None
elif model_type == "deepseek":
⚠️ Potential issue | 🟡 Minor

Short-circuit preview input extraction when --skip_generate is set.

Even with --skip_generate, the code still fetches preview_input_ids at Line 684 before this branch. That extra batch fetch is unnecessary and can still fail on edge dataloader schemas while generation is intentionally disabled.

Proposed fix
 def pre_quantize(
@@
-    # Only run single sample for preview
-    preview_input_ids = next(iter(calib_dataloader))[
-        "input_features" if model_type == "whisper" else "input_ids"
-    ][0:1]
-
-    # Generate preview before quantization
-    if args.skip_generate:
-        generated_ids_before_ptq = None
+    preview_input_ids = None
+    # Generate preview before quantization
+    if args.skip_generate:
+        generated_ids_before_ptq = None
     elif model_type == "deepseek":
+        preview_input_ids = next(iter(calib_dataloader))[
+            "input_features" if model_type == "whisper" else "input_ids"
+        ][0:1]
         # DeepSeek generation may go OOM, so we skip it
         generated_ids_before_ptq = None
     elif is_nemotron_vl_model and tokenizer is not None:
+        preview_input_ids = next(iter(calib_dataloader))[
+            "input_features" if model_type == "whisper" else "input_ids"
+        ][0:1]
         generated_ids_before_ptq = run_nemotron_vl_preview(
@@
     else:
+        preview_input_ids = next(iter(calib_dataloader))[
+            "input_features" if model_type == "whisper" else "input_ids"
+        ][0:1]
         generated_ids_before_ptq = full_model.generate(preview_input_ids, max_new_tokens=100)


@cjluo-nv cjluo-nv left a comment


Could you update the PR title?


codecov bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.12%. Comparing base (35e6099) to head (76c8a62).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #942      +/-   ##
==========================================
- Coverage   72.15%   72.12%   -0.03%     
==========================================
  Files         210      210              
  Lines       23515    23515              
==========================================
- Hits        16967    16961       -6     
- Misses       6548     6554       +6     


@arendu arendu force-pushed the adithyare/layerwise_mxfp8 branch from c53b00d to 172f485 Compare February 27, 2026 22:03
@arendu arendu changed the title from "Adithyare/layerwise mxfp8" to "skip generate option for large models and mxfp8" Feb 27, 2026
@cjluo-nv
Collaborator

/ok to test 172f485

arendu and others added 4 commits February 27, 2026 14:08
Signed-off-by: adithyare <adithyare@nvidia.com>
Signed-off-by: adithyare <adithyare@nvidia.com>
Signed-off-by: adithyare <adithyare@nvidia.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Adi Renduchintala <adithya.r@gmail.com>
@cjluo-nv cjluo-nv enabled auto-merge (squash) February 27, 2026 22:08
auto-merge was automatically disabled February 27, 2026 22:09

Head branch was pushed to by a user without write access

@arendu arendu force-pushed the adithyare/layerwise_mxfp8 branch from 172f485 to 76c8a62 Compare February 27, 2026 22:09
@cjluo-nv
Collaborator

/ok to test 76c8a62
