Fix skip softmax calibration memory issue#923
Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository; please check the settings in the CodeRabbit UI.
📝 Walkthrough

Configuration values updated in attention sparsity settings. Target sparse ratio reduced from 0.9 to 0.5 for both prefill and decode modes, and maximum sequence length reduced from 65536 to 16384.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks: ✅ 3 passed
🧹 Nitpick comments (1)
modelopt/torch/sparsity/attention_sparsity/config.py (1)
Line 404: `max_seqlen` in `SKIP_SOFTMAX_CALIB` now diverges from `CalibrationConfig`'s default.

`SKIP_SOFTMAX_CALIB` sets `max_seqlen = 16384`, but `CalibrationConfig.max_seqlen` still defaults to 32768 (line 187). A user constructing `CalibrationConfig()` directly will get a different effective ceiling than a user relying on `SKIP_SOFTMAX_CALIB`. Consider aligning the two, or add a comment to `SKIP_SOFTMAX_CALIB` noting the deliberate divergence.

Additionally, for models commonly used with sequences > 16384 tokens (e.g., 32K/128K-context variants), the exponential threshold model will be extrapolating beyond its calibrated range, which may degrade calibration quality at those lengths.
💡 Aligning `CalibrationConfig.max_seqlen` default with `SKIP_SOFTMAX_CALIB`:

```diff
 max_seqlen: int = ModeloptField(
-    default=32768,
+    default=16384,
     title="Maximum sequence length",
     description="Maximum sequence length for calibration (length bins auto-generated as powers of 2).",
 )
```
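The divergence the comment describes can be reproduced with a stripped-down stand-in: a plain dataclass replaces the real `ModeloptField` machinery, and the preset dict is a simplified sketch of `SKIP_SOFTMAX_CALIB`, not its actual contents.

```python
from dataclasses import dataclass

@dataclass
class CalibrationConfig:
    # Default ceiling when the config is constructed directly
    # (config.py line 187 in the review).
    max_seqlen: int = 32768

# Simplified stand-in for the SKIP_SOFTMAX_CALIB preset, which
# overrides the field to 16384 after this PR.
SKIP_SOFTMAX_CALIB = {"max_seqlen": 16384}

direct = CalibrationConfig()                      # uses the class default
preset = CalibrationConfig(**SKIP_SOFTMAX_CALIB)  # uses the preset override
print(direct.max_seqlen, preset.max_seqlen)       # the two ceilings differ
```

Aligning the class default with the preset (or documenting the gap) would make the two construction paths agree.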
Force-pushed from d0c0f51 to 6f71478 (compare)
Force-pushed from 6f71478 to 44f19ea (compare)
Force-pushed from 44f19ea to fd724a4 (compare)
@rohansjoshi You need to sign your commits with an SSH key. Please take a look at https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md#%EF%B8%8F-signing-your-work
Three sources of unnecessary memory allocation during calibration:

1. flash_skip_softmax.py: In `calc_correction_factor_and_p`, `p` (a full attention-matrix-sized float tensor) and `p_larger_than_thresh` (a same-sized boolean tensor) were both alive simultaneously alongside `blocked_attn`. Fuse the subtraction and comparison into a single expression to avoid materializing `p`, and explicitly `del` `block_max`, `block_max_larger`, `block_max_cummax`, and `p_larger_than_thresh` as soon as each is no longer needed. Applies to both prefill and decode paths.
2. calibrate.py (chunked prefill): `del outputs` after extracting `past_key_values` in each chunk to free logits between chunks.
3. calibrate.py (decode loop): `del outputs` after the prefill step to free the large [B, seqlen, vocab] logits tensor before the decode loop, and `del outputs` inside each decode step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>
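A minimal sketch of the fusion in point 1, with NumPy standing in for the torch tensors; the shapes and threshold value are illustrative placeholders, not taken from the actual kernel.

```python
import numpy as np

def correction_mask(blocked_attn, block_max_cummax, log_threshold):
    # Fused subtract-and-compare: the full-size float difference is an
    # unnamed temporary that can be freed right after the comparison,
    # instead of a named `p` kept alive alongside the boolean mask.
    return (blocked_attn - block_max_cummax) > log_threshold

attn = np.random.default_rng(0).normal(size=(2, 4, 8))  # toy attention blocks
block_max = attn.max(axis=-1, keepdims=True)            # per-block maxima
block_max_cummax = np.maximum.accumulate(block_max, axis=-2)
del block_max                                           # drop once cummax exists
mask = correction_mask(attn, block_max_cummax, log_threshold=-1.0)
```

In the actual torch code the explicit `del` calls matter more, since GPU tensors are only returned to the caching allocator once their Python references die.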
Force-pushed from fd724a4 to ba83675 (compare)
Codecov Report

❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #923 +/- ##
==========================================
- Coverage 72.15% 72.14% -0.01%
==========================================
Files 210 210
Lines 23515 23522 +7
==========================================
+ Hits 16967 16971 +4
- Misses 6548 6551 +3

☔ View full report in Codecov by Sentry.
Fix OOM issue when running skip softmax calibration

Test: works with >= 96 GB GPU memory
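The effect of the `del outputs` pattern (points 2 and 3 of the fix) can be demonstrated on CPU with a stand-in allocation: `tracemalloc` plays the role of `torch.cuda.max_memory_allocated`, and a 10 MB bytearray stands in for the large logits tensor.

```python
import tracemalloc

def peak_mb(fn):
    # Measure peak Python heap usage of fn in megabytes.
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6

def without_del():
    outputs = bytearray(10_000_000)  # stand-in for chunk logits
    nxt = bytearray(10_000_000)      # next chunk allocated while outputs is alive
    return nxt

def with_del():
    outputs = bytearray(10_000_000)
    del outputs                      # free logits before the next allocation
    nxt = bytearray(10_000_000)
    return nxt

print(peak_mb(without_del), peak_mb(with_del))  # second peak is roughly half
```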