Fix skip softmax calibration memory issue#923
Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository; please check the settings in the CodeRabbit UI.
📝 Walkthrough

Configuration values updated in attention sparsity settings. Target sparse ratio reduced from 0.9 to 0.5 for both prefill and decode modes, and maximum sequence length reduced from 65536 to 16384.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks: ✅ 3 passed
🧹 Nitpick comments (1)
modelopt/torch/sparsity/attention_sparsity/config.py (1)
Line 404: `max_seqlen` in `SKIP_SOFTMAX_CALIB` now diverges from `CalibrationConfig`'s default.

`SKIP_SOFTMAX_CALIB` sets `max_seqlen = 16384`, but `CalibrationConfig.max_seqlen` still defaults to 32768 (line 187). A user constructing `CalibrationConfig()` directly will get a different effective ceiling than a user relying on `SKIP_SOFTMAX_CALIB`. Consider aligning the two, or add a comment to `SKIP_SOFTMAX_CALIB` noting the deliberate divergence.

Additionally, for models commonly used with sequences > 16384 tokens (e.g., 32K/128K-context variants), the exponential threshold model will be extrapolating beyond its calibrated range, which may degrade calibration quality at those lengths.
💡 Aligning `CalibrationConfig.max_seqlen` default with `SKIP_SOFTMAX_CALIB`:

```diff
 max_seqlen: int = ModeloptField(
-    default=32768,
+    default=16384,
     title="Maximum sequence length",
     description="Maximum sequence length for calibration (length bins auto-generated as powers of 2).",
 )
```
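The divergence the comment describes can be reproduced with a stripped-down stand-in: a plain dataclass replaces the real `ModeloptField` machinery, and the preset dict is a simplified sketch of `SKIP_SOFTMAX_CALIB`, not its actual contents.

```python
from dataclasses import dataclass

@dataclass
class CalibrationConfig:
    # Default ceiling when the config is constructed directly
    # (config.py line 187 in the review).
    max_seqlen: int = 32768

# Simplified stand-in for the SKIP_SOFTMAX_CALIB preset, which
# overrides the field to 16384 after this PR.
SKIP_SOFTMAX_CALIB = {"max_seqlen": 16384}

direct = CalibrationConfig()                      # uses the class default
preset = CalibrationConfig(**SKIP_SOFTMAX_CALIB)  # uses the preset override
print(direct.max_seqlen, preset.max_seqlen)       # the two ceilings differ
```

Aligning the class default with the preset (or documenting the gap) would make the two construction paths agree.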
Force-pushed from d0c0f51 to 6f71478 (compare)
Force-pushed from 6f71478 to 44f19ea (compare)
Force-pushed from 44f19ea to fd724a4 (compare)
@rohansjoshi You need to sign your commits with an SSH key. Please take a look at https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md#%EF%B8%8F-signing-your-work
Three sources of unnecessary memory allocation during calibration:

1. flash_skip_softmax.py: In `calc_correction_factor_and_p`, `p` (a full attention-matrix-sized float tensor) and `p_larger_than_thresh` (a same-sized boolean tensor) were both alive simultaneously alongside `blocked_attn`. Fuse the subtraction and comparison into a single expression to avoid materializing `p`, and explicitly `del` `block_max`, `block_max_larger`, `block_max_cummax`, and `p_larger_than_thresh` as soon as each is no longer needed. Applies to both prefill and decode paths.
2. calibrate.py (chunked prefill): `del outputs` after extracting `past_key_values` in each chunk to free logits between chunks.
3. calibrate.py (decode loop): `del outputs` after the prefill step to free the large [B, seqlen, vocab] logits tensor before the decode loop, and `del outputs` inside each decode step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>
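A minimal sketch of the fusion in point 1, with NumPy standing in for the torch tensors; the shapes and threshold value are illustrative placeholders, not taken from the actual kernel.

```python
import numpy as np

def correction_mask(blocked_attn, block_max_cummax, log_threshold):
    # Fused subtract-and-compare: the full-size float difference is an
    # unnamed temporary that can be freed right after the comparison,
    # instead of a named `p` kept alive alongside the boolean mask.
    return (blocked_attn - block_max_cummax) > log_threshold

attn = np.random.default_rng(0).normal(size=(2, 4, 8))  # toy attention blocks
block_max = attn.max(axis=-1, keepdims=True)            # per-block maxima
block_max_cummax = np.maximum.accumulate(block_max, axis=-2)
del block_max                                           # drop once cummax exists
mask = correction_mask(attn, block_max_cummax, log_threshold=-1.0)
```

In the actual torch code the explicit `del` calls matter more, since GPU tensors are only returned to the caching allocator once their Python references die.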
Force-pushed from fd724a4 to ba83675 (compare)
Codecov Report

❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #923 +/- ##
==========================================
- Coverage 72.15% 72.14% -0.01%
==========================================
Files 210 210
Lines 23515 23522 +7
==========================================
+ Hits 16967 16971 +4
- Misses 6548 6551 +3

☔ View full report in Codecov by Sentry.
Fix OOM issue when running skip softmax calibration

Test: works with >= 96 GB GPU memory
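The effect of the `del outputs` pattern (points 2 and 3 of the fix) can be demonstrated on CPU with a stand-in allocation: `tracemalloc` plays the role of `torch.cuda.max_memory_allocated`, and a 10 MB bytearray stands in for the large logits tensor.

```python
import tracemalloc

def peak_mb(fn):
    # Measure peak Python heap usage of fn in megabytes.
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6

def without_del():
    outputs = bytearray(10_000_000)  # stand-in for chunk logits
    nxt = bytearray(10_000_000)      # next chunk allocated while outputs is alive
    return nxt

def with_del():
    outputs = bytearray(10_000_000)
    del outputs                      # free logits before the next allocation
    nxt = bytearray(10_000_000)
    return nxt

print(peak_mb(without_del), peak_mb(with_del))  # second peak is roughly half
```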