
Conversation

@LouisYRYJ
Contributor

This is it. The final EKFAC merge. Most of it is pretty self-contained and doesn't interfere much with the rest of the code, as we had already set up much of the work before.

I made slight changes to the code to accommodate some functionality we were missing. In particular, I need to pass a mask tensor to the CovarianceCollector to ensure that we only process the correct activations (and not, e.g., the activations of the user tokens).
Technically this needs to happen for any collector, but GradientCollector, for example, gets around it because gradients are always 0 at the masked tokens, so when we compute a*g the masked-token activations get multiplied by 0 and are effectively ignored (h/t @smarter).
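
For illustration, here is a minimal sketch (toy shapes, not the actual bergson collectors) of why the per-token product a*g already ignores masked positions when the loss assigns them zero gradient:

import torch

N, S, I, O = 2, 5, 4, 3
a = torch.randn(N, S, I)  # activations, including user/padding tokens
g = torch.randn(N, S, O)  # output gradients
mask = torch.tensor([[1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 0]], dtype=torch.bool)

# If the loss ignores masked tokens, their gradients are exactly zero.
g = g * mask.unsqueeze(-1)

# Per-token outer products of a and g: masked tokens contribute all-zero
# blocks, so summing over every position equals summing over valid ones.
full = torch.einsum("nsi,nso->io", a, g)
valid = torch.einsum("bi,bo->io", a[mask], g[mask])
torch.testing.assert_close(full, valid)

# The activation covariance A^T A has no such zero factor, hence the
# explicit mask that is now passed to the CovarianceCollector.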

@LouisYRYJ mentioned this pull request Jan 13, 2026
@claude

claude bot commented Jan 13, 2026

Code Review

I've found 5 high-signal issues in this PR:


Issue 1: CLAUDE.md violation - torch.cuda.empty_cache() usage

File: bergson/hessians/eigenvectors.py
Lines: 310, 368

The code uses torch.cuda.empty_cache() which violates CLAUDE.md:

torch.cuda.empty_cache() doesn't do what you hope it will do - don't use it.

Fix: Remove both torch.cuda.empty_cache() calls. PyTorch's automatic memory management is sufficient.


Issue 2: Bug - Missing gradient masking in kfac.py

File: bergson/hessians/kfac.py
Lines: 68-89

The backward_hook does not apply valid_mask to filter gradients, while forward_hook correctly masks activations. This asymmetry causes:

  • Activation covariance computed on [num_valid, I] valid positions
  • Gradient covariance computed on [N*S, O] all positions (including padding)

This results in a mathematically incorrect KFAC approximation, since both covariances must be computed over the same data points.

Fix: Apply the same masking in backward_hook:

def backward_hook(self, module: nn.Module, g: Tensor) -> None:
    """Compute gradient covariance: G^T @ G."""
    name = assert_type(str, module._name)
    S_cov_po = self.S_cov_dict[name]
    mask = self._current_valid_mask
    assert mask is not None, "Valid mask not set for backward hook."
    
    # g: [N, S, O], valid_masks: [N, S] -> select valid positions
    g_bo = g[mask]  # [num_valid, O]
    
    # Compute local covariance
    local_update_oo = g_bo.mT @ g_bo
    # ... rest of the function

Issue 3: Bug - Missing gradient masking in tkfac.py

File: bergson/hessians/tkfac.py
Lines: 56-67

Same issue as in kfac.py: backward_hook reshapes gradients to [N*S, O] without masking, while forward_hook masks activations to [num_valid, I]. This creates inconsistent covariance computations.

Fix: Apply the same masking in backward_hook:

def backward_hook(self, module: nn.Module, g: Tensor) -> None:
    """Compute trace-weighted gradient covariance."""
    name = assert_type(str, module._name)
    S_tcov_po = self.S_tcov_dict[name]
    A_tcov_ki = self.A_tcov_dict[name]
    mask = self._current_valid_mask
    assert mask is not None, "Valid mask not set for backward hook."
    
    # g: [N, S, O], valid_masks: [N, S] -> select valid positions
    g_bo = g[mask]  # [num_valid, O]
    a_bi = module._inputs
    # ... rest of the function

Issue 4: Bug - Missing loss normalization when use_dataset_labels=True

File: bergson/collector/collector.py
Lines: 626-643

In fwd_bwd_hessian_factory, the use_dataset_labels=True branch is missing per-sample loss normalization:

  • When use_dataset_labels=False: losses is reduced to shape [batch_size] via losses.sum(1) / denoms
  • When use_dataset_labels=True: losses remains shape [batch_size, seq_len-1] (unreduced)

This inconsistency will cause issues in downstream code expecting per-sample losses.

Fix: Add normalization in the else branch:

else:
    # Use dataset labels instead of sampling
    losses = F.cross_entropy(
        logits[:, :-1].flatten(0, 1),
        y[:, 1:].flatten(0, 1),
        reduction="none",
        ignore_index=-100,
    ).reshape_as(y[:, 1:])
    losses = losses.sum(1) / denoms  # <-- Add this line

Issue 5: Bug - dtype string not converted to torch.dtype

File: bergson/hessians/hessian_approximations.py
Lines: 190-208

When hessian_cfg.hessian_dtype != "auto", a string like "bf16" is passed directly as the dtype parameter instead of a torch.dtype object. This will cause a TypeError when torch.zeros(..., dtype=dtype) is called.

Fix: Use convert_precision_to_torch() to convert the string:

hessian_dtype = (
    model.dtype
    if hessian_cfg.hessian_dtype == "auto"
    else convert_precision_to_torch(hessian_cfg.hessian_dtype)
)

Similar pattern is correctly implemented in bergson/score/score.py:284-288.
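
For reference, a hypothetical sketch of what such a string-to-dtype mapping can look like (illustrative only; the actual convert_precision_to_torch helper in bergson may use different names or accept more aliases):

import torch

# Hypothetical mapping; not the repo's implementation.
_PRECISION_TO_DTYPE = {
    "bf16": torch.bfloat16,
    "fp16": torch.float16,
    "fp32": torch.float32,
}

def convert_precision_to_torch(precision: str) -> torch.dtype:
    """Map a precision string like "bf16" to the corresponding torch.dtype."""
    try:
        return _PRECISION_TO_DTYPE[precision]
    except KeyError:
        raise ValueError(f"Unknown precision string: {precision!r}")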


Summary

  • 1 CLAUDE.md violation: Remove torch.cuda.empty_cache() calls
  • 4 bugs:
    • Missing gradient masking in both KFAC collectors (critical for correctness)
    • Missing loss normalization (causes shape inconsistency)
    • Missing dtype conversion (will cause runtime TypeError)

@LouisYRYJ
Contributor Author

@claude Give it another go. I think I fixed the issues now.

@claude

claude bot commented Jan 13, 2026

Claude finished @LouisYRYJ's task in 2m 10s


Re-reviewing PR after fixes

  • Read updated files to verify fixes
  • Check if CLAUDE.md violation was fixed
  • Check if loss normalization was fixed
  • Check if dtype conversion was fixed
  • Re-evaluate gradient masking issues
  • Provide updated review feedback

Review Results

Great progress! Most issues have been fixed. Here's the status:

✅ Fixed Issues

  1. CLAUDE.md violation - torch.cuda.empty_cache() calls removed from eigenvectors.py:310 and eigenvectors.py:368
  2. Loss normalization - Added losses.sum(1) / denoms at collector.py:630
  3. Dtype conversion - Now using convert_precision_to_torch() at hessian_approximations.py:196

⚠️ Remaining Concern: Gradient Masking Asymmetry

kfac.py - You mentioned that masked gradients are zero, but I want to clarify a potential issue:

In fwd_bwd_hessian (collector.py:625-630), when use_dataset_labels=False:

losses = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),
    y[:, 1:].flatten(),  # Contains -100 at masked positions  
    reduction="none",
    # NOTE: No ignore_index=-100 parameter
).reshape_as(y[:, 1:])

Without ignore_index=-100, PyTorch's cross_entropy will:

  • Treat -100 as a regular class index
  • Compute non-zero loss at those positions
  • Generate non-zero gradients flowing backward

This means gradients at masked positions are not automatically zero, creating a mismatch:

  • Activation covariance: computed on [num_valid, I] valid positions only (kfac.py:51)
  • Gradient covariance: computed on [N*S, O] all positions (kfac.py:74)

The same issue exists in tkfac.py:63 (backward_hook doesn't mask gradients).

Possible resolutions:

  1. If the loss computation is correct and gradients ARE zero at masked positions (perhaps through a mechanism I'm not seeing), then tkfac.py should also skip masking for consistency
  2. If gradients are NOT zero at masked positions, both kfac.py and tkfac.py should mask gradients in backward_hook to match the activation masking

Could you clarify which case applies? If gradients aren't actually zero at masked positions, I can provide a fix.


@LouisYRYJ
Contributor Author

@claude https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html ignore_index is set to -100 by default, so this should be fine?

@claude

claude bot commented Jan 13, 2026

Claude finished @LouisYRYJ's task in 37s


Re-reviewing PR after fixes

  • Verify PyTorch's default ignore_index behavior
  • Re-examine gradient masking concern
  • Update review with corrected information

Corrected Review

You're absolutely right! I apologize for the confusion. PyTorch's cross_entropy does indeed have ignore_index=-100 as the default parameter, which means:

  • Positions where y == -100 automatically get zero loss
  • Gradients flowing backward from those positions are automatically zero
  • No explicit masking is needed in the backward hooks

This means:

  • kfac.py: The gradient masking concern was invalid. Since masked positions have zero gradients, computing g^T @ g over all positions is mathematically equivalent to computing it over only valid positions.
  • tkfac.py: Same reasoning applies - no gradient masking needed.
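
A quick standalone check of this default behaviour (illustrative sketch, not code from the PR):

import torch
import torch.nn.functional as F

logits = torch.randn(1, 4, 10, requires_grad=True)
y = torch.tensor([[3, 7, -100, -100]])  # last two positions masked

# ignore_index defaults to -100, so masked positions get zero loss...
losses = F.cross_entropy(
    logits.flatten(0, 1), y.flatten(), reduction="none"
).reshape_as(y)
assert torch.all(losses[:, 2:] == 0)

# ...and zero gradient, so g^T @ g over all positions matches the masked version.
losses.sum().backward()
assert torch.all(logits.grad[:, 2:] == 0)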

Final Review Status

✅ All Issues Fixed

  1. CLAUDE.md violation - torch.cuda.empty_cache() calls removed from eigenvectors.py
  2. Loss normalization - Added losses.sum(1) / denoms at collector.py:644
  3. Dtype conversion - Now using convert_precision_to_torch() at hessian_approximations.py:196
  4. Gradient masking - Not an issue due to PyTorch's default behavior

The PR looks good to merge! All previously identified issues have been resolved.




@dataclass
class Hessian:
Collaborator

@luciaquirke Jan 13, 2026


Follow-up suggestion: we could consider adding the ability to compute the preconditioners + normalizers here too. It seems like a fairly different operation from a full run (less data, more need to parallelize preconditioner computations). Not sure if this makes sense yet.

Contributor Author


I have thought about it too and I agree! For now I want to keep it lower priority, but it sounds like just shuffling some pipelines around. Also, we now have the compute_eigendecomposition function in eigenvectors.py, which does distributed eigenvector decomposition and may be reused.

smarter and others added 9 commits January 16, 2026 15:51
* Fix mask bug and add batch size invariance test with toy model

The backward_hook was using g.reshape(-1, O) which includes padding
positions in the covariance computation. This causes incorrect results
when batches have different sequence lengths.

Before this commit, the added test failed with:
> FAILED tests/ekfac_tests/test_batch_size_invariance.py::test_trace_batch_invariant[seq_lengths1-20] - AssertionError: Scalars are not close!
>
> Expected 1.231401894309304 but got 0.8983965093439276.
> Absolute difference: 0.33300538496537635 (up to 1e-4 allowed)
> Relative difference: 0.27042786478102654 (up to 0.01 allowed)
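
The invariance property this test exercises, as a minimal standalone sketch with toy tensors (names and shapes here are illustrative, not the actual test code):

import torch
import torch.nn.functional as F

def masked_cov_trace(g: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Trace of G^T @ G accumulated over valid (unmasked) positions only."""
    g_valid = g[mask]  # [num_valid, O]
    return (g_valid.mT @ g_valid).trace()

torch.manual_seed(0)
O = 3
g1, g2 = torch.randn(1, 4, O), torch.randn(1, 7, O)  # two sequences of different lengths

# Processing the sequences one at a time...
per_seq = masked_cov_trace(g1, torch.ones(1, 4, dtype=torch.bool)) \
    + masked_cov_trace(g2, torch.ones(1, 7, dtype=torch.bool))

# ...must match processing them as one padded batch with a validity mask.
g_pad = torch.cat([F.pad(g1, (0, 0, 0, 3)), g2])  # [2, 7, O]
mask = torch.tensor([[1, 1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 1, 1]], dtype=torch.bool)
torch.testing.assert_close(per_seq, masked_cov_trace(g_pad, mask))

# Using g_pad.reshape(-1, O) instead of g_pad[mask] would include the padded
# rows, which is exactly the asymmetry the fixed backward_hook avoids.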

* Fix use_dataset_labels condition and add FIM accuracy test

The condition `if not hessian_cfg.use_dataset_labels:` was inverted,
causing the empirical Fisher (with dataset labels) to use sampled
labels and vice versa.

Add test_fim_accuracy.py which verifies that KFAC approximates the
Fisher Information Matrix within tolerance for both empirical FIM
(dataset labels) and true FIM (sampled labels).

* Add ground truth ekfac tests

This is still missing FSDP support and test_apply_ekfac.py from
#68

Co-Authored-By: LouisYRYJ <louis.yousif@yahoo.de>
- Replace set_all_seeds by existing setup_reproducibility
- Reuse approximate_hessians instead of doing something
  equivalent manually.
Allow configuring the number of samples from the pile-10k dataset via a
pytest command-line option instead of hardcoding 100. The dataset
directory is now named dynamically (e.g., pile_100_examples).
Restore the calls to dist.barrier that existed in
#13; without them the process would
hang when running with world_size > 1.

For testing, we add _allocate_batches_world to compute the batches for the
ground truth. The tests don't pass due to numerical errors; this is handled in
the next commit by changing our comparison logic.
- Eigenvectors: Check |cosine_similarity| ≈ 1 per column, which naturally
  handles sign ambiguity (eigenvectors are only defined up to sign); see the
  sketch below
- Covariances: Check relative Frobenius norm since values should match exactly
- Eigenvalue corrections: Align signs based on eigenvector orientation, then
  check relative error (λ[i,j] transforms as sign_G[i] * sign_A[j])
  - Also re-enable CPU tests, which pass after this change.
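
A minimal sketch of the sign-invariant eigenvector comparison described in the first bullet above (toy example, not the actual test helper):

import torch
import torch.nn.functional as F

def eigvec_similarity(V_ref: torch.Tensor, V_test: torch.Tensor) -> torch.Tensor:
    """Per-column |cosine similarity| between two eigenvector matrices.

    Eigenvectors are only defined up to sign, so taking the absolute value
    makes the comparison invariant to per-column sign flips.
    """
    return F.cosine_similarity(V_ref, V_test, dim=0).abs()

# Flipping the sign of a column should not affect the check.
A = torch.randn(8, 8)
_, V = torch.linalg.eigh(A @ A.mT)
V_flipped = V.clone()
V_flipped[:, 3] *= -1
assert torch.all(eigvec_similarity(V, V_flipped) > 1 - 1e-6)
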
With world_size > 1, floating-point reduction order differs between ground
truth (single process) and distributed run, causing larger numerical
differences in some layers.

For eigenvectors, use average |cos_sim| instead of minimum - this tolerates
occasional outlier eigenvectors while maintaining a stricter threshold
(1e-3 vs 0.1 that would be needed for min).

For eigenvalue corrections, use atol=0.2 when world_size > 1.
@luciaquirke
Collaborator

lmk when this is ready!
