Removes use of torchaudio and moves transforms inside of NeMo by blisc · Pull Request #15211 · NVIDIA-NeMo/NeMo

blisc · 2025-12-19T20:17:43Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Removes use of torchaudio.transforms and moves transforms inside of NeMo.
NOTE: we will use torchsquirm in nemo/collections/audio/metrics/squim.py and nemo/collections/tts/models/magpietts_preference_optimization.py

Collection: audio, asr, tts

Changelog

Move frequently used torchaudio transforms into NeMo

PR Type:

New Feature
Bugfix
Documentation

Signed-off-by: Jason <jasoli@nvidia.com>

Signed-off-by: blisc <blisc@users.noreply.github.com>

nithinraok

LGTM.

@chtruong814 / @ko3n1g review for docker related changes.

…-NeMo#15211)

* remove use of torchaudio.transforms; SQUIM todo
  Signed-off-by: Jason <jasoli@nvidia.com>
* Apply isort and black reformatting
  Signed-off-by: blisc <blisc@users.noreply.github.com>
* add renamed file
  Signed-off-by: Jason <jasoli@nvidia.com>
* Apply isort and black reformatting
  Signed-off-by: blisc <blisc@users.noreply.github.com>
* fix autorefactor errors
  Signed-off-by: Jason <jasoli@nvidia.com>
* fix linting issues
  Signed-off-by: Jason <jasoli@nvidia.com>
* remove unneeded imports inside of audio collection
  Signed-off-by: Jason <jasoli@nvidia.com>
* Apply isort and black reformatting
  Signed-off-by: blisc <blisc@users.noreply.github.com>
* remove torchaudio from more files
  Signed-off-by: Jason <jasoli@nvidia.com>
* update tests
  Signed-off-by: Jason <jasoli@nvidia.com>
* Apply isort and black reformatting
  Signed-off-by: blisc <blisc@users.noreply.github.com>
* change audio codec TA call
  Signed-off-by: Jason <jasoli@nvidia.com>
* update import statement in speechlm2
  Signed-off-by: Jason <jasoli@nvidia.com>

---------

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: blisc <blisc@users.noreply.github.com>
Co-authored-by: blisc <blisc@users.noreply.github.com>
Signed-off-by: Akhil Varanasi <akhilvaranasi23@gmail.com>

…5211 Signed-off-by: Jason <jasoli@nvidia.com>

…5211 (#15384) Signed-off-by: Jason <jasoli@nvidia.com>

…IDIA-NeMo#15211 (NVIDIA-NeMo#15384) Signed-off-by: Jason <jasoli@nvidia.com>

…5211 (#15384) (#15391) Signed-off-by: Jason <jasoli@nvidia.com>

MahmoudAshraf97 · 2026-02-24T22:56:20Z

This PR is a breaking change for older models. Please take action before it makes it into the next release.

RuntimeError: Error(s) in loading state_dict for EncDecCTCModelBPE:
Missing key(s) in state_dict: "preprocessor.featurizer.window", "preprocessor.featurizer.fb".
Unexpected key(s) in state_dict: "preprocessor.featurizer._mel_spec_extractor.spectrogram.window", "preprocessor.featurizer._mel_spec_extractor.mel_scale.fb".

pzelasko · 2026-02-24T23:06:36Z

Which models is it breaking / how old?

MahmoudAshraf97 · 2026-02-24T23:21:16Z

I managed to reproduce it with a model trained using v1.23.0. Another model, trained using v2.2.1, did not reproduce the issue. These are internal models that I cannot share, but I'm happy to test models published on HF or prepare a minimal repro if needed.

MahmoudAshraf97 · 2026-02-25T11:52:29Z

Further investigation shows that this is reproducible with any model that was trained with preprocessor.use_torchaudio=True, regardless of the version used to train it.
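A quick way to tell whether a given checkpoint is affected is to look for the old key prefix from the error above. The snippet below is only a sketch: the helper name `uses_torchaudio_featurizer` is made up for illustration, and it assumes the state-dict keys have already been extracted from the checkpoint by whatever means is convenient.

```python
# Prefix shared by the "unexpected" keys in the error message above.
TORCHAUDIO_KEY_PREFIX = "preprocessor.featurizer._mel_spec_extractor."


def uses_torchaudio_featurizer(state_dict_keys):
    """Return True if any key carries the old torchaudio-style prefix.

    Hypothetical helper, not part of NeMo. `state_dict_keys` is any
    iterable of key strings taken from a checkpoint's state dict.
    """
    return any(key.startswith(TORCHAUDIO_KEY_PREFIX) for key in state_dict_keys)


# Example with the key names reported in this thread.
old_style_keys = [
    "preprocessor.featurizer._mel_spec_extractor.spectrogram.window",
    "preprocessor.featurizer._mel_spec_extractor.mel_scale.fb",
]
new_style_keys = [
    "preprocessor.featurizer.window",
    "preprocessor.featurizer.fb",
]
print(uses_torchaudio_featurizer(old_style_keys))
print(uses_torchaudio_featurizer(new_style_keys))
```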

pzelasko · 2026-02-25T13:10:41Z

Torchaudio was removed as a dependency. Can you migrate all models to the non-torchaudio preprocessor?

MahmoudAshraf97 · 2026-02-25T13:27:50Z

I don't mind doing that; in fact, I stopped using it a while ago. The problem arises when we try to load models that were trained using torchaudio in the preprocessor, and that fails. The solution, IMO, would be either translation code that matches the key names in the state dict during model loading, or a script that converts old .nemo files that used torchaudio to a format accepted by the new versions (just modify the parameter names in the state dict).
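For illustration, the key-renaming idea could look roughly like the sketch below. The mapping is taken directly from the error message earlier in this thread. The function name `rename_torchaudio_keys` is hypothetical, and this is not necessarily what any NeMo fix does. It only renames keys. Whether the stored tensors have the shape and layout the new featurizer expects is an assumption that this thread does not confirm and that would need to be checked separately.

```python
# Old-to-new key names, copied from the "Unexpected" / "Missing" keys
# in the RuntimeError reported above.
KEY_RENAMES = {
    "preprocessor.featurizer._mel_spec_extractor.spectrogram.window": "preprocessor.featurizer.window",
    "preprocessor.featurizer._mel_spec_extractor.mel_scale.fb": "preprocessor.featurizer.fb",
}


def rename_torchaudio_keys(state_dict):
    """Return a new dict with old torchaudio-style keys renamed.

    Hypothetical helper, not part of NeMo. Values are passed through
    untouched; no shape or dtype conversion is attempted.
    """
    return {KEY_RENAMES.get(key, key): value for key, value in state_dict.items()}


# Example with placeholder values standing in for tensors.
old_state = {
    "preprocessor.featurizer._mel_spec_extractor.spectrogram.window": "window-tensor",
    "preprocessor.featurizer._mel_spec_extractor.mel_scale.fb": "fb-tensor",
    "encoder.some.weight": "other-tensor",
}
new_state = rename_torchaudio_keys(old_state)
print(sorted(new_state))
```

The same function could be applied either at load time, before the state dict is handed to the model, or in a one-off conversion script that rewrites the weights stored inside an old .nemo file.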

pzelasko · 2026-02-25T19:39:24Z

@MahmoudAshraf97 see if this helps #15437

remove use of torchaudio.transforms; SQUIM todo

9a46c09

Signed-off-by: Jason <jasoli@nvidia.com>

blisc requested a review from pzelasko December 19, 2025 20:17

github-actions bot added TTS ASR audio labels Dec 19, 2025

blisc requested a review from nithinraok December 19, 2025 20:17

blisc added the Run CICD label Dec 19, 2025

Apply isort and black reformatting

f84393d

Signed-off-by: blisc <blisc@users.noreply.github.com>