
MagpieTTS decoder model on top of NeMo main branch #15277

Draft
paarthneekhara wants to merge 75 commits into NVIDIA-NeMo:main from paarthneekhara:magpietts_decoderonly_2601

Conversation

@paarthneekhara (Collaborator)

No description provided.

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
paarthneekhara and others added 23 commits January 8, 2026 14:11
@zhehuaichen requested a review from Edresson January 28, 2026 21:20
paarthneekhara and others added 3 commits January 28, 2026 19:33
@@ -0,0 +1,173 @@
name: Magpie-TTS-DecoderOnly-EN
Have we tested the non-Lhotse path?

@@ -0,0 +1,1464 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
@blisc (Collaborator) left a comment:
Some more comments from WIP review

@@ -182,7 +182,11 @@ def run_inference_and_evaluation(
violin_plot_metrics.remove('utmosv2')
Let's split this file into 3:

  1. This one should be renamed to tts_infer.py and contain the common elements: dataset loading.
  2. Create one helper function for magpietts and one for em-tts, each with its own command-line arguments.
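A minimal sketch of that split, under stated assumptions: `tts_infer.py` owns a shared parser, and each model extends it in a thin entry point. All file and flag names below are illustrative, not the PR's actual CLI.

```python
import argparse

# Hypothetical sketch of the suggested split: tts_infer.py would own the
# common pieces (dataset loading, shared CLI flags), and each model gets
# a thin entry point that extends the parser. Flag names are illustrative.
def build_common_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Shared TTS inference args")
    parser.add_argument("--manifest", required=True)
    parser.add_argument("--batch_size", type=int, default=8)
    return parser

def parse_magpietts_args(argv=None):
    parser = build_common_parser()
    parser.add_argument("--codec_path")  # encoder-decoder specific (assumed)
    return parser.parse_args(argv)

def parse_emtts_args(argv=None):
    parser = build_common_parser()
    parser.add_argument("--phoneme_tokenizer")  # em-tts specific (assumed)
    return parser.parse_args(argv)

args = parse_magpietts_args(["--manifest", "val.json"])
print(args.batch_size)  # → 8
```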

Comment on lines +14 to +21
"""
MagpieTTS Streaming Inference Test Script.

This script tests the streaming TTS inference functionality, supporting both
single sample (batch_size=1) and batched inference (batch_size>1).

For batched inference, each item in the batch can have different context lengths
and be in different processing phases (context, prompt, phoneme-only, audio).
Can you add to this as to how this differs from magpietts_inference.py?

return [self._token2id[p] for p in ps]


class IPABPETokenizer:
Should we subclass Tokenizer instead of instantiation within the class?
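One way to see the two designs side by side; everything here is a hypothetical sketch (the base class and its members are illustrative, not the actual NeMo tokenizer API):

```python
# Hypothetical sketch contrasting subclassing with wrapping an instance;
# BaseTTSTokenizer and its members are illustrative, not NeMo's API.
class BaseTTSTokenizer:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self._token2id = {t: i for i, t in enumerate(self.tokens)}

    def encode(self, units):
        raise NotImplementedError


class IPABPETokenizer(BaseTTSTokenizer):
    """Subclass variant: vocab bookkeeping is inherited rather than
    re-implemented around a wrapped Tokenizer instance."""

    def encode(self, units):
        # Same id lookup as the wrapped version quoted above.
        return [self._token2id[u] for u in units]


tok = IPABPETokenizer(["a", "b", "c"])
print(tok.encode(["b", "a"]))  # → [1, 0]
```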

    def __init__(self, tokenizer_path: str):
        import os

        from tokenizers import Tokenizer
Move import statements to the top of the file.

Comment on lines +1159 to +1169
elif isinstance(tokenizer, PreTrainedTokenizerBase):
    _tokens = list(tokenizer.get_vocab().keys())
    tokens.extend(_tokens)
    num_tokens = len(_tokens)
    pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.unk_token_id
    if pad_token_id is None:
        raise ValueError(
            f"Tokenizer '{tokenizer_name}' has no pad_token_id or unk_token_id. "
            "Please set one before using with AggregatedTTSTokenizer."
        )
    tokenizer_pad_ids[tokenizer_name] = pad_token_id + tokenizer_offset
Does this affect existing MagpieTTS checkpoints?
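For context on backward compatibility: the new code only changes behavior when `pad_token_id` is `None` (previously `None + tokenizer_offset` would raise a TypeError). A standalone sketch of the fallback, with `DummyTok` and `resolve_pad_id` as hypothetical stand-ins, not the NeMo code:

```python
# Illustrative stand-ins; resolve_pad_id mirrors the fallback logic above.
class DummyTok:
    pad_token_id = None  # e.g. a GPT-style tokenizer without a pad token
    unk_token_id = 3

def resolve_pad_id(tok, offset):
    pad_id = tok.pad_token_id if tok.pad_token_id is not None else tok.unk_token_id
    if pad_id is None:
        raise ValueError("Tokenizer has no pad_token_id or unk_token_id.")
    return pad_id + offset

print(resolve_pad_id(DummyTok(), 100))  # → 103
```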

@@ -0,0 +1,9954 @@
{
Let's move this file internally. When we release a checkpoint that uses this, we should bundle this json into the .nemo archive.

batch_size = batch['text'].size(0)
phoneme_stacking_factor = model.phoneme_stacking_factor
phoneme_vocab_size = model.phoneme_vocab_size

Check notice from Code scanning / CodeQL: unused local variable (Note, test).
Variable T_phoneme is not used.
@shehzeen force-pushed the magpietts_decoderonly_2601 branch from 54d6283 to 06c516f on February 12, 2026 00:12
* PO for EM-TTS
* add PO mode in training
* PO code update
* wip
* wip
* wip
* wip
* bug fixes
* logging for gradient tracking
* GRPO working
@github-actions bot added the core (Changes to NeMo Core) label on Feb 17, 2026
shehzeen and others added 3 commits February 18, 2026 11:47
Signed-off-by: Shehzeen Hussain <shehzeenh@nvidia.com>
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
Comment on lines +948 to +951
def process_text_for_cer(input_text):
    """
    Normalizes text for CER/WER calculation.
    """
FYI @rlangman @rfejgin, since we were talking about this: let's lift this out of the decoder PR and move it to main early.
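For reference, a minimal sketch of what such a shared normalizer might look like. The rules here (lowercasing, punctuation stripping, whitespace collapsing) are assumptions for illustration, not the PR's exact implementation:

```python
import re
import string

# Assumed normalization rules for illustration only; the actual
# process_text_for_cer in the PR may differ.
def process_text_for_cer(input_text: str) -> str:
    text = input_text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(process_text_for_cer("Hello,   World!"))  # → hello world
```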

Comment on lines +16 to +21
from nemo.collections.tts.modules.nemotron_h_decoder import (
    HybridMambaAttentionDynamicCache,
    NemotronHConfig,
    NemotronHForCausalLM,
    NemotronHModel,
)
This isn't necessary. Let's remove

Comment on lines +258 to +266
if self.config.is_decoder_only_model:
_load_16khz_audio = False
_use_text_conditioning_encoder = True
_pad_context_text_to_max_duration = False
else:
_load_16khz_audio = self.model.model_type == 'single_encoder_sv_tts'
_use_text_conditioning_encoder = self.model.use_text_conditioning_encoder
_pad_context_text_to_max_duration = self.model.pad_context_text_to_max_duration

Instead of having if/else branches throughout the code, let's split this into 2 classes: one for encoder-decoder Magpie and one for EasyMagpie.
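A sketch of that split, with hypothetical class and attribute names: each variant declares its own dataset settings instead of branching on `is_decoder_only_model`.

```python
# Hypothetical class split; names are illustrative, not the PR's API.
class MagpieDataSettingsBase:
    load_16khz_audio = False
    use_text_conditioning_encoder = False
    pad_context_text_to_max_duration = False

class EasyMagpieDataSettings(MagpieDataSettingsBase):
    # Decoder-only branch of the if/else above, as class attributes.
    use_text_conditioning_encoder = True

class EncoderDecoderMagpieDataSettings(MagpieDataSettingsBase):
    # Encoder-decoder branch, derived from the model at construction.
    def __init__(self, model):
        self.load_16khz_audio = model.model_type == 'single_encoder_sv_tts'
        self.use_text_conditioning_encoder = model.use_text_conditioning_encoder
        self.pad_context_text_to_max_duration = model.pad_context_text_to_max_duration

class _DummyModel:  # stand-in for exercising the sketch
    model_type = 'single_encoder_sv_tts'
    use_text_conditioning_encoder = False
    pad_context_text_to_max_duration = True

settings = EncoderDecoderMagpieDataSettings(_DummyModel())
print(settings.load_16khz_audio)  # → True
```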

Comment on lines +204 to +208
if self.phoneme_tokenizer is None and self.phoneme_tokenizer_config is not None:
    worker_info = torch.utils.data.get_worker_info()
    worker_id = worker_info.id if worker_info is not None else 0
    logging.info(f"Worker {worker_id} initializing phoneme tokenizer...")
    self.phoneme_tokenizer = instantiate_phoneme_tokenizer(self.phoneme_tokenizer_config)
Can you add a comment explaining why we have to do this?
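The likely reason, which that comment could state: with `num_workers > 0` the DataLoader pickles the dataset into each worker process, and phonemizer-backed tokenizers are expensive or unsafe to pickle, so each worker builds its own on first use. A torch-free sketch of the pattern, with illustrative names:

```python
# Illustrative sketch of lazy per-worker construction (torch-free).
# In the real dataset this calls instantiate_phoneme_tokenizer(...) and
# reads the worker id from torch.utils.data.get_worker_info().
class LazyTokenizerDataset:
    def __init__(self, tokenizer_config):
        self.tokenizer_config = tokenizer_config
        self.phoneme_tokenizer = None  # built lazily inside each worker

    def _ensure_tokenizer(self):
        # Deferring construction keeps the dataset cheap to pickle when
        # the DataLoader forks/spawns workers.
        if self.phoneme_tokenizer is None:
            self.phoneme_tokenizer = dict(self.tokenizer_config)  # stand-in

    def __getitem__(self, idx):
        self._ensure_tokenizer()
        return idx

ds = LazyTokenizerDataset({"lang": "en"})
assert ds.phoneme_tokenizer is None  # nothing built at construction time
ds[0]
print(ds.phoneme_tokenizer)  # → {'lang': 'en'}
```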

Comment on lines +63 to +70
def instantiate_phoneme_tokenizer(phoneme_tokenizer_config):
    phoneme_tokenizer = instantiate(phoneme_tokenizer_config)
    phoneme_vocab_size = len(phoneme_tokenizer.tokens)
    phoneme_tokenizer.bos_token_id = phoneme_vocab_size
    phoneme_tokenizer.eos_token_id = phoneme_vocab_size + 1
    phoneme_tokenizer.unk_token_id = phoneme_vocab_size + 2
    phoneme_tokenizer.vocab_size = phoneme_vocab_size + 3
    return phoneme_tokenizer
I'm not sure when you call this function, but this should be part of the tokenizer class not a util function in the dataset.
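One shape the refactor could take; `from_config` and the class layout are illustrative, not the PR's API. The special-id bookkeeping becomes a classmethod on the tokenizer itself:

```python
# Hypothetical refactor: special-token ids are assigned by the tokenizer
# class itself rather than by a util function in the dataset module.
class PhonemeTokenizer:
    def __init__(self, tokens):
        self.tokens = list(tokens)

    @classmethod
    def from_config(cls, config):
        tok = cls(config["tokens"])
        base_vocab = len(tok.tokens)
        # Specials appended after the base vocabulary, as in the snippet above.
        tok.bos_token_id = base_vocab
        tok.eos_token_id = base_vocab + 1
        tok.unk_token_id = base_vocab + 2
        tok.vocab_size = base_vocab + 3
        return tok

tok = PhonemeTokenizer.from_config({"tokens": ["a", "b", "c"]})
print(tok.bos_token_id, tok.vocab_size)  # → 3 6
```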


However, this only exists in the Lhotse file but not the non-Lhotse file?

mode_idx: Index of this mode in the list of modes (used for task embedding lookup)
"""

name: str
Should we define the name automatically from the parameters rather than relying on the user to specify it?
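A dataclass sketch of deriving the name from the parameters; the field names here are hypothetical, not the PR's mode definition:

```python
from dataclasses import dataclass, field

# Hypothetical mode dataclass: name is derived in __post_init__ instead
# of being supplied by the user. Field names are illustrative.
@dataclass
class TaskMode:
    mode_idx: int
    use_audio_context: bool = False
    name: str = field(init=False)

    def __post_init__(self):
        parts = [f"mode{self.mode_idx}"]
        if self.use_audio_context:
            parts.append("audioctx")
        self.name = "_".join(parts)

print(TaskMode(0, use_audio_context=True).name)  # → mode0_audioctx
```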

dataset.phoneme_tokenizer = instantiate_phoneme_tokenizer(dataset.phoneme_tokenizer_config)


class EasyMagpieTTSModel(ModelPT):
This is a really large file. Can we split it up? Some suggestions:

  • Anything that's common with Encoder-Decoder Magpie, let's move to a separate base class:
    • The code manipulation functions
    • The local transformer functions
    • etc.
  • Let's move the dataclasses to another file, although we can debate this
  • Let's move worker_init_fn too since it should be common to both models
  • Could consider splitting training and inference into two classes as well

Comment on lines +39 to +45
try:
    import torchaudio
    from torchaudio.pipelines import SQUIM_OBJECTIVE

    HAVE_TORCHAUDIO = True
except ImportError:
    HAVE_TORCHAUDIO = False
Let's remove SQUIM, I don't think we use it anymore

phoneme_input_type = 'gt' if random.random() < gt_phoneme_input_prob else 'pred'

generation_start_time = time.perf_counter()
print("Inference started")
Switch print statements to logging.
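E.g., using a module-level logger (the logger name below is illustrative):

```python
import logging
import time

logger = logging.getLogger("nemo.magpietts.infer")  # name illustrative

# Mirrors the timing snippet above, with print swapped for logging.
generation_start_time = time.perf_counter()
logger.info("Inference started")
```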

snapshot[id(p)] = p.data.clone()
return snapshot

def _print_grad_weight_summary(self, metrics: Dict[str, float], step: int) -> None:
This function does not depend on self. Consider moving all helper print functions into a separate file and calling them from the model instead of defining additional class methods.
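A sketch of the lift, with hypothetical function name and formatting: the helper becomes a module-level function that builds the summary string, which the model can then log.

```python
from typing import Dict

# Hypothetical module-level version of the summary helper; the exact
# formatting in the PR may differ.
def format_grad_weight_summary(metrics: Dict[str, float], step: int) -> str:
    lines = [f"step {step}"]
    for name, value in sorted(metrics.items()):
        lines.append(f"  {name}: {value:.4g}")
    return "\n".join(lines)

summary = format_grad_weight_summary({"grad_norm": 0.5, "weight_norm": 2.0}, 10)
print(summary.splitlines()[0])  # → step 10
```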

paarthneekhara and others added 8 commits February 20, 2026 14:20
* config options
* flash attention and timing stats
* clean up timing code
* add utmos to PO
* utmos in PO
* whisper update
* batched utmos

Labels

common, core (Changes to NeMo Core), TTS


3 participants