
MagpieTTS decoder model on top of NeMo main branch #15277

Draft
paarthneekhara wants to merge 75 commits into NVIDIA-NeMo:main from paarthneekhara:magpietts_decoderonly_2601

Conversation

@paarthneekhara (Collaborator)

No description provided.

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
paarthneekhara and others added 23 commits January 8, 2026 14:11
@zhehuaichen requested a review from Edresson January 28, 2026 21:20
paarthneekhara and others added 3 commits January 28, 2026 19:33
@@ -0,0 +1,173 @@
name: Magpie-TTS-DecoderOnly-EN
Have we tested the non-Lhotse path?

@@ -0,0 +1,1464 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
@blisc (Collaborator) left a comment:
Some more comments from WIP review

@@ -182,7 +182,11 @@ def run_inference_and_evaluation(
violin_plot_metrics.remove('utmosv2')
Let's split this file into 3:

  1. This one should be renamed to tts_infer.py and contain the common elements: dataset loading.
  2. Create one helper function for magpietts and one for em-tts, each with its own command-line arguments.
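A minimal sketch of that split, under stated assumptions: `tts_infer.py` owns a shared parser, and each model extends it in a thin entry point. All file and flag names below are illustrative, not the PR's actual CLI.

```python
import argparse

# Hypothetical sketch of the suggested split: tts_infer.py would own the
# common pieces (dataset loading, shared CLI flags), and each model gets
# a thin entry point that extends the parser. Flag names are illustrative.
def build_common_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Shared TTS inference args")
    parser.add_argument("--manifest", required=True)
    parser.add_argument("--batch_size", type=int, default=8)
    return parser

def parse_magpietts_args(argv=None):
    parser = build_common_parser()
    parser.add_argument("--codec_path")  # encoder-decoder specific (assumed)
    return parser.parse_args(argv)

def parse_emtts_args(argv=None):
    parser = build_common_parser()
    parser.add_argument("--phoneme_tokenizer")  # em-tts specific (assumed)
    return parser.parse_args(argv)

args = parse_magpietts_args(["--manifest", "val.json"])
print(args.batch_size)  # → 8
```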

Comment on lines +14 to +21
"""
MagpieTTS Streaming Inference Test Script.

This script tests the streaming TTS inference functionality, supporting both
single sample (batch_size=1) and batched inference (batch_size>1).

For batched inference, each item in the batch can have different context lengths
and be in different processing phases (context, prompt, phoneme-only, audio).
Can you add to this as to how this differs from magpietts_inference.py?

return [self._token2id[p] for p in ps]


class IPABPETokenizer:
Should we subclass Tokenizer instead of instantiation within the class?
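One way to see the two designs side by side; everything here is a hypothetical sketch (the base class and its members are illustrative, not the actual NeMo tokenizer API):

```python
# Hypothetical sketch contrasting subclassing with wrapping an instance;
# BaseTTSTokenizer and its members are illustrative, not NeMo's API.
class BaseTTSTokenizer:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self._token2id = {t: i for i, t in enumerate(self.tokens)}

    def encode(self, units):
        raise NotImplementedError


class IPABPETokenizer(BaseTTSTokenizer):
    """Subclass variant: vocab bookkeeping is inherited rather than
    re-implemented around a wrapped Tokenizer instance."""

    def encode(self, units):
        # Same id lookup as the wrapped version quoted above.
        return [self._token2id[u] for u in units]


tok = IPABPETokenizer(["a", "b", "c"])
print(tok.encode(["b", "a"]))  # → [1, 0]
```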

    def __init__(self, tokenizer_path: str):
        import os

        from tokenizers import Tokenizer
Move import statements to the top of the file.

Comment on lines +1159 to +1169
elif isinstance(tokenizer, PreTrainedTokenizerBase):
    _tokens = list(tokenizer.get_vocab().keys())
    tokens.extend(_tokens)
    num_tokens = len(_tokens)
    pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.unk_token_id
    if pad_token_id is None:
        raise ValueError(
            f"Tokenizer '{tokenizer_name}' has no pad_token_id or unk_token_id. "
            "Please set one before using with AggregatedTTSTokenizer."
        )
    tokenizer_pad_ids[tokenizer_name] = pad_token_id + tokenizer_offset
Does this affect existing MagpieTTS checkpoints?
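For context on backward compatibility: the new code only changes behavior when `pad_token_id` is `None` (previously `None + tokenizer_offset` would raise a TypeError). A standalone sketch of the fallback, with `DummyTok` and `resolve_pad_id` as hypothetical stand-ins, not the NeMo code:

```python
# Illustrative stand-ins; resolve_pad_id mirrors the fallback logic above.
class DummyTok:
    pad_token_id = None  # e.g. a GPT-style tokenizer without a pad token
    unk_token_id = 3

def resolve_pad_id(tok, offset):
    pad_id = tok.pad_token_id if tok.pad_token_id is not None else tok.unk_token_id
    if pad_id is None:
        raise ValueError("Tokenizer has no pad_token_id or unk_token_id.")
    return pad_id + offset

print(resolve_pad_id(DummyTok(), 100))  # → 103
```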

@@ -0,0 +1,9954 @@
{
Let's move this file internally. When we release a checkpoint that uses this, we should bundle this json into the .nemo archive.

batch_size = batch['text'].size(0)
phoneme_stacking_factor = model.phoneme_stacking_factor
phoneme_vocab_size = model.phoneme_vocab_size

Check notice from Code scanning / CodeQL: unused local variable (Note, test).
Variable T_phoneme is not used.
@shehzeen force-pushed the magpietts_decoderonly_2601 branch from 54d6283 to 06c516f on February 12, 2026 00:12
* PO for EM-TTS
* add PO mode in training
* PO code update
* wip
* wip
* wip
* wip
* bug fixes
* logging for gradient tracking
* GRPO working
@github-actions bot added the core (Changes to NeMo Core) label on Feb 17, 2026
shehzeen and others added 3 commits February 18, 2026 11:47
Signed-off-by: Shehzeen Hussain <shehzeenh@nvidia.com>
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
Comment on lines +948 to +951
def process_text_for_cer(input_text):
    """
    Normalizes text for CER/WER calculation.
    """
FYI @rlangman @rfejgin, since we were talking about this: let's lift this out of the decoder PR and move it to main early.
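For reference, a minimal sketch of what such a shared normalizer might look like. The rules here (lowercasing, punctuation stripping, whitespace collapsing) are assumptions for illustration, not the PR's exact implementation:

```python
import re
import string

# Assumed normalization rules for illustration only; the actual
# process_text_for_cer in the PR may differ.
def process_text_for_cer(input_text: str) -> str:
    text = input_text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(process_text_for_cer("Hello,   World!"))  # → hello world
```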

Comment on lines +16 to +21
from nemo.collections.tts.modules.nemotron_h_decoder import (
    HybridMambaAttentionDynamicCache,
    NemotronHConfig,
    NemotronHForCausalLM,
    NemotronHModel,
)
This isn't necessary. Let's remove

Comment on lines +258 to +266
if self.config.is_decoder_only_model:
_load_16khz_audio = False
_use_text_conditioning_encoder = True
_pad_context_text_to_max_duration = False
else:
_load_16khz_audio = self.model.model_type == 'single_encoder_sv_tts'
_use_text_conditioning_encoder = self.model.use_text_conditioning_encoder
_pad_context_text_to_max_duration = self.model.pad_context_text_to_max_duration

Instead of having if/else branches throughout the code, let's split this into 2 classes: one for encoder-decoder Magpie and one for EasyMagpie.
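A sketch of that split, with hypothetical class and attribute names: each variant declares its own dataset settings instead of branching on `is_decoder_only_model`.

```python
# Hypothetical class split; names are illustrative, not the PR's API.
class MagpieDataSettingsBase:
    load_16khz_audio = False
    use_text_conditioning_encoder = False
    pad_context_text_to_max_duration = False

class EasyMagpieDataSettings(MagpieDataSettingsBase):
    # Decoder-only branch of the if/else above, as class attributes.
    use_text_conditioning_encoder = True

class EncoderDecoderMagpieDataSettings(MagpieDataSettingsBase):
    # Encoder-decoder branch, derived from the model at construction.
    def __init__(self, model):
        self.load_16khz_audio = model.model_type == 'single_encoder_sv_tts'
        self.use_text_conditioning_encoder = model.use_text_conditioning_encoder
        self.pad_context_text_to_max_duration = model.pad_context_text_to_max_duration

class _DummyModel:  # stand-in for exercising the sketch
    model_type = 'single_encoder_sv_tts'
    use_text_conditioning_encoder = False
    pad_context_text_to_max_duration = True

settings = EncoderDecoderMagpieDataSettings(_DummyModel())
print(settings.load_16khz_audio)  # → True
```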

Comment on lines +204 to +208
if self.phoneme_tokenizer is None and self.phoneme_tokenizer_config is not None:
    worker_info = torch.utils.data.get_worker_info()
    worker_id = worker_info.id if worker_info is not None else 0
    logging.info(f"Worker {worker_id} initializing phoneme tokenizer...")
    self.phoneme_tokenizer = instantiate_phoneme_tokenizer(self.phoneme_tokenizer_config)
Can you add a comment explaining why we have to do this?
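The likely reason, which that comment could state: with `num_workers > 0` the DataLoader pickles the dataset into each worker process, and phonemizer-backed tokenizers are expensive or unsafe to pickle, so each worker builds its own on first use. A torch-free sketch of the pattern, with illustrative names:

```python
# Illustrative sketch of lazy per-worker construction (torch-free).
# In the real dataset this calls instantiate_phoneme_tokenizer(...) and
# reads the worker id from torch.utils.data.get_worker_info().
class LazyTokenizerDataset:
    def __init__(self, tokenizer_config):
        self.tokenizer_config = tokenizer_config
        self.phoneme_tokenizer = None  # built lazily inside each worker

    def _ensure_tokenizer(self):
        # Deferring construction keeps the dataset cheap to pickle when
        # the DataLoader forks/spawns workers.
        if self.phoneme_tokenizer is None:
            self.phoneme_tokenizer = dict(self.tokenizer_config)  # stand-in

    def __getitem__(self, idx):
        self._ensure_tokenizer()
        return idx

ds = LazyTokenizerDataset({"lang": "en"})
assert ds.phoneme_tokenizer is None  # nothing built at construction time
ds[0]
print(ds.phoneme_tokenizer)  # → {'lang': 'en'}
```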

Comment on lines +63 to +70
def instantiate_phoneme_tokenizer(phoneme_tokenizer_config):
    phoneme_tokenizer = instantiate(phoneme_tokenizer_config)
    phoneme_vocab_size = len(phoneme_tokenizer.tokens)
    phoneme_tokenizer.bos_token_id = phoneme_vocab_size
    phoneme_tokenizer.eos_token_id = phoneme_vocab_size + 1
    phoneme_tokenizer.unk_token_id = phoneme_vocab_size + 2
    phoneme_tokenizer.vocab_size = phoneme_vocab_size + 3
    return phoneme_tokenizer
I'm not sure when you call this function, but this should be part of the tokenizer class not a util function in the dataset.
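One shape the refactor could take; `from_config` and the class layout are illustrative, not the PR's API. The special-id bookkeeping becomes a classmethod on the tokenizer itself:

```python
# Hypothetical refactor: special-token ids are assigned by the tokenizer
# class itself rather than by a util function in the dataset module.
class PhonemeTokenizer:
    def __init__(self, tokens):
        self.tokens = list(tokens)

    @classmethod
    def from_config(cls, config):
        tok = cls(config["tokens"])
        base_vocab = len(tok.tokens)
        # Specials appended after the base vocabulary, as in the snippet above.
        tok.bos_token_id = base_vocab
        tok.eos_token_id = base_vocab + 1
        tok.unk_token_id = base_vocab + 2
        tok.vocab_size = base_vocab + 3
        return tok

tok = PhonemeTokenizer.from_config({"tokens": ["a", "b", "c"]})
print(tok.bos_token_id, tok.vocab_size)  # → 3 6
```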


However, this only exists in the Lhotse file but not the non-Lhotse file?

mode_idx: Index of this mode in the list of modes (used for task embedding lookup)
"""

name: str
Should we define the name automatically from the parameters rather than relying on the user to specify it?
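A dataclass sketch of deriving the name from the parameters; the field names here are hypothetical, not the PR's mode definition:

```python
from dataclasses import dataclass, field

# Hypothetical mode dataclass: name is derived in __post_init__ instead
# of being supplied by the user. Field names are illustrative.
@dataclass
class TaskMode:
    mode_idx: int
    use_audio_context: bool = False
    name: str = field(init=False)

    def __post_init__(self):
        parts = [f"mode{self.mode_idx}"]
        if self.use_audio_context:
            parts.append("audioctx")
        self.name = "_".join(parts)

print(TaskMode(0, use_audio_context=True).name)  # → mode0_audioctx
```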

dataset.phoneme_tokenizer = instantiate_phoneme_tokenizer(dataset.phoneme_tokenizer_config)


class EasyMagpieTTSModel(ModelPT):
This is a really large file. Can we split it up? Some suggestions:

  • Anything that's common with Encoder-Decoder Magpie, let's move to a separate base class:
    • The code manipulation functions
    • The local transformer functions
    • etc.
  • Let's move the dataclasses to another file, although we can debate this
  • Let's move worker_init_fn too since it should be common to both models
  • Could consider splitting training and inference into two classes as well

Comment on lines +39 to +45
try:
    import torchaudio
    from torchaudio.pipelines import SQUIM_OBJECTIVE

    HAVE_TORCHAUDIO = True
except ImportError:
    HAVE_TORCHAUDIO = False
Let's remove SQUIM, I don't think we use it anymore

phoneme_input_type = 'gt' if random.random() < gt_phoneme_input_prob else 'pred'

generation_start_time = time.perf_counter()
print("Inference started")
Switch print statements to logging.
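E.g., using a module-level logger (the logger name below is illustrative):

```python
import logging
import time

logger = logging.getLogger("nemo.magpietts.infer")  # name illustrative

# Mirrors the timing snippet above, with print swapped for logging.
generation_start_time = time.perf_counter()
logger.info("Inference started")
```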

snapshot[id(p)] = p.data.clone()
return snapshot

def _print_grad_weight_summary(self, metrics: Dict[str, float], step: int) -> None:
This function does not depend on self. Consider moving all helper print functions into a separate file and calling them from the model instead of defining additional class methods.
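A sketch of the lift, with hypothetical function name and formatting: the helper becomes a module-level function that builds the summary string, which the model can then log.

```python
from typing import Dict

# Hypothetical module-level version of the summary helper; the exact
# formatting in the PR may differ.
def format_grad_weight_summary(metrics: Dict[str, float], step: int) -> str:
    lines = [f"step {step}"]
    for name, value in sorted(metrics.items()):
        lines.append(f"  {name}: {value:.4g}")
    return "\n".join(lines)

summary = format_grad_weight_summary({"grad_norm": 0.5, "weight_norm": 2.0}, 10)
print(summary.splitlines()[0])  # → step 10
```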

paarthneekhara and others added 8 commits February 20, 2026 14:20
* config options
* flash attention and timing stats
* clean up timing code
* add utmos to PO
* utmos in PO
* whisper update
* batched utmos

Labels

common, core (Changes to NeMo Core), TTS


3 participants