MagpieTTS decoder model on top of NeMo main branch #15277
paarthneekhara wants to merge 75 commits into NVIDIA-NeMo:main
Conversation
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
Signed-off-by: paarthneekhara <paarthneekhara@users.noreply.github.com>
Signed-off-by: Shehzeen Hussain <shehzeensh@gmail.com>
@@ -0,0 +1,173 @@
name: Magpie-TTS-DecoderOnly-EN
Have we tested the non-Lhotse path?
@@ -0,0 +1,1464 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
I've been told there's an updated nemotronh model file at https://github.com/NVIDIA-NeMo/Automodel/pull/1091/changes#diff-047ae4149c298197dac7920766cbfabb8a82ef350e6b00dafbfc7361534fd85a
We should move to this.
Signed-off-by: Shehzeen Hussain <shehzeensh@gmail.com>
blisc left a comment
Some more comments from WIP review
@@ -182,7 +182,11 @@ def run_inference_and_evaluation(
violin_plot_metrics.remove('utmosv2')
Let's split this file into 3:
- This one should be renamed to tts_infer.py and contain the common elements: dataset loading
- Create one helper function for magpietts and one helper function for em-tts, with different command-line arguments.
"""
MagpieTTS Streaming Inference Test Script.

This script tests the streaming TTS inference functionality, supporting both
single sample (batch_size=1) and batched inference (batch_size>1).

For batched inference, each item in the batch can have different context lengths
and be in different processing phases (context, prompt, phoneme-only, audio).
Can you add to this as to how this differs from magpietts_inference.py?
    return [self._token2id[p] for p in ps]


class IPABPETokenizer:
Should we subclass Tokenizer instead of instantiation within the class?
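For illustration, the subclassing version could look something like this sketch; `BaseTokenizer` is a hypothetical stand-in for the project's actual tokenizer base class, with `encode` mirroring the `_token2id` lookup shown above:

```python
class BaseTokenizer:
    """Hypothetical stand-in for the project's tokenizer base class."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self._token2id = {t: i for i, t in enumerate(self.tokens)}

    def encode(self, ps):
        return [self._token2id[p] for p in ps]


class IPABPETokenizer(BaseTokenizer):
    """Inherits token bookkeeping instead of holding a tokenizer member."""

    def __init__(self, tokens):
        super().__init__(tokens)
```

The subclass then gets `encode` and the vocab bookkeeping for free rather than delegating to a wrapped instance.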
    def __init__(self, tokenizer_path: str):
        import os

        from tokenizers import Tokenizer
Move import statements to the top of the file.
  elif isinstance(tokenizer, PreTrainedTokenizerBase):
      _tokens = list(tokenizer.get_vocab().keys())
      tokens.extend(_tokens)
      num_tokens = len(_tokens)
-     tokenizer_pad_ids[tokenizer_name] = tokenizer.pad_token_id + tokenizer_offset
+     pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.unk_token_id
+     if pad_token_id is None:
+         raise ValueError(
+             f"Tokenizer '{tokenizer_name}' has no pad_token_id or unk_token_id. "
+             "Please set one before using with AggregatedTTSTokenizer."
+         )
+     tokenizer_pad_ids[tokenizer_name] = pad_token_id + tokenizer_offset
Does this affect existing MagpieTTS checkpoints?
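To make that question easier to answer, the fallback could be isolated into a pure helper and checked against the tokenizer settings of existing checkpoints. A minimal sketch; the function name and `offset` parameter are illustrative, not from the PR:

```python
def resolve_pad_id(pad_token_id, unk_token_id, tokenizer_name, offset=0):
    # Prefer pad_token_id, fall back to unk_token_id, else fail loudly,
    # mirroring the fallback shown in the hunk above (sketch only).
    pad = pad_token_id if pad_token_id is not None else unk_token_id
    if pad is None:
        raise ValueError(
            f"Tokenizer '{tokenizer_name}' has no pad_token_id or unk_token_id."
        )
    return pad + offset
```

Any checkpoint whose tokenizer already had a `pad_token_id` gets the same id as before; only tokenizers that previously crashed (pad id `None`) take the new unk path.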
@@ -0,0 +1,9954 @@
{
Let's move this file internally. When we release a checkpoint that uses this, we should bundle this json into the .nemo archive.
Signed-off-by: Shehzeen Hussain <shehzeensh@gmail.com>
Branch force-pushed from 54d6283 to 06c516f
* PO for EM-TTS
* add PO mode in training
* PO code update
* wip
* wip
* wip
* wip
* bug fixes
* logging for gradient tracking
* GRPO working

Signed-off-by: Shehzeen Hussain <shehzeensh@gmail.com>
Signed-off-by: Shehzeen Hussain <shehzeenh@nvidia.com>
Signed-off-by: Shehzeen Hussain <shehzeenh@nvidia.com>
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
def process_text_for_cer(input_text):
    """
    Normalizes text for CER/WER calculation.
    """
from nemo.collections.tts.modules.nemotron_h_decoder import (
    HybridMambaAttentionDynamicCache,
    NemotronHConfig,
    NemotronHForCausalLM,
    NemotronHModel,
)
This isn't necessary. Let's remove it.
if self.config.is_decoder_only_model:
    _load_16khz_audio = False
    _use_text_conditioning_encoder = True
    _pad_context_text_to_max_duration = False
else:
    _load_16khz_audio = self.model.model_type == 'single_encoder_sv_tts'
    _use_text_conditioning_encoder = self.model.use_text_conditioning_encoder
    _pad_context_text_to_max_duration = self.model.pad_context_text_to_max_duration
Instead of having if/else's throughout the code, let's split this into 2 classes: one for encoder-decoder Magpie and one for EasyMagpie.
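A rough sketch of how that split could look, with the per-model flags becoming class attributes read by the shared loading logic instead of runtime branches; all class names here are illustrative:

```python
class MagpieDatasetFlagsBase:
    # Model-specific flags are class attributes; shared loading code
    # reads them instead of branching on is_decoder_only_model.
    load_16khz_audio = False
    use_text_conditioning_encoder = False
    pad_context_text_to_max_duration = False

    def loading_flags(self):
        return (
            self.load_16khz_audio,
            self.use_text_conditioning_encoder,
            self.pad_context_text_to_max_duration,
        )


class EncoderDecoderMagpieFlags(MagpieDatasetFlagsBase):
    load_16khz_audio = True  # e.g. the single_encoder_sv_tts variant
    pad_context_text_to_max_duration = True


class EasyMagpieFlags(MagpieDatasetFlagsBase):
    use_text_conditioning_encoder = True  # decoder-only path
```

Each model then instantiates its own dataset class and the if/else disappears from the hot path.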
if self.phoneme_tokenizer is None and self.phoneme_tokenizer_config is not None:
    worker_info = torch.utils.data.get_worker_info()
    worker_id = worker_info.id if worker_info is not None else 0
    logging.info(f"Worker {worker_id} initializing phoneme tokenizer...")
    self.phoneme_tokenizer = instantiate_phoneme_tokenizer(self.phoneme_tokenizer_config)
Can you add a comment explaining why we have to do this?
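If the reason is what I assume (the tokenizer is expensive to build and/or does not survive pickling into DataLoader worker processes), the explanation could be baked in along these lines; class and method names are illustrative, and the `dict(...)` stands in for `instantiate_phoneme_tokenizer`:

```python
class LazyPhonemeTokenizerMixin:
    def __init__(self, tokenizer_config):
        self.phoneme_tokenizer_config = tokenizer_config
        # Built lazily: the tokenizer is assumed to be expensive to
        # construct (or not picklable), so each DataLoader worker builds
        # its own copy after the worker process starts, rather than
        # inheriting one from the parent process.
        self.phoneme_tokenizer = None

    def _ensure_phoneme_tokenizer(self):
        if self.phoneme_tokenizer is None and self.phoneme_tokenizer_config is not None:
            # Stand-in for instantiate_phoneme_tokenizer(config).
            self.phoneme_tokenizer = dict(self.phoneme_tokenizer_config)
        return self.phoneme_tokenizer
```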
def instantiate_phoneme_tokenizer(phoneme_tokenizer_config):
    phoneme_tokenizer = instantiate(phoneme_tokenizer_config)
    phoneme_vocab_size = len(phoneme_tokenizer.tokens)
    phoneme_tokenizer.bos_token_id = phoneme_vocab_size
    phoneme_tokenizer.eos_token_id = phoneme_vocab_size + 1
    phoneme_tokenizer.unk_token_id = phoneme_vocab_size + 2
    phoneme_tokenizer.vocab_size = phoneme_vocab_size + 3
    return phoneme_tokenizer
I'm not sure when you call this function, but this should be part of the tokenizer class, not a util function in the dataset.

However, this only exists in the Lhotse file but not the non-Lhotse file?
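Moving it into the tokenizer class would mean the special-token ids are defined where the vocabulary is, so both data paths get them for free. A sketch with an illustrative class name:

```python
class PhonemeTokenizerWithSpecials:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        base = len(self.tokens)
        # Special-token ids are appended after the base vocabulary,
        # exactly as the util function does, but owned by the class
        # so Lhotse and non-Lhotse datasets cannot diverge.
        self.bos_token_id = base
        self.eos_token_id = base + 1
        self.unk_token_id = base + 2
        self.vocab_size = base + 3
```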
    mode_idx: Index of this mode in the list of modes (used for task embedding lookup)
    """

    name: str
Should we define the name automatically from the parameters rather than relying on the user to specify it?
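For example, with a dataclass the name could be derived in `__post_init__` so it can never disagree with the parameters; the field names here are illustrative, not the actual ones:

```python
from dataclasses import dataclass, field


@dataclass
class ModeSpec:
    task: str
    mode_idx: int
    use_phonemes: bool = False
    # Derived, not user-supplied, so it cannot drift from the parameters.
    name: str = field(init=False)

    def __post_init__(self):
        text_kind = "phoneme" if self.use_phonemes else "text"
        self.name = f"{self.task}_{text_kind}_{self.mode_idx}"
```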
dataset.phoneme_tokenizer = instantiate_phoneme_tokenizer(dataset.phoneme_tokenizer_config)


class EasyMagpieTTSModel(ModelPT):
This is a really large file. Can we split it up? Some suggestions:
- Anything that's common with Encoder-Decoder Magpie, let's move to a separate base class:
  - The code manipulation functions
  - The local transformer functions
  - etc.
- Let's move the dataclasses to another file, although we can debate this
- Let's move worker_init_fn too, since it should be common to both models
- Could consider splitting training and inference into two classes as well
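A sketch of what the split might look like; all module and class names are illustrative:

```python
# Proposed layout (illustrative):
#   magpietts_base.py    -> shared base class (code manipulation, local transformer)
#   magpietts_configs.py -> shared dataclasses
#   magpietts_data.py    -> worker_init_fn and other data helpers


class MagpieTTSBase:
    """Logic common to encoder-decoder Magpie and EasyMagpie."""

    def manipulate_codes(self, codes):
        # Placeholder for the shared code-manipulation helpers.
        return codes


class EasyMagpieTTSModel(MagpieTTSBase):
    """Decoder-only model keeps only what differs from the base."""
```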
try:
    import torchaudio
    from torchaudio.pipelines import SQUIM_OBJECTIVE

    HAVE_TORCHAUDIO = True
except ImportError:
    HAVE_TORCHAUDIO = False
Let's remove SQUIM; I don't think we use it anymore.
phoneme_input_type = 'gt' if random.random() < gt_phoneme_input_prob else 'pred'

generation_start_time = time.perf_counter()
print("Inference started")
Switch print statements to logging.
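For example, with the stdlib `logging` module (the codebase presumably uses `nemo.utils.logging` instead, which has the same call shape):

```python
import logging
import time

logger = logging.getLogger(__name__)


def timed_inference(generate_fn):
    # logging instead of print(): timestamps, levels, and per-rank
    # filtering in multi-GPU runs come for free.
    start = time.perf_counter()
    logger.info("Inference started")
    result = generate_fn()
    logger.info("Inference finished in %.2f s", time.perf_counter() - start)
    return result
```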
    snapshot[id(p)] = p.data.clone()
    return snapshot


def _print_grad_weight_summary(self, metrics: Dict[str, float], step: int) -> None:
This function does not depend on self. Consider moving all helper print functions into a separate file and call them within the model instead of defining additional class functions
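E.g. as a module-level pure function that returns the text (name and formatting are illustrative):

```python
def format_grad_weight_summary(metrics, step):
    # Pure function: depends only on its arguments, so it can live in a
    # shared utils module instead of being a method on the model class.
    lines = [f"step={step}"]
    for name in sorted(metrics):
        lines.append(f"  {name}: {metrics[name]:.4e}")
    return "\n".join(lines)
```

Returning a string rather than printing also makes it trivial to unit-test and to route through logging at the call site.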
* config options
* flash attention and timing stats
* clean up timing code

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
Signed-off-by: Shehzeen Hussain <shehzeensh@gmail.com>
* add utmos to PO
* utmos in PO
* whisper update
* batched utmos

Signed-off-by: Shehzeen Hussain <shehzeenh@nvidia.com>
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
Signed-off-by: Shehzeen Hussain <shehzeensh@gmail.com>
No description provided.