Signed-off-by: Mahwiz Khalil <khalilmahwiz@gmail.com>
Signed-off-by: mwzkhalil <mwzkhalil@users.noreply.github.com>
[Feature/PR] Add Urdu (ur-PK) IPA G2P support to NeMo TTS
#15445
Module: nemo/collections/tts/g2p/
Type: New language support
Summary
This PR adds UrduIpaG2p, a Grapheme-to-Phoneme module for Urdu (ur-PK)
that converts Urdu text written in Nastaliq/Naskh script into IPA phoneme
sequences. It follows the exact same design pattern as the existing
EnglishG2p (en_us_arpabet.py) and ChineseG2p (zh_cn_pinyin.py) modules,
subclassing BaseG2p directly.
Motivation
Urdu is spoken by approximately 230 million people worldwide and is the
national language of Pakistan. Despite this, NeMo currently has no G2P support
for Urdu, making it impossible to train TTS models for Urdu using the standard
NeMo pipeline.
This contribution provides:
UrduIpaG2p class (nemo/collections/tts/g2p/models/ur_pk_ipa.py)
Urdu Script Notes
Files Changed
Implementation Details
Dictionary format (JSON):
{
  "غیر حاضری": "ɣɛːr hɑːzriː",
  "شوکت خانم ليب": "ʃoːˈkət̪ xɑːˈnəm leːb",
  "مختار احمد": "mʊxˈt̪aːr ˈæhməd"
}
Keys are single words or multi-word phrases; values are space-separated IPA.
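As a rough illustration of how a dictionary in this format could be read, here is a minimal sketch and not the loader shipped in this PR. The function name load_phoneme_dict is hypothetical, and splitting each value on whitespace into per-word IPA strings is an assumption about how the values are consumed.

```python
import json
import unicodedata
from typing import Dict, List


def load_phoneme_dict(json_text: str) -> Dict[str, List[str]]:
    """Parse a JSON pronunciation dictionary into {NFC key: [IPA word strings]}.

    Hypothetical helper for illustration only; not part of NeMo.
    """
    raw = json.loads(json_text)
    phoneme_dict: Dict[str, List[str]] = {}
    for key, ipa in raw.items():
        # NFC-normalise the key and collapse runs of whitespace so that
        # multi-word phrase keys compare equal however they were typed.
        norm_key = " ".join(unicodedata.normalize("NFC", key).split())
        # Assumption: values are whitespace-separated IPA strings.
        phoneme_dict[norm_key] = ipa.split()
    return phoneme_dict


sample = '{"غیر حاضری": "ɣɛːr hɑːzriː", "مختار احمد": "mʊxˈt̪aːr ˈæhməd"}'
entries = load_phoneme_dict(sample)
for key, ipa_words in entries.items():
    print(key, "->", ipa_words)
```

Normalising keys once at load time keeps later lookups to a plain dictionary membership test.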
Key design decisions:
Subclasses BaseG2p directly (not IpaG2p): IpaG2p.__init__ unconditionally calls
set_grapheme_case(), which is meaningless for Urdu script (no letter case) and
raises ValueError with case=None. This mirrors the approach taken by
EnglishG2p and ChineseG2p.
Longest-phrase-first matching: __call__ tries up to max_phrase_len (default 4)
consecutive tokens as a phrase key before falling back to single-word lookup,
enabling correct handling of named entities and compound words that span
multiple tokens.
NFC normalisation: dictionary keys are NFC-normalised at load time and input
text is NFC-normalised at inference time, ensuring consistent lookup
regardless of how the Urdu text was composed.
Full feature parity with existing G2P modules:
heteronyms support
apply_to_oov_word fallback
use_stresses toggle
phoneme_probability for mixed grapheme/phoneme training
Usage:
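The longest-phrase-first matching and NFC normalisation described in the design decisions can be sketched in a few lines. This is a stand-alone illustration under stated assumptions, not the PR's actual __call__: it assumes whitespace tokenisation, a dictionary whose keys are already NFC-normalised and whose values are lists of IPA strings, and an OOV fallback that passes the token through unchanged. The name g2p_longest_match and the single-word entry in the toy dictionary are illustrative only.

```python
import unicodedata
from typing import Dict, List


def g2p_longest_match(
    text: str,
    phoneme_dict: Dict[str, List[str]],
    max_phrase_len: int = 4,
) -> List[str]:
    """Greedy longest-phrase-first dictionary lookup (illustrative sketch)."""
    tokens = unicodedata.normalize("NFC", text).split()
    output: List[str] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the widest window first, then shrink down to a single token.
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phoneme_dict:
                output.extend(phoneme_dict[candidate])
                i += n
                matched = True
                break
        if not matched:
            # Assumption: OOV tokens are kept as graphemes, standing in for
            # whatever apply_to_oov_word would do in the real module.
            output.append(tokens[i])
            i += 1
    return output


toy_dict = {
    "مختار احمد": ["mʊxˈt̪aːr", "ˈæhməd"],  # phrase entry from the PR sample
    "مختار": ["mʊxˈt̪aːr"],  # illustrative single-word entry
}
print(g2p_longest_match("مختار احمد", toy_dict))
```

Because the window shrinks from max_phrase_len down to one token, a phrase entry always wins over a single-word entry that shares its first token.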
Pronunciation Dictionary
Testing
Checklist
EnglishG2p, ChineseG2p)
BaseG2p directly
modules.py
IPATokenizer for FastPitch/VITS training
Related Issues / References
I am happy to address review feedback, add a YAML config example, or extend
the dictionary coverage. Thank you for considering this contribution!