Urdu TTS G2P support to NeMo #15446

Open
mwzkhalil wants to merge 2 commits into NVIDIA-NeMo:main from mwzkhalil:feature/urdu-ipa-g2p

[Feature/PR] Add Urdu (ur-PK) IPA G2P support to NeMo TTS

#15445

Module: nemo/collections/tts/g2p/
Type: New language support


Summary

This PR adds UrduIpaG2p, a grapheme-to-phoneme (G2P) module for Urdu (ur-PK)
that converts Urdu text in Perso-Arabic script (typically rendered in the
Nastaliq or Naskh style) into IPA phoneme sequences. It follows the same
design pattern as the existing EnglishG2p (en_us_arpabet.py) and ChineseG2p
(zh_cn_pinyin.py) modules, subclassing BaseG2p directly.


Motivation

Urdu is spoken by approximately 230 million people worldwide and is the
national language of Pakistan. Despite this, NeMo currently has no G2P support
for Urdu, so phoneme-based Urdu TTS models cannot be trained with the standard
NeMo pipeline.

This contribution provides:

  • A complete, tested UrduIpaG2p class (nemo/collections/tts/g2p/models/ur_pk_ipa.py)
  • A pronunciation dictionary of ~470,000 Urdu word/phrase → IPA entries in JSON format
  • Full feature parity with existing NeMo G2P modules

Urdu Script Notes

Property      | Detail
Script        | Perso-Arabic (Arabic-based, RTL), typically rendered in Nastaliq or Naskh style
Unicode range | U+0600–U+06FF (Arabic block, which includes the Urdu letters ڈ ڑ ھ ے ں); U+0750–U+077F (Arabic Supplement)
Letter case   | None (no uppercase/lowercase distinction)
Phoneme set   | IPA (Urdu-specific phones: ɦ, ʔ, ɖ, ɽ, t̪, d̪, etc.)
Word boundary | Whitespace-delimited (after NFC normalisation)
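
As a quick sanity check on the ranges listed above, the following sketch reports which Unicode block a character falls in. The block boundaries come from the Unicode standard; the helper name `unicode_block` and the constant names are hypothetical and are not part of this PR.

```python
import unicodedata

# Block boundaries from the Unicode standard (range end is exclusive).
ARABIC_BLOCK = range(0x0600, 0x0700)       # U+0600..U+06FF
ARABIC_SUPPLEMENT = range(0x0750, 0x0780)  # U+0750..U+077F


def unicode_block(ch: str) -> str:
    """Return a coarse block label for a single character (hypothetical helper)."""
    code_point = ord(ch)
    if code_point in ARABIC_BLOCK:
        return "Arabic"
    if code_point in ARABIC_SUPPLEMENT:
        return "Arabic Supplement"
    return "other"


# The five Urdu letters named in the table, written as escapes:
# DDAL, RREH, HEH DOACHASHMEE, YEH BARREE, NOON GHUNNA.
URDU_LETTERS = "\u0688\u0691\u06BE\u06D2\u06BA"

for ch in URDU_LETTERS:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {unicode_block(ch)}")
```

A check like this could be used to flag dictionary keys that contain characters outside the expected blocks before they reach the G2P lookup.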

Files Changed

nemo/collections/tts/g2p/models/ur_pk_ipa.py     ← new file
nemo/collections/tts/g2p/modules.py               ← add UrduIpaG2p import
scripts/tts_dataset_files/urdu_ipa_dict.json      ← ~470k entries
tests/collections/tts/g2p/test_ur_pk_ipa.py       ← new tests

Implementation Details

Dictionary format (JSON):

{
  "غیر حاضری":    "ɣɛːr hɑːzriː",
  "شوکت خانم ليب": "ʃoːˈkət̪ xɑːˈnəm leːb",
  "مختار احمد":   "mʊxˈt̪aːr ˈæhməd"
}

Keys are single words or multi-word phrases; values are space-separated IPA.
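
To illustrate how a dictionary in this format might be consumed, here is a minimal loading sketch. It is not the code in ur_pk_ipa.py: the function name `load_ipa_dict` is hypothetical, and collapsing runs of whitespace inside keys is an assumption made for the example.

```python
import json
import unicodedata
from typing import Dict, List


def load_ipa_dict(json_text: str) -> Dict[str, List[str]]:
    """Parse a word/phrase -> IPA JSON object into a lookup table.

    Keys are NFC-normalised; values are split on whitespace into
    one IPA string per word.
    """
    raw = json.loads(json_text)
    table: Dict[str, List[str]] = {}
    for key, ipa in raw.items():
        # Assumption: collapse repeated whitespace so phrase keys join cleanly.
        norm_key = " ".join(unicodedata.normalize("NFC", key).split())
        table[norm_key] = ipa.split()
    return table


# Two entries copied from the sample dictionary above.
sample = """
{
  "غیر حاضری": "ɣɛːr hɑːzriː",
  "مختار احمد": "mʊxˈt̪aːr ˈæhməd"
}
"""

table = load_ipa_dict(sample)
```

Storing each value as a list of per-word IPA strings keeps phrase entries aligned with the whitespace-delimited tokens of the input text.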

Key design decisions:

  1. Subclasses BaseG2p directly (not IpaG2p) — IpaG2p.__init__
    unconditionally calls set_grapheme_case(), which is meaningless for
    Urdu script (no letter case) and raises ValueError with case=None.
    This mirrors the approach taken by EnglishG2p and ChineseG2p.

  2. Longest-phrase-first matching: __call__ tries phrase keys of up to
    max_phrase_len (default 4) consecutive tokens, longest first, before
    falling back to single-word lookup, enabling correct handling of named
    entities and compound words that span multiple tokens.

  3. NFC normalisation — both input text and dictionary keys are
    NFC-normalised at load and inference time, ensuring consistent lookup
    regardless of how the Urdu text was composed.

  4. Full feature parity with existing G2P modules:

    • heteronyms support
    • apply_to_oov_word fallback
    • use_stresses toggle
    • phoneme_probability for mixed grapheme/phoneme training
    • Hyphenated OOV word splitting
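
The longest-phrase-first lookup and NFC normalisation described above can be sketched as follows. This is an illustrative stand-in, not the PR's `__call__` implementation: the function name `lookup_longest_first`, the toy dictionary, and the choice to pass out-of-vocabulary tokens through unchanged are all assumptions.

```python
import unicodedata
from typing import Dict, List


def lookup_longest_first(
    text: str, phoneme_dict: Dict[str, List[str]], max_phrase_len: int = 4
) -> List[str]:
    """Greedy longest-match lookup over whitespace-delimited tokens."""
    tokens = unicodedata.normalize("NFC", text).split()
    output: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink down to one token.
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i : i + n])
            if key in phoneme_dict:
                output.extend(phoneme_dict[key])
                i += n
                break
        else:
            # Assumption: unknown tokens are passed through as graphemes.
            output.append(tokens[i])
            i += 1
    return output


# Toy dictionary: one phrase from the sample above plus its first word alone.
PHRASE = unicodedata.normalize("NFC", "غیر حاضری")
WORD = PHRASE.split()[0]
toy_dict = {PHRASE: ["ɣɛːr", "hɑːzriː"], WORD: ["ɣɛːr"]}

phrase_result = lookup_longest_first(PHRASE, toy_dict)
word_only_result = lookup_longest_first(PHRASE, toy_dict, max_phrase_len=1)
```

With the default `max_phrase_len`, the two-token phrase key wins over the single-word key. With `max_phrase_len=1`, only the first word is found and the second token falls through as out-of-vocabulary, which is the case the phrase lookup is meant to avoid.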

Usage:

from nemo.collections.tts.g2p.models.ur_pk_ipa import UrduIpaG2p

g2p = UrduIpaG2p(phoneme_dict="scripts/tts_dataset_files/urdu_ipa_dict.json")

g2p("غیر حاضری")
# -> ['ɣɛːr', 'hɑːzriː']

g2p("شوکت خانم ليب")
# -> ['ʃoːˈkət̪', 'xɑːˈnəm', 'leːb']

Pronunciation Dictionary

  • Size: ~470,000 entries
  • Coverage: single words, named entities, multi-word phrases, abbreviations
  • Source: Collected and IPA-transcribed for Urdu TTS research
  • Format: UTF-8 JSON, NFC-normalised
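
Since lookups depend on the keys being NFC-normalised, a reviewer may want to verify that property on the dictionary file. The sketch below is one hypothetical way to do so; the function name `find_dictionary_issues` and the toy entries are illustrative and are not taken from the PR.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List


def find_dictionary_issues(entries: Dict[str, str]) -> Dict[str, object]:
    """Report non-NFC keys, empty values, and keys that collide after NFC."""
    non_nfc: List[str] = [
        k for k in entries if unicodedata.normalize("NFC", k) != k
    ]
    empty: List[str] = [k for k, v in entries.items() if not v.strip()]

    # Group the original keys by their normalised form to expose collisions.
    groups = defaultdict(list)
    for key in entries:
        groups[unicodedata.normalize("NFC", key)].append(key)
    collisions = {norm: keys for norm, keys in groups.items() if len(keys) > 1}

    return {"non_nfc": non_nfc, "empty": empty, "collisions": collisions}


# Toy entries: the same word with ALEF WITH MADDA ABOVE precomposed (U+0622)
# and decomposed (U+0627 + U+0653), plus one key with an empty value.
composed = "\u0622\u0628"
decomposed = "\u0627\u0653\u0628"
toy_entries = {composed: "ɑːb", decomposed: "ɑːb", "x": ""}

issues = find_dictionary_issues(toy_entries)
```

On the real file, the entries would come from `json.load` on the dictionary path; any non-empty list in the report points at keys that could be silently missed or overwritten at load time.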

Testing

python3 -m pytest tests/collections/tts/g2p/test_ur_pk_ipa.py -v

Checklist

  • Follows existing NeMo G2P module patterns (EnglishG2p, ChineseG2p)
  • Subclasses BaseG2p directly
  • Full docstrings (module, class, all methods)
  • Registered in modules.py
  • ~470k entry pronunciation dictionary included
  • Unit tests
  • Config YAML example (happy to add if requested)
  • Integration with IPATokenizer for FastPitch/VITS training

Related Issues / References


I am happy to address review feedback, add a YAML config example, or extend
the dictionary coverage. Thank you for considering this contribution!

github-actions bot added the TTS label on Feb 26, 2026
mwzkhalil force-pushed the feature/urdu-ipa-g2p branch from e922e0e to cecd3e2 on February 26, 2026 10:15
Signed-off-by: Mahwiz Khalil <khalilmahwiz@gmail.com>
mwzkhalil force-pushed the feature/urdu-ipa-g2p branch from cecd3e2 to 428fad6 on February 26, 2026 10:16
Signed-off-by: mwzkhalil <mwzkhalil@users.noreply.github.com>