Urdu TTS G2P support to NeMo by mwzkhalil · Pull Request #15446 · NVIDIA-NeMo/NeMo

mwzkhalil · 2026-02-26T09:22:39Z

[Feature/PR] Add Urdu (ur-PK) IPA G2P support to NeMo TTS

Module: nemo/collections/tts/g2p/
Type: New language support

Summary

This PR adds UrduIpaG2p, a Grapheme-to-Phoneme module for Urdu (ur-PK)
that converts Urdu text written in the Perso-Arabic script (Nastaliq or
Naskh style) into IPA phoneme sequences. It follows the same design
pattern as the existing EnglishG2p (en_us_arpabet.py) and ChineseG2p
(zh_cn_pinyin.py) modules, subclassing BaseG2p directly.

Motivation

Urdu is spoken by approximately 230 million people worldwide and is the
national language of Pakistan. Despite this, NeMo currently has no G2P support
for Urdu, making it impossible to train TTS models for Urdu using the standard
NeMo pipeline.

This contribution provides:

- A complete, tested UrduIpaG2p class (nemo/collections/tts/g2p/models/ur_pk_ipa.py)
- A pronunciation dictionary of ~470,000 Urdu word/phrase → IPA entries in JSON format
- Full feature parity with existing NeMo G2P modules

Urdu Script Notes

Property	Detail
Script	Nastaliq / Naskh (Arabic-based, RTL)
Unicode range	U+0600–U+06FF (Arabic block; includes the Urdu-specific letters ڈ ڑ ھ ے ں), U+0750–U+077F (Arabic Supplement)
Letter case	None — no uppercase/lowercase distinction
Phoneme set	IPA (Urdu-specific phones: ɦ, ʔ, ɖ, ɽ, t̪, d̪, etc.)
Word boundary	Whitespace-delimited (after NFC normalisation)
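
To check which Unicode block a given Urdu letter falls in, its code point can be inspected with Python's standard unicodedata module. The sketch below is illustrative only and not part of the PR; the helper name block_of and the constant URDU_LETTERS are hypothetical.

```python
import unicodedata

# Block boundaries as published in the Unicode charts.
ARABIC = range(0x0600, 0x0700)             # U+0600 to U+06FF
ARABIC_SUPPLEMENT = range(0x0750, 0x0780)  # U+0750 to U+077F

# The five letters listed in the table, written as escapes to avoid
# any ambiguity from copy/paste or RTL rendering.
URDU_LETTERS = "\u0688\u0691\u06be\u06d2\u06ba"


def block_of(ch: str) -> str:
    """Return a coarse Unicode block label for a single character."""
    cp = ord(ch)
    if cp in ARABIC:
        return "Arabic"
    if cp in ARABIC_SUPPLEMENT:
        return "Arabic Supplement"
    return "other"


for ch in URDU_LETTERS:
    # Prints the code point, the official character name, and the block.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {block_of(ch)}")
```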

Files Changed

nemo/collections/tts/g2p/models/ur_pk_ipa.py     ← new file
nemo/collections/tts/g2p/modules.py               ← add UrduIpaG2p import
scripts/tts_dataset_files/urdu_ipa_dict.json      ← ~470k entries
tests/collections/tts/g2p/test_ur_pk_ipa.py       ← new tests

Implementation Details

Dictionary format (JSON):

{
  "غیر حاضری":    "ɣɛːr hɑːzriː",
  "شوکت خانم ليب": "ʃoːˈkət̪ xɑːˈnəm leːb",
  "مختار احمد":   "mʊxˈt̪aːr ˈæhməd"
}

Keys are single words or multi-word phrases; values are space-separated IPA.
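
A minimal sketch of how a dictionary in this format could be parsed, assuming only what is stated above (JSON object, phrase keys, space-separated IPA values). The function names parse_entries and load_phoneme_dict are hypothetical and are not the PR's actual loader.

```python
import json
import unicodedata


def parse_entries(raw: dict) -> dict:
    """NFC-normalise keys and split each IPA value on whitespace."""
    return {
        unicodedata.normalize("NFC", key).strip(): unicodedata.normalize("NFC", ipa).split()
        for key, ipa in raw.items()
    }


def load_phoneme_dict(path: str) -> dict:
    """Read a UTF-8 JSON dictionary file and parse its entries."""
    with open(path, encoding="utf-8") as f:
        return parse_entries(json.load(f))


# Demonstration on one entry taken from the sample above.
sample = json.loads('{"غیر حاضری": "ɣɛːr hɑːzriː"}')
for phrase, phonemes in parse_entries(sample).items():
    print(phrase, "->", phonemes)
```

Normalising at load time means a key stored in decomposed form still matches input text that was typed in composed form.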

Key design decisions:

- Subclasses BaseG2p directly (not IpaG2p): IpaG2p.__init__
  unconditionally calls set_grapheme_case(), which is meaningless for
  Urdu script (no letter case) and raises ValueError with case=None.
  This mirrors the approach taken by EnglishG2p and ChineseG2p.
- Longest-phrase-first matching: __call__ tries up to max_phrase_len
  (default 4) consecutive tokens as a phrase key before falling back to
  single-word lookup, so named entities and compound words that span
  multiple tokens are handled correctly.
- NFC normalisation: dictionary keys are NFC-normalised at load time
  and input text at inference time, so lookups are consistent
  regardless of how the Urdu text was composed.
- Full feature parity with existing G2P modules:
  - heteronyms support
  - apply_to_oov_word fallback
  - use_stresses toggle
  - phoneme_probability for mixed grapheme/phoneme training
  - Hyphenated OOV word splitting
Usage:

from nemo.collections.tts.g2p.models.ur_pk_ipa import UrduIpaG2p

g2p = UrduIpaG2p(phoneme_dict="scripts/tts_dataset_files/urdu_ipa_dict.json")

g2p("غیر حاضری")
# -> ['ɣɛːr', 'hɑːzriː']

g2p("شوکت خانم ليب")
# -> ['ʃoːˈkət̪', 'xɑːˈnəm', 'leːb']

Pronunciation Dictionary

Size: ~470,000 entries
Coverage: single words, named entities, multi-word phrases, abbreviations
Source: Collected and IPA-transcribed for Urdu TTS research
Format: UTF-8 JSON, NFC-normalised
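
A reviewer who wants to confirm the NFC claim for the dictionary could run a check along these lines. It is a sketch only: the function names are hypothetical, and it relies on unicodedata.is_normalized, which is available from Python 3.8.

```python
import unicodedata
from collections import defaultdict


def find_non_nfc_keys(keys):
    """Return the keys that are not already in NFC form."""
    return [k for k in keys if not unicodedata.is_normalized("NFC", k)]


def find_nfc_collisions(keys):
    """Group distinct keys that collapse to the same NFC form."""
    groups = defaultdict(list)
    for k in keys:
        groups[unicodedata.normalize("NFC", k)].append(k)
    return {norm: ks for norm, ks in groups.items() if len(ks) > 1}


# Alef with madda, once precomposed (U+0622) and once as
# alef (U+0627) followed by combining madda (U+0653).
keys = ["\u0622\u0628", "\u0627\u0653\u0628"]
print(len(find_non_nfc_keys(keys)), "key(s) not in NFC")
print(len(find_nfc_collisions(keys)), "collision group(s)")
```

To run it against the real file, pass the keys of the loaded JSON object, for example json.load(f).keys().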

Testing

python3 -m pytest tests/collections/tts/g2p/test_ur_pk_ipa.py -v

Checklist

Follows existing NeMo G2P module patterns (EnglishG2p, ChineseG2p)
Subclasses BaseG2p directly
Full docstrings (module, class, all methods)
Registered in modules.py
~470k entry pronunciation dictionary included
Unit tests
Config YAML example (happy to add if requested)
Integration with IPATokenizer for FastPitch/VITS training

Related Issues / References

Urdu phonology: https://en.wikipedia.org/wiki/Urdu_phonology
Unicode Arabic block: https://www.unicode.org/charts/PDF/U0600.pdf
eSpeak-NG Urdu support: https://github.com/espeak-ng/espeak-ng

I am happy to address review feedback, add a YAML config example, or extend
the dictionary coverage. Thank you for considering this contribution!

Signed-off-by: Mahwiz Khalil <khalilmahwiz@gmail.com>

Signed-off-by: mwzkhalil <mwzkhalil@users.noreply.github.com>

github-actions bot added the TTS label Feb 26, 2026

mwzkhalil force-pushed the feature/urdu-ipa-g2p branch from e922e0e to cecd3e2 Compare February 26, 2026 10:15

Urdu TTS G2P support to NeMo

428fad6

Signed-off-by: Mahwiz Khalil <khalilmahwiz@gmail.com>

mwzkhalil force-pushed the feature/urdu-ipa-g2p branch from cecd3e2 to 428fad6 Compare February 26, 2026 10:16

Apply isort and black reformatting

031a8ae

Signed-off-by: mwzkhalil <mwzkhalil@users.noreply.github.com>

mwzkhalil wants to merge 2 commits into NVIDIA-NeMo:main from mwzkhalil:feature/urdu-ipa-g2p
