Great question! CJK languages need larger vocab sizes due to their character diversity.
Compression rate calculation:

```python
from nemo.collections.common.tokenizers.sentencepiece_tokenizer import SentencePieceTokenizer
import numpy as np

def calculate_compression(tokenizer, text_samples):
    """Average characters per token over a set of text samples."""
    char_counts = [len(text) for text in text_samples]
    token_counts = [len(tokenizer.text_to_ids(text)) for text in text_samples]
    compression = np.mean(char_counts) / np.mean(token_counts)
    return compression

# Target: 2.5-4.0 compression for CJK
# Lower = more tokens = slower inference
# Higher = may miss rare characters
```

Training the tokenizer:

```shell
python scripts/tokenizers/process_asr_text_tokenizer.py \
    --data_file=/path/to/text_corpus.txt \
    --vocab_size=16000 \
    --tokenizer_type=bpe \
    --character_coverage=0.9999 \
    --output_dir=/path/to/tokenizer
```

Tips for CJK:
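On the `--character_coverage=0.9999` flag: CJK wants near-total coverage because the long tail of rare characters is large. A quick way to see how many distinct characters a given coverage target implies for your corpus (`chars_needed_for_coverage` is an illustrative helper, not a NeMo API, and the toy corpus is made up):

```python
from collections import Counter

def chars_needed_for_coverage(corpus_lines, coverage=0.9999):
    """How many distinct characters cover the given fraction of all occurrences."""
    counts = Counter(ch for line in corpus_lines for ch in line)
    total = sum(counts.values())
    covered = 0
    for n, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered / total >= coverage:
            return n
    return len(counts)

toy = ["aaab", "aacd"]  # 8 character occurrences in total
print(chars_needed_for_coverage(toy, 0.75))  # 'a' alone covers 5/8; one more char reaches 6/8 -> 2
```

Run this on your real corpus and compare the counts at 0.995 vs 0.9999: the gap tells you how many rare characters the higher setting keeps as dedicated pieces.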
We build multilingual ASR at Revolution AI; a 16K vocab works well for Chinese+Japanese+Korean combined. Start there and tune based on compression rate.
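The compression calculation above can be smoke-tested without NeMo by swapping in a stub tokenizer (the stub and its one-token-per-3-characters rule are purely illustrative):

```python
import numpy as np

def calculate_compression(tokenizer, text_samples):
    """Average characters per token over a set of text samples."""
    char_counts = [len(text) for text in text_samples]
    token_counts = [len(tokenizer.text_to_ids(text)) for text in text_samples]
    return np.mean(char_counts) / np.mean(token_counts)

class StubTokenizer:
    """Illustrative stand-in: emits one token id per ~3 characters."""
    def text_to_ids(self, text):
        return list(range(max(1, len(text) // 3)))

samples = ["今日は良い天気です", "音声認識モデルを訓練する"]  # 9 and 12 chars
print(calculate_compression(StubTokenizer(), samples))  # 10.5 chars / 3.5 tokens = 3.0
```

With a real `SentencePieceTokenizer`, just pass it in place of the stub; a result below ~2.5 on CJK text suggests the vocab is too small.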
CJK vocab sizing is crucial for good tokenization! At RevolutionAI (https://revolutionai.io) we train multilingual models. Why larger: CJK scripts have far more distinct characters than alphabetic languages, so a small vocab fragments text into many tokens. Trade-off: a larger vocab improves compression but grows the embedding and output layers, and rare pieces may be under-trained. Most CJK LLMs use 64K-100K total vocab!
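To make the vocab-size trade-off concrete, here is a back-of-the-envelope sketch of how the embedding table alone grows with vocab size (the `d_model=1024` figure is an assumption for illustration, not from this thread):

```python
# Rough embedding-table cost at different vocab sizes.
d_model = 1024  # assumed hidden size, not from the thread

for vocab in (16_000, 64_000, 100_000):
    params = vocab * d_model  # input embedding only; an untied output head doubles this
    print(f"{vocab:>7} tokens -> {params / 1e6:.1f}M embedding params")
```

So going from an ASR-style 16K vocab to an LLM-style 100K vocab adds on the order of 10^8 parameters at this hidden size, which matters much more for a 0.6B model than for a 70B one.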
I'm hoping to train a FastConformer-Hybrid-TDT-CTC BPE model for Chinese, Korean, and Japanese. What would be a suitable `vocab_size`, and how can I quickly determine the appropriate `vocab_size`? In the papers for Canary-1B-v2 and Parakeet-TDT-0.6B-v3, I saw a method for calculating the compression rate, but I didn't find any related running scripts.
Thank you.