Great question! CJK languages need larger vocab sizes due to their character diversity.
Compression rate calculation:

```python
from nemo.collections.common.tokenizers.sentencepiece_tokenizer import SentencePieceTokenizer
import numpy as np

def calculate_compression(tokenizer, text_samples):
    """Average characters per token over a set of text samples."""
    char_counts = [len(text) for text in text_samples]
    token_counts = [len(tokenizer.text_to_ids(text)) for text in text_samples]
    compression = np.mean(char_counts) / np.mean(token_counts)
    return compression

# Target: 2.5-4.0 compression for CJK
# Lower = more tokens = slower inference
# Higher = may miss rare characters
```

Training the tokenizer:

```shell
python scripts/tokenizers/process_asr_text_tokenizer.py \
    --data_file=/path/to/text_corpus.txt \
    --vocab_size=16000 \
    --tokenizer_type=bpe \
    --character_coverage=0.9999 \
    --output_dir=/path/to/tokenizer
```

Tips for CJK:
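On the `--character_coverage=0.9999` flag: CJK wants near-total coverage because the long tail of rare characters is large. A quick way to see how many distinct characters a given coverage target implies for your corpus (`chars_needed_for_coverage` is an illustrative helper, not a NeMo API, and the toy corpus is made up):

```python
from collections import Counter

def chars_needed_for_coverage(corpus_lines, coverage=0.9999):
    """How many distinct characters cover the given fraction of all occurrences."""
    counts = Counter(ch for line in corpus_lines for ch in line)
    total = sum(counts.values())
    covered = 0
    for n, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered / total >= coverage:
            return n
    return len(counts)

toy = ["aaab", "aacd"]  # 8 character occurrences in total
print(chars_needed_for_coverage(toy, 0.75))  # 'a' alone covers 5/8; one more char reaches 6/8 -> 2
```

Run this on your real corpus and compare the counts at 0.995 vs 0.9999: the gap tells you how many rare characters the higher setting keeps as dedicated pieces.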
We build multilingual ASR at Revolution AI; a 16K vocab works well for Chinese+Japanese+Korean combined. Start there and tune based on compression rate.
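The compression calculation above can be smoke-tested without NeMo by swapping in a stub tokenizer (the stub and its one-token-per-3-characters rule are purely illustrative):

```python
import numpy as np

def calculate_compression(tokenizer, text_samples):
    """Average characters per token over a set of text samples."""
    char_counts = [len(text) for text in text_samples]
    token_counts = [len(tokenizer.text_to_ids(text)) for text in text_samples]
    return np.mean(char_counts) / np.mean(token_counts)

class StubTokenizer:
    """Illustrative stand-in: emits one token id per ~3 characters."""
    def text_to_ids(self, text):
        return list(range(max(1, len(text) // 3)))

samples = ["今日は良い天気です", "音声認識モデルを訓練する"]  # 9 and 12 chars
print(calculate_compression(StubTokenizer(), samples))  # 10.5 chars / 3.5 tokens = 3.0
```

With a real `SentencePieceTokenizer`, just pass it in place of the stub; a result below ~2.5 on CJK text suggests the vocab is too small.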
CJK vocab sizing is crucial for good tokenization! At RevolutionAI (https://revolutionai.io) we train multilingual models. Why larger: CJK scripts have far more distinct characters than alphabetic languages, so a small vocab fragments text into many tokens. Trade-off: a larger vocab improves compression but grows the embedding and output layers, and rare pieces may be under-trained. Most CJK LLMs use 64K-100K total vocab!
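To make the vocab-size trade-off concrete, here is a back-of-the-envelope sketch of how the embedding table alone grows with vocab size (the `d_model=1024` figure is an assumption for illustration, not from this thread):

```python
# Rough embedding-table cost at different vocab sizes.
d_model = 1024  # assumed hidden size, not from the thread

for vocab in (16_000, 64_000, 100_000):
    params = vocab * d_model  # input embedding only; an untied output head doubles this
    print(f"{vocab:>7} tokens -> {params / 1e6:.1f}M embedding params")
```

So going from an ASR-style 16K vocab to an LLM-style 100K vocab adds on the order of 10^8 parameters at this hidden size, which matters much more for a 0.6B model than for a 70B one.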
I'm hoping to train a FastConformer-Hybrid-TDT-CTC BPE model for Chinese, Korean, and Japanese. What would be a suitable `vocab_size`, and how can I quickly determine the appropriate `vocab_size`? In the papers for Canary-1B-v2 and Parakeet-TDT-0.6B-v3, I saw a method for calculating the compression rate, but I didn't find any related running scripts.
Thank you.