I fine-tuned the Nemotron streaming model for Russian on the Golos dataset, and it shows a large WER gap between in-domain validation data and the FLEURS Russian test set, even though it was never trained on FLEURS-style data.
| Eval set       | WER  |
| -------------- | ---- |
| In-domain val  | 7.5% |
| FLEURS ru test | 27%  |
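For clarity on how I'm scoring: the numbers above use the standard word-level WER (Levenshtein distance over whitespace-split words, divided by reference length). A minimal self-contained sketch of that metric (not the exact NeMo scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("я иду домой сейчас", "я иду домой"))  # 1 deletion / 4 words = 0.25
```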
For reference, a non-streaming Parakeet model finetuned on the same Russian data generalizes well to FLEURS without any read-speech training data, suggesting the generalization gap is at least partially a streaming-specific problem.
Setup:
- Base model: nvidia/nemotron-speech-streaming-en-0.6b
- Finetuned on: Golos dataset
- att_context_size: default (unchanged)
- Decoding: greedy_batch, cache-aware streaming pipeline
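In case it matters for diagnosis: I left the attention context at whatever the base checkpoint ships with. My understanding (an assumption based on NeMo's cache-aware streaming FastConformer configs, not verified against this exact checkpoint) is that the relevant knob sits under the encoder config, roughly like:

```yaml
# Hypothetical excerpt in the style of NeMo cache-aware streaming configs;
# exact keys/values for nemotron-speech-streaming-en-0.6b are an assumption.
model:
  encoder:
    att_context_style: chunked_limited
    att_context_size: [70, 13]   # [left context, right context] in frames
```

If the right context is the streaming-specific factor behind the FLEURS gap, would evaluating with a larger right context (or with the offline/full-context mode, if the checkpoint supports it) be a sensible ablation?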
Any help with this would be really appreciated.
@nithinraok @KunalDhawan