@mingewang
**Parakeet v2 vs v3 `max_duration` differences**

Key changes:

Why v3 handles longer:

Adjusting for your needs:

```yaml
model:
  preprocessor:
    max_duration: 30        # Increase for longer audio
  # If OOM, try:
  encoder:
    subsampling_factor: 8   # More aggressive
```

Memory considerations:

```python
# Estimate memory per second
memory_per_sec = model_params * 4 / 1e9   # GB
max_safe = available_gpu_memory / memory_per_sec
```

For very long audio:

Config for long-form:

```yaml
trainer:
  precision: bf16             # Save memory
  accumulate_grad_batches: 4
```

We train ASR models at RevolutionAI. What's your target audio duration?
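Pulling the suggestions above together, one possible fine-tuning sketch for longer utterances might look like the following. This is a sketch only: the `bucketing_strategy` and `bucketing_batch_size` keys follow NeMo-style ASR dataset configs but apply to tarred/bucketed datasets, and the exact key names and values should be checked against your NeMo version's `train_ds` schema.

```yaml
# Sketch: smaller per-step batch + gradient accumulation + bf16 for longer clips.
# Key names are assumptions modeled on NeMo ASR configs; verify against your version.
model:
  train_ds:
    max_duration: 30
    batch_size: 8                          # smaller per-step batch to fit longer clips
    bucketing_strategy: synced_randomized  # requires a tarred, bucketed dataset
trainer:
  precision: bf16
  accumulate_grad_batches: 4               # effective batch size of 32
```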
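To make the sizing heuristic above concrete, here is a worked example. The numbers are assumptions: roughly 600M parameters (Parakeet v3 is about 0.6B) and an 80 GB A100, and the formula itself is only a rough 4-bytes-per-parameter heuristic, not an exact activation-memory model.

```python
# Worked example of the rough sizing heuristic above.
# Assumptions: ~600M parameters, 80 GB of GPU memory (A100 80GB).
model_params = 600e6          # parameter count (assumed, not from the model card)
available_gpu_memory = 80.0   # GB

memory_per_sec = model_params * 4 / 1e9   # GB per second of audio (fp32 heuristic)
max_safe = available_gpu_memory / memory_per_sec

print(f"{memory_per_sec:.1f} GB per second of audio")   # 2.4 GB
print(f"~{max_safe:.0f} s of audio per batch before the heuristic predicts OOM")
```

Under these assumptions the heuristic lands at roughly 33 seconds of audio per batch, which is consistent with OOM appearing once `max_duration` is pushed toward 30 seconds at a non-trivial batch size.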
Hi,
I noticed a difference in model_config.yaml between the Parakeet v2 and Parakeet v3 models:
Parakeet v2:

```yaml
train_ds:
  max_duration: 40
validation_ds:
  max_duration: 40
```

Parakeet v3:

```yaml
train_ds:
  max_duration: 10
validation_ds:
  max_duration: 30
```
Does this mean that Parakeet v3 was trained with training samples truncated to 10 seconds, while Parakeet v2 was trained with up to 40-second samples?
If so, what was the motivation for reducing the training max_duration in v3?
When I try to increase train_ds.max_duration for Parakeet v3 to 30 seconds, I easily run into CUDA OOM errors on an A100 (80GB).
Are there recommended settings (batch size, gradient accumulation, bucketing strategy, etc.) to safely train v3 with longer utterances?
Any guidance on best practices for handling longer audio when fine-tuning Parakeet v3 would be appreciated.
Thanks!