[SpeechLM2] Bucketing training data #15360

AudranBert · 2026-02-05T16:52:22Z

AudranBert
Feb 5, 2026

Hi,
Following this comment and this config, I want to try bucketing, but I have several questions.
First, performance wise, is there any downside of using bucketing? Because since it will see all the data of the bucket at the same time. So we will probably see similar data since every dataset don't have the same audio duration distribution. Does it negatively impacts performance?
Will we be able to use weights for each dataset?
Bucket_duration_bins seems to be in number of tokens and not duration?

Answered by pzelasko

Feb 5, 2026

First, performance wise, is there any downside of using bucketing? Because since it will see all the data of the bucket at the same time. So we will probably see similar data since every dataset don't have the same audio duration distribution. Does it negatively impacts performance?

For sufficiently small data, yes, you will observer degradation. You can decrease the number of buckets then. Around 5-10k hours is where you definitely want to use bucketing, the diversity is large enough that the issue you described would not happen.
Bucket bins are determined based on data distribution to ensure roughly equal allocation of duration/tokens to each bucket to further alleviate the issue.

Wi…

View full answer

pzelasko · 2026-02-05T22:29:43Z

pzelasko
Feb 5, 2026
Maintainer

First, performance wise, is there any downside of using bucketing? Because since it will see all the data of the bucket at the same time. So we will probably see similar data since every dataset don't have the same audio duration distribution. Does it negatively impacts performance?

For sufficiently small data, yes, you will observer degradation. You can decrease the number of buckets then. Around 5-10k hours is where you definitely want to use bucketing, the diversity is large enough that the issue you described would not happen.
Bucket bins are determined based on data distribution to ensure roughly equal allocation of duration/tokens to each bucket to further alleviate the issue.

Will we be able to use weights for each dataset?

Yes. The distribution of datasets in specific batches will deviate from the weights, but the weights are still roughly what you get in expectation.

Bucket_duration_bins seems to be in number of tokens and not duration?

For ASR models and SpeechLM2 Duplex models it's duration. For SpeechLM2 SALM models it's tokens with token_equivalent_duration defining one token (e.g. 0.08 for typical fastconformer). Note that SpeechLM2 models can be jointly trained with text-only and audio+text data, so token is the common denominator.

1 reply

AudranBert Feb 6, 2026
Author

Ok thanks 👍

xXMrNidaXx · 2026-02-23T14:08:01Z

xXMrNidaXx
Feb 23, 2026

Good questions about bucketing tradeoffs!

Performance impact:

Pros of bucketing:
- Less padding waste (huge for variable-length audio)
- More efficient GPU utilization
- Can fit more samples per batch
Cons / risks:
- Correlation within batches — yes, similar-length samples often come from similar sources. Mitigation:
```
# Shuffle within buckets
bucket_shuffle: true
bucket_shuffle_seed: 42
```
- Epoch boundary effects — some buckets may finish before others

Dataset weights with bucketing:

Yes, you can combine both:

model:
  train_ds:
    manifest_filepath:
      - /data/dataset1/manifest.json
      - /data/dataset2/manifest.json
    weights: [0.6, 0.4]
    bucket_duration_bins: [4, 8, 12, 16, 20]  # seconds
    bucket_batch_size: [32, 24, 16, 12, 8]

bucket_duration_bins clarification:

It depends on the model:

For ASR: typically in seconds (audio duration)
For SpeechLM: can be tokens if using tokenized audio
Check the specific dataloader implementation

# In config, specify unit explicitly if supported
bucket_duration_bins: [4, 8, 12, 16]  # seconds
bucket_unit: "seconds"  # or "tokens"

Recommended setup:

bucketing_strategy: "precomputed"  # Compute bucket assignments once
bucket_cap_batches: 50  # Limit batch count per bucket for balance

We train speech models at Revolution AI — bucketing gives 20-40% throughput improvement on variable-length data. Worth the complexity!

1 reply

AudranBert Feb 24, 2026
Author

Hi, thanks, Can you tell me where bucket_cap_batches is in the documentation?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SpeechLM2] Bucketing training data #15360

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

[SpeechLM2] Bucketing training data #15360

Uh oh!

AudranBert Feb 5, 2026

Replies: 2 comments · 2 replies

Uh oh!

pzelasko Feb 5, 2026 Maintainer

Uh oh!

AudranBert Feb 6, 2026 Author

Uh oh!

xXMrNidaXx Feb 23, 2026

Uh oh!

AudranBert Feb 24, 2026 Author

AudranBert
Feb 5, 2026

Replies: 2 comments 2 replies

pzelasko
Feb 5, 2026
Maintainer

AudranBert Feb 6, 2026
Author

xXMrNidaXx
Feb 23, 2026

AudranBert Feb 24, 2026
Author