[SpeechLM2] Bucketing training data #15360
For sufficiently small data, yes, you will observe degradation; you can decrease the number of buckets then. Around 5-10k hours is where you definitely want to use bucketing: the diversity is large enough that the issue you described would not happen.
Yes. The distribution of datasets in specific batches will deviate from the weights, but the weights are still roughly what you get in expectation.
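To make the "weights hold in expectation" point concrete, here is a small standalone sketch (plain Python, not NeMo code; the dataset labels and batch size are made up): any single batch can deviate from the configured mix, but the long-run fraction converges to the weights.

```python
# Hypothetical illustration: per-batch dataset mix fluctuates around the
# configured weights, but the long-run average matches them closely.
import random

random.seed(0)
weights = [0.6, 0.4]            # configured dataset weights, as in the config
batch_size, num_batches = 16, 2000

counts = [0, 0]
first_batch = None
for b in range(num_batches):
    # Sample which dataset each example in the batch comes from.
    batch = random.choices([0, 1], weights=weights, k=batch_size)
    if first_batch is None:
        first_batch = batch
    for ds in batch:
        counts[ds] += 1

# A single batch may deviate noticeably from 0.6 ...
print("first batch fraction of dataset 0:", first_batch.count(0) / batch_size)
# ... but the long-run fraction is close to the configured weight.
print("long-run fraction of dataset 0: %.3f" % (counts[0] / sum(counts)))
```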
For ASR models and SpeechLM2 Duplex models it's duration. For SpeechLM2 SALM models it's tokens.
Good questions about bucketing tradeoffs!

Dataset weights with bucketing: Yes, you can combine both:

```yaml
model:
  train_ds:
    manifest_filepath:
      - /data/dataset1/manifest.json
      - /data/dataset2/manifest.json
    weights: [0.6, 0.4]
    bucket_duration_bins: [4, 8, 12, 16, 20]  # seconds
    bucket_batch_size: [32, 24, 16, 12, 8]
```

bucket_duration_bins clarification: It depends on the model:

```yaml
# In the config, specify the unit explicitly if supported
bucket_duration_bins: [4, 8, 12, 16]  # seconds
bucket_unit: "seconds"  # or "tokens"
```

Recommended setup:

```yaml
bucketing_strategy: "precomputed"  # compute bucket assignments once
bucket_cap_batches: 50             # limit batch count per bucket for balance
```

We train speech models at Revolution AI; in our experience bucketing gives a 20-40% throughput improvement on variable-length data. Worth the complexity!
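Where that throughput gain comes from can be sketched in a few lines (illustrative numbers, not a benchmark; the uniform duration distribution and batch size are assumptions): without bucketing, every batch is padded to its longest utterance, so a large fraction of computed frames is padding; grouping similar-length utterances shrinks that waste.

```python
# Illustrative sketch: padding overhead of random batching vs. batching
# duration-sorted data (an idealized stand-in for bucketing).
import random

random.seed(0)
durations = [random.uniform(1.0, 20.0) for _ in range(4096)]  # seconds
batch_size = 32

def padding_overhead(durs):
    """Fraction of frames that are padding when fixed-size batches are
    each padded to their longest example."""
    padded = real = 0.0
    for i in range(0, len(durs), batch_size):
        batch = durs[i:i + batch_size]
        padded += max(batch) * len(batch)
        real += sum(batch)
    return 1.0 - real / padded

random_waste = padding_overhead(durations)          # shuffled order
bucketed_waste = padding_overhead(sorted(durations))  # similar lengths together
print(f"random batching waste:   {random_waste:.1%}")
print(f"bucketed batching waste: {bucketed_waste:.1%}")
```

The gap between the two numbers is the headroom that bucketing recovers; the exact speedup depends on the duration distribution of your data.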
Bucket bins are determined from the data distribution so that each bucket receives a roughly equal share of duration/tokens, which further alleviates the issue.
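The equal-allocation idea above amounts to choosing bin edges at quantiles of the duration distribution. A minimal sketch of that (a hypothetical helper, not the NeMo estimation script; the synthetic durations are assumptions):

```python
# Sketch: pick bucket edges so each bucket holds roughly the same number
# of examples, i.e. edges at quantiles of the duration distribution.
import random

random.seed(0)
# Synthetic stand-in for per-utterance durations from a manifest.
durations = sorted(random.uniform(1.0, 20.0) for _ in range(10000))

def equal_count_bins(sorted_durs, num_buckets):
    """Return num_buckets - 1 interior edges splitting a sorted list of
    durations into equally populated buckets."""
    n = len(sorted_durs)
    return [sorted_durs[(i * n) // num_buckets] for i in range(1, num_buckets)]

bins = equal_count_bins(durations, 5)
print("bucket_duration_bins:", [round(b, 1) for b in bins])
```

With skewed real-world data these edges will be unevenly spaced, unlike a hand-picked uniform grid such as `[4, 8, 12, 16]`, which is exactly why estimating them from your data helps.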