First, performance-wise, is there any downside to using bucketing? The model sees all the data from a bucket at the same time, so it will probably see similar data together, since the datasets don't all have the same audio duration distribution. Does that negatively impact performance?

For sufficiently small data, yes, you will observe degradation. In that case you can decrease the number of buckets. Around 5-10k hours is where you definitely want to use bucketing: the diversity is large enough that the issue you described would not happen.
Bucket bins are determined from the data distribution so that each bucket gets a roughly equal share of the total duration/tokens, which further alleviates the issue.
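
To make the "roughly equal allocation of duration" idea concrete, here is a minimal sketch of how such bin edges could be estimated from a list of utterance durations. This is not the toolkit's actual implementation; the function names `estimate_duration_bins` and `assign_bucket` are hypothetical and only illustrate the principle.

```python
from bisect import bisect_left
from typing import List, Sequence


def estimate_duration_bins(durations: Sequence[float], num_buckets: int) -> List[float]:
    """Return up to num_buckets - 1 upper bounds (hypothetical helper).

    The bounds are chosen so that each bucket holds a similar share of the
    total duration, rather than a similar number of utterances.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    sorted_durs = sorted(durations)
    target = sum(sorted_durs) / num_buckets  # total duration per bucket
    bins: List[float] = []
    running = 0.0
    for dur in sorted_durs:
        if len(bins) == num_buckets - 1:
            break  # the last bucket takes everything that remains
        running += dur
        # Close the current bucket once its cumulative share is reached.
        if running >= target * (len(bins) + 1):
            bins.append(dur)
    return bins


def assign_bucket(duration: float, bins: Sequence[float]) -> int:
    """Index of the bucket for one utterance; each bound is inclusive."""
    return bisect_left(bins, duration)


if __name__ == "__main__":
    durs = [4, 1, 2, 1, 4, 1, 2, 1]  # toy durations in seconds
    bins = estimate_duration_bins(durs, num_buckets=2)
    print(bins, [assign_bucket(d, bins) for d in durs])
```

With bounds chosen this way, a bucket of short utterances contains many more items than a bucket of long ones, so each bucket still covers a comparable amount of audio. Passing a smaller `num_buckets` widens each bucket, which is the knob suggested above for small datasets.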

Wi…

Answer selected by AudranBert