feat: Add batch coalescing ability to shuffle reader exec #1380
Conversation
cc @milenkovicm

But one weird thing is that the performance does not seem to improve based on the benchmark test results (After vs. Before).

thanks @danielhumanmod will have a look asap
milenkovicm left a comment
thanks @danielhumanmod
I think this makes sense, I have just a few minor comments.
I would also propose postponing the merge of this for a bit; @andygrove is working on an automated benchmark framework, and it would be good to benchmark overall performance.
  log::debug!(
-     "ShuffleReaderExec::execute({task_id}) max_request_num: {max_request_num}, max_message_size: {max_message_size}"
+     "ShuffleReaderExec::execute({task_id}) max_request_num: {max_request_num}, max_message_size: {max_message_size}, batch_size: {batch_size}"
Do we need batch size here? It's part of the configuration; we can find it out if needed.
Makes sense, will remove.
  Ok(Box::pin(CoalescedShuffleReaderStream {
      schema: self.schema.clone(),
      input: input_stream,
      coalescer: LimitedBatchCoalescer::new(
What is the reason for using LimitedBatchCoalescer instead of BatchCoalescer, as there is no fetch limit?
Would there be a case where we can use a limit 🤔?
Maybe we could add a limit to ShuffleReader, defaulting to None, and add a with_limit option, so if we find a use case we can set it?
> What is the reason for using LimitedBatchCoalescer instead of BatchCoalescer, as there is no fetch limit? Would there be a case where we can use a limit 🤔?
Regarding LimitedBatchCoalescer, I initially chose it because it offers better encapsulation for the stream state machine compared to the raw BatchCoalescer. Specifically, it manages the finished state and provides PushBatchStatus, which makes the poll_next implementation cleaner and safer, even without a limit.
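The state-machine benefit described above can be sketched in simplified form. This is not DataFusion's actual API: the type and enum names below are illustrative, and batches are modeled as `Vec<i32>` instead of `RecordBatch`. The point is that when the coalescer owns the buffering and the "finished" flag, the stream's poll loop only has to react to a push status.

```rust
// Illustrative sketch, not DataFusion's LimitedBatchCoalescer API.
// The coalescer encapsulates buffering and the finished state, so a
// wrapping stream only matches on PushStatus in its poll loop.

enum PushStatus {
    Continue,   // batch absorbed, keep polling input
    BatchReady, // enough rows buffered, emit a coalesced batch
}

struct SketchCoalescer {
    target_rows: usize,
    buffer: Vec<i32>,
    finished: bool,
}

impl SketchCoalescer {
    fn new(target_rows: usize) -> Self {
        Self { target_rows, buffer: Vec::new(), finished: false }
    }

    fn push(&mut self, batch: Vec<i32>) -> PushStatus {
        assert!(!self.finished, "push after finish");
        self.buffer.extend(batch);
        if self.buffer.len() >= self.target_rows {
            PushStatus::BatchReady
        } else {
            PushStatus::Continue
        }
    }

    // Called once the input stream is exhausted; remaining rows are
    // flushed by subsequent next_batch() calls.
    fn finish(&mut self) {
        self.finished = true;
    }

    fn next_batch(&mut self) -> Option<Vec<i32>> {
        if self.buffer.is_empty() {
            None
        } else {
            let n = self.buffer.len().min(self.target_rows);
            Some(self.buffer.drain(..n).collect())
        }
    }
}

fn main() {
    let mut c = SketchCoalescer::new(4);
    // Two small input batches are merged into one target-sized batch.
    assert!(matches!(c.push(vec![1, 2]), PushStatus::Continue));
    assert!(matches!(c.push(vec![3, 4, 5]), PushStatus::BatchReady));
    assert_eq!(c.next_batch(), Some(vec![1, 2, 3, 4]));
    c.finish();
    // The tail left in the buffer is emitted after finish().
    assert_eq!(c.next_batch(), Some(vec![5]));
    assert_eq!(c.next_batch(), None);
}
```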
> Maybe we could add a limit to ShuffleReader, defaulting to None, and add a with_limit option, so if we find a use case we can set it?
Sounds like a good point. One use case I can think of is optimizations like Limit Pushdown to the shuffle read stage.
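The proposed optional limit could work roughly like the sketch below. This is a hypothetical illustration, not the PR's code: `apply_limit` is an invented helper, and batches are modeled as `Vec<i32>`. With `limit: Option<usize>` defaulting to `None`, behavior is unchanged unless a `with_limit`-style option sets it.

```rust
// Hypothetical sketch of an optional fetch limit for the shuffle reader:
// once `limit` rows have been produced, further batches are dropped and
// the last emitted batch is truncated to the remaining budget.
fn apply_limit(
    batch: Vec<i32>,
    produced: &mut usize,
    limit: Option<usize>,
) -> Option<Vec<i32>> {
    match limit {
        Some(l) if *produced >= l => None, // limit reached, drop the batch
        Some(l) => {
            let take = (l - *produced).min(batch.len());
            *produced += take;
            Some(batch[..take].to_vec())
        }
        None => {
            // Default: no limit, pass batches through untouched.
            *produced += batch.len();
            Some(batch)
        }
    }
}

fn main() {
    let mut produced = 0;
    // limit = 3: first batch passes in full, second is truncated,
    // third is dropped entirely.
    assert_eq!(apply_limit(vec![1, 2], &mut produced, Some(3)), Some(vec![1, 2]));
    assert_eq!(apply_limit(vec![3, 4], &mut produced, Some(3)), Some(vec![3]));
    assert_eq!(apply_limit(vec![5], &mut produced, Some(3)), None);
}
```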
  let target_batch_size = 10;
  …
  // 4. Manually build the CoalescedShuffleReaderStream
  let coalesced_stream = CoalescedShuffleReaderStream {
I believe adding a `new` method would make sense.
Good point, will add, thanks!
@andygrove if you need verification for the benchmarking framework, this PR is a good candidate.

For benchmarking the performance change for small batches, it is probably better to run against Parquet (to make the scan less of a bottleneck) and with a high number of task slots (to make the batch size small).
@sqlbenchmark run tpch

Ballista TPC-H Benchmark Results
PR: #1380 - impl
Results: Could not parse structured results. Raw output: Main branch: PR branch:
Automated benchmark run by dfbench
@milenkovicm the benchmark script failed because this PR does not contain the changes from #1391
Perhaps @danielhumanmod could rebase the PR
Force-pushed cf58271 to 118b6e8
Rebased, try the benchmark test again

@sqlbenchmark run tpch
@sqlbenchmark run tpch

@sqlbenchmark run tpch

@sqlbenchmark run tpch
Ballista TPC-H Benchmark Results
PR: #1380 - fix format and simplify new
Query Comparison
Total: Main=25278.90ms, PR=25549.30ms (+1.1%)
Automated benchmark run by dfbench
There is currently a list of contributors that have permission to run this. I will open this up to more people once it has been through more testing. It is very experimental at the moment.

@sqlbenchmark run tpch -s 10 -i 3
Ballista TPC-H Benchmark Results
PR: #1380 - fix format and simplify new
Query Comparison
Total: Main=25875.50ms, PR=25860.90ms (-0.1%)
Automated benchmark run by dfbench

Ballista TPC-H Benchmark Results
PR: #1380 - fix format and simplify new
Query Comparison
Total: Main=65316.90ms, PR=63255.40ms (-3.2%)
Automated benchmark run by dfbench
milenkovicm left a comment
Benchmarks look OK, I think this change makes sense.
thanks @danielhumanmod
Which issue does this PR close?
Closes #1379.
Rationale for this change
Currently, ShuffleReaderExec passes batches directly from the upstream AbortableReceiverStream to downstream operators. In scenarios with high parallelism or network fragmentation, the shuffle reader can receive small record batches. This causes performance degradation because downstream operators incur per-batch overhead on every call (e.g. in poll_next).
What changes are included in this PR?
Core Logic: Introduced CoalescedShuffleReaderStream as a stream adapter to buffer and merge small batches using DataFusion's BatchCoalescer.
Integration: Updated ShuffleReaderExec::execute to apply this wrapper, enabling configurable batch sizing via SessionConfig and preserving existing metrics.
Are there any user-facing changes?
No
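The stream-adapter idea described above can be sketched as a synchronous analogue. This is a hedged illustration, not Ballista's implementation: a plain `Iterator` stands in for the async `RecordBatchStream`, `Vec<i32>` stands in for `RecordBatch`, and the `CoalescedReader` name is invented for this example. It shows the core loop: pull upstream batches into a buffer until the configured target size is reached, then emit one merged batch.

```rust
// Illustrative adapter mirroring the coalescing stream's poll loop:
// accumulate small upstream batches and emit target-sized ones.
struct CoalescedReader<I: Iterator<Item = Vec<i32>>> {
    input: I,
    target_rows: usize,
    buffer: Vec<i32>,
    done: bool,
}

impl<I: Iterator<Item = Vec<i32>>> CoalescedReader<I> {
    fn new(input: I, target_rows: usize) -> Self {
        Self { input, target_rows, buffer: Vec::new(), done: false }
    }
}

impl<I: Iterator<Item = Vec<i32>>> Iterator for CoalescedReader<I> {
    type Item = Vec<i32>;

    fn next(&mut self) -> Option<Vec<i32>> {
        // Pull from the input until target_rows is buffered or the
        // input is exhausted, like the poll_next loop in the adapter.
        while !self.done && self.buffer.len() < self.target_rows {
            match self.input.next() {
                Some(batch) => self.buffer.extend(batch),
                None => self.done = true,
            }
        }
        if self.buffer.is_empty() {
            None
        } else {
            let n = self.buffer.len().min(self.target_rows);
            Some(self.buffer.drain(..n).collect())
        }
    }
}

fn main() {
    // Many tiny shuffle batches become a few target-sized ones.
    let tiny = vec![vec![1], vec![2], vec![3], vec![4], vec![5]];
    let out: Vec<_> = CoalescedReader::new(tiny.into_iter(), 2).collect();
    assert_eq!(out, vec![vec![1, 2], vec![3, 4], vec![5]]);
}
```

The design choice this mirrors is that the merge happens inside the reader rather than as a separate exec node, so downstream operators see fewer, larger batches with no plan changes.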