[Parquet] Support skipping pages with mask based evaluation #9118
Conversation
alamb
left a comment
Thank you @sdf-jkl -- this actually makes a lot of sense to me 👏
I have a few concerns:
- I am worried about the performance overhead of this approach (copying the page index and the loop for each batch) -- I will run some benchmarks to assess this
- I do wonder if we have test coverage for this entire situation -- in particular, do we have tests that repeatedly call next_mask_chunk after the first page and make sure we get the right rows?
If the performance looks good, I think we should add some more tests -- maybe @hhhizzz has some ideas on how to do this (or I think I can try and find some time to help out / work with codex to do so)
/// Using the row selection to skip(4), page2 won't be read at all, so in this
/// case we can't decode all the rows and apply a mask. To correctly apply the
/// bit mask, we need all 6 values be read, but page2 is not in memory.
fn override_selector_strategy_if_needed(
nice -- the idea is to avoid this function 👍
array_reader,
schema: Arc::new(schema),
read_plan,
page_offsets: page_offsets.map(|slice| Arc::new(slice.to_vec())),
So I think this will effectively copy the entire OffsetIndexMetaData structure (which I worry could be quite large)
I wonder if we need to find a way to avoid this (e.g. making the entire thing Arc'd in https://github.com/apache/arrow-rs/blob/67e04e758f1e62ec3d78d2f678daf433a4c54e30/parquet/src/file/metadata/mod.rs#L197-L196 somehow 🤔 )
We could store only the &Vec<PageLocation> instead of the entire OffsetIndexMetaData: df9a493
while cursor < mask.len() && selected_rows < batch_size {
    let mut page_end = mask.len();
    if let Some(pages) = page_locations {
        for loc in pages {
I am also a little worried that this loop will take too long (it is O(N^2) in the number of pages, since each time it looks through all pages).
Maybe we could potentially add a PageLocationIterator to the cursor itself (so we know where to pick up)
Maybe a binary search through a vec of page offsets? We would have to construct the vec once beforehand to keep us from rebuilding it.
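A rough sketch of that binary-search idea (not code from this PR), relying on the page locations being sorted by first_row_index, which they are in a valid offset index:
use parquet::format::PageLocation;

/// Find the index of the page containing `row` in O(log N) rather than
/// scanning all pages for every batch.
fn page_containing_row(pages: &[PageLocation], row: i64) -> Option<usize> {
    if pages.is_empty() || row < pages[0].first_row_index {
        return None;
    }
    // Number of pages whose first row is at or before `row`; the containing
    // page is the last of those.
    let idx = pages.partition_point(|p| p.first_row_index <= row);
    Some(idx - 1)
}
Keeping the page locations alongside the cursor would make each lookup a single partition_point call.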
Fixed in df9a493
run benchmark arrow_reader_clickbench arrow_reader_row_filter
🤖: Benchmark completed
🤖: Benchmark completed
run benchmark arrow_reader_clickbench arrow_reader_row_filter
🤖: Benchmark completed
🤖: Benchmark completed
@alamb @Dandandan clickbench q12, 24, 30 show some degradation, but everything else looks like an overall improvement.
let reader = ParquetRecordBatchReader::new(array_reader, plan);
let reader =
    ParquetRecordBatchReader::new(array_reader, plan, page_offsets.cloned());
cloned() may cause extra expense here; can we use Arc<[PageLocation]> to avoid that?
It's a big API change to make PageLocation or OffsetIndexMetaData an Arc inside ParquetMetaData.
If we want to make that change, I can open an issue and work up a PR.
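For illustration, a minimal sketch of the cheaper middle ground (the helper is made up, not the PR's API): copy the locations out of the offset index once when the reader is built and share them behind an Arc<[PageLocation]>, so later clones are just a refcount bump:
use std::sync::Arc;

use parquet::format::PageLocation;

// One allocation and copy when the reader is constructed...
fn share_page_locations(locations: &[PageLocation]) -> Arc<[PageLocation]> {
    Arc::from(locations.to_vec())
}

// ...after which every consumer clones the Arc instead of the Vec:
// let for_reader = Arc::clone(&shared);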
I agree with @hhhizzz that copying the offsets here is not good
I thought about it some more, and I think the reason the copy is currently needed is that the decision of whether the page should be skipped is postponed until the next MaskChunk is needed
One potential idea I had to avoid this is to use the page index in the ReadPlanBuilder when building, rather than passing the page index to every call to next_batch.
So maybe that would look something like extending MaskCursor from
/// Cursor for iterating a mask-backed [`RowSelection`]
///
/// This is best for dense selections where there are many small skips
/// or selections. For example, selecting every other row.
#[derive(Debug)]
pub struct MaskCursor {
mask: BooleanBuffer,
/// Current absolute offset into the selection
position: usize,
}
to also track what ranges should be skipped entirely. Maybe something like
#[derive(Debug)]
pub struct MaskCursor {
mask: BooleanBuffer,
/// Current absolute offset into the selection
position: usize,
/// Which row ranges should be skipped entirely?
skip_ranges: Vec<Range<usize>>,
}
That I think would simplify the logic for next_mask_chunk significantly and it would avoid the need to copy the entire page index.
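For what it's worth, here is one rough sketch of how those skip ranges could be precomputed from the offset index and the selection mask in the builder (the function name and placement are my own, not this PR's code):
use std::ops::Range;

use arrow_buffer::BooleanBuffer;
use parquet::format::PageLocation;

/// Collect the row ranges of pages that contain no selected rows, so the
/// cursor can skip over them without ever decoding the page.
fn compute_skip_ranges(
    mask: &BooleanBuffer,
    pages: &[PageLocation],
    total_rows: usize,
) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    for (i, page) in pages.iter().enumerate() {
        let start = page.first_row_index as usize;
        let end = pages
            .get(i + 1)
            .map(|next| next.first_row_index as usize)
            .unwrap_or(total_rows);
        // A page is skippable only if none of its rows are selected
        if (start..end.min(mask.len())).all(|row| !mask.value(row)) {
            ranges.push(start..end);
        }
    }
    ranges
}
next_mask_chunk would then only consult the precomputed ranges instead of walking the page index on every call.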
Thank you! @sdf-jkl, the code looks great, just wondering if we could add more unit tests.
Here's the existing test: arrow-rs/parquet/src/arrow/async_reader/mod.rs Line 1218 in 13d497a
I think we can just add one more unit test to test skipping pages with RowSelectionPolicy set to Mask instead of Auto
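Not this PR's test, but a rough shape such a unit test could take (the writer properties mirror the existing test; forcing RowSelectionPolicy::Mask rather than Auto would additionally need whatever builder hook this PR exposes for the policy, which is not shown here):
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use bytes::Bytes;
use parquet::arrow::arrow_reader::{
    ArrowReaderOptions, ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn read_with_skipped_first_page() -> Result<(), Box<dyn std::error::Error>> {
    // 6 rows with a 2-row page limit -> three pages per column chunk
    let batch = RecordBatch::try_from_iter([(
        "a",
        Arc::new(Int32Array::from_iter_values(0..6)) as ArrayRef,
    )])?;
    let props = WriterProperties::builder()
        .set_write_batch_size(2)
        .set_data_page_row_count_limit(2)
        .build();
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), Some(props))?;
    writer.write(&batch)?;
    writer.close()?;

    // Load the page index so page-level skipping can kick in, then skip the
    // first page (rows 0..2) entirely via the row selection.
    let options = ArrowReaderOptions::new().with_page_index(true);
    let selection = RowSelection::from(vec![RowSelector::skip(2), RowSelector::select(4)]);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(Bytes::from(buf), options)?
        .with_batch_size(3)
        .with_row_selection(selection)
        .build()?;

    let rows: usize = reader
        .map(|batch| batch.map(|b| b.num_rows()))
        .sum::<Result<usize, _>>()?;
    assert_eq!(rows, 4);
    Ok(())
}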
Shouldn't these tests resolve to
This test doesn't cover the
Unit tests for Bitmask skip page
Thank you @hhhizzz, I merged the whole thing
I hope to review this soon
FYI it turns out we hit a bug in this logic when we deployed the new predicate evaluator to production at InfluxData. Thanks to @erratic-pattern we also have a nice test reproducer. Since it seems to be related to this logic, I plan to review this PR now and try to figure out how to help.
I'll merge main and fix clippy to make it easier to review. I could also remove the #[should_panic] from the new test.
alamb
left a comment
Thank you @sdf-jkl and @hhhizzz -- I took a look at this PR and it looks like it is heading in the right direction
I had some structural suggestions and I also have an idea for some additional coverage (related to predicates).
Please let me know if you are willing to work on this, otherwise I am happy to take over this PR as well (given we are hitting the problem at work, and it is blocking our upgrade)
)
}) {
self.row_group_offset_index(row_group_idx)
    .and_then(|columns| columns.first())
I think this is a bug -- it reads the page offsets from the first column rather than the column being read
Maybe something like
self.row_group_offset_index(row_group_idx).and_then(|columns| {
columns
.iter()
.enumerate()
.find(|(leaf_idx, _)| self.projection.leaf_included(*leaf_idx))
.map(|(_, column)| column.page_locations())
Wouldn't the page offsets be the same for every column? It is, thanks!
I think even this would not work, because we actually need to keep page offsets for all projected columns and use them in ReadPlanBuilder (once we move it from ParquetRecordBatchReader)
So I guess we go back to using the whole &[OffsetIndexMetaData]
I plan to find some time this afternoon to work on this PR -- maybe I will come up with something
Another issue with the current implementation is that ParquetRecordBatchReader works page-aware using page offsets from a single column.
However, the read happens for all columns at once, using the same boolean mask (which is column-chunk specific).
https://github.com/apache/arrow-rs/pull/9118/changes#diff-850b3a44587149637b8545f66603a2b1252959fd36f7ddc55f37d6b5357816c6L1403
It seems that supporting different page offsets for each column would require us to push page awareness further down into the arrow readers.
while !mask_cursor.is_empty() {
    let Some(mask_chunk) = mask_cursor.next_mask_chunk(batch_size) else {
    let Some(mask_chunk) = mask_cursor.next_mask_chunk(batch_size, page_locations)
I expect that this API needs to be extended -- it needs to be able to represent "skip the next N rows without trying to decode them"
As written here I think the first page that doesn't have any rows selected will return None (which will trigger the reader to think it is at the end of the file, even if there is data left)
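For example (purely illustrative, not the PR's actual MaskChunk type), the value handed back to the reader could distinguish the two cases explicitly:
use arrow_buffer::BooleanBuffer;

/// Sketch of a chunk that can say either "decode these rows under this mask"
/// or "advance past this many rows without decoding them at all"
enum MaskChunkKind {
    /// Decode `mask.len()` rows and keep the ones where the mask is set
    Decode { mask: BooleanBuffer },
    /// Skip `rows` rows without decoding any values
    SkipUndecoded { rows: usize },
}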
The reader only thinks it's the end of the file when no further rows remain in mask_cursor. An empty page is handled by the initial skip in next_mask_chunk.
Definitely willing to work on this, thanks for the review and your input!
Awesome -- thanks @sdf-jkl -- I will switch focus for the rest of today and check back in tomorrow.
let props = WriterProperties::builder()
    .set_write_batch_size(2)
    .set_data_page_row_count_limit(2)
    .build();
I think the reason the tests pass is that the page offsets are the same for every column.
We limit pages by row count, not by size.
It's actually the same in the new test too...
@alamb It seems like I'm on to something with codex. The test passes, but I want to give it a read and a little polish first before sending it your way.
It seems like the issue was caused by differently sized pages after all. Bigger types would have more, smaller ("finer") pages and smaller types would have fewer, bigger ("coarser") pages. If the column with coarse pages was used to enable page awareness, we would use its page offsets. In the example above, col A with "coarse" pages overlaps with "finer" pages in col B that were skipped during data fetch. This led to the invalid offsets issue.
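If we want a regression test for exactly that shape, one way to provoke different page boundaries per column (a sketch; the limits are illustrative, not what the PR's test uses) is to cap pages by size in bytes instead of by row count, so wide and narrow columns split at different rows:
use parquet::file::properties::WriterProperties;

fn per_column_page_boundaries() -> WriterProperties {
    WriterProperties::builder()
        // Small byte budget per page: a string column will hit it after far
        // fewer rows than an i32 column, giving "finer" vs "coarser" pages.
        .set_data_page_size_limit(256)
        // Small write batches so the size limit is checked frequently.
        .set_write_batch_size(8)
        .build()
}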
/// Add offset index metadata for each column in a row group to this `ReadPlanBuilder`
pub fn with_offset_index_metadata(
Using the offsets of the column with the smallest number of rows per page should prevent the invalid offset issue from happening.
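A sketch of that heuristic (the helper name is mine, and the exact import path is an assumption): among the projected leaves, pick the offset index with the most pages, i.e. the finest page boundaries:
use parquet::file::page_index::offset_index::OffsetIndexMetaData;
use parquet::format::PageLocation;

/// Return the page locations of the projected column whose pages are the
/// finest-grained, so mask-based skipping never assumes a coarser boundary
/// than any column actually has.
fn finest_page_locations<'a>(
    columns: &'a [OffsetIndexMetaData],
    leaf_included: impl Fn(usize) -> bool,
) -> Option<&'a [PageLocation]> {
    columns
        .iter()
        .enumerate()
        .filter(|(leaf_idx, _)| leaf_included(*leaf_idx))
        .map(|(_, column)| &column.page_locations()[..])
        .max_by_key(|locations| locations.len())
}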
Which issue does this PR close?
Rationale for this change
Check issue.
What changes are included in this PR?
Made next_mask_chunk page aware by adding page_offsets to ParquetRecordBatchReader.
Are these changes tested?
Should be covered by existing tests from #8733
Are there any user-facing changes?