Skip to content

Conversation

@jimexist
Copy link
Member

@jimexist jimexist commented Jan 17, 2026

Which issue does this PR close?

adjust sbbf sizing for better cache efficiency, to make it more conformant with the spec

Rationale for this change

original rust implementation actually only used the vanilla math for calculating the ideal size, not adjusting (by compensating) size.

this fix will introduce the adjusted value per the original paper.

What changes are included in this PR?

  • increased size reservation in exchange for better cache efficiency

Are these changes tested?

  • adjusted unit test values
NDV: 100, Target FPP: 0.1
  Naive: 128 bytes, actual FPP: 0.011359 ✓
  Compensated: 128 bytes, actual FPP: 0.011359 ✓ meets target

NDV: 1000, Target FPP: 0.1
  Naive: 1024 bytes, actual FPP: 0.030103 ✓
  Compensated: 1024 bytes, actual FPP: 0.030103 ✓ meets target

NDV: 10000, Target FPP: 0.1
  Naive: 8192 bytes, actual FPP: 0.072424 ✓
  Compensated: 16384 bytes, actual FPP: 0.003541 ✓ meets target

NDV: 100, Target FPP: 0.01
  Naive: 128 bytes, actual FPP: 0.011359 ❌ EXCEEDS TARGET
  Compensated: 256 bytes, actual FPP: 0.000371 ✓ meets target

NDV: 1000, Target FPP: 0.01
  Naive: 2048 bytes, actual FPP: 0.001167 ✓
  Compensated: 2048 bytes, actual FPP: 0.001167 ✓ meets target

NDV: 10000, Target FPP: 0.01
  Naive: 16384 bytes, actual FPP: 0.003541 ✓
  Compensated: 16384 bytes, actual FPP: 0.003541 ✓ meets target

NDV: 100, Target FPP: 0.001
  Naive: 256 bytes, actual FPP: 0.000371 ✓
  Compensated: 256 bytes, actual FPP: 0.000371 ✓ meets target

NDV: 1000, Target FPP: 0.001
  Naive: 2048 bytes, actual FPP: 0.001167 ❌ EXCEEDS TARGET
  Compensated: 4096 bytes, actual FPP: 0.000032 ✓ meets target

NDV: 10000, Target FPP: 0.001
  Naive: 32768 bytes, actual FPP: 0.000103 ✓
  Compensated: 32768 bytes, actual FPP: 0.000103 ✓ meets target

NDV: 100, Target FPP: 0.0001
  Naive: 512 bytes, actual FPP: 0.000010 ✓
  Compensated: 512 bytes, actual FPP: 0.000010 ✓ meets target

NDV: 1000, Target FPP: 0.0001
  Naive: 4096 bytes, actual FPP: 0.000032 ✓
  Compensated: 4096 bytes, actual FPP: 0.000032 ✓ meets target

NDV: 10000, Target FPP: 0.0001
  Naive: 32768 bytes, actual FPP: 0.000103 ❌ EXCEEDS TARGET
  Compensated: 32768 bytes, actual FPP: 0.000103 ❌ EXCEEDS TARGET

NDV: 100, Target FPP: 0.00001
  Naive: 512 bytes, actual FPP: 0.000010 ✓
  Compensated: 1024 bytes, actual FPP: 0.000000 ✓ meets target

NDV: 1000, Target FPP: 0.00001
  Naive: 4096 bytes, actual FPP: 0.000032 ❌ EXCEEDS TARGET
  Compensated: 8192 bytes, actual FPP: 0.000001 ✓ meets target

NDV: 10000, Target FPP: 0.00001
  Naive: 65536 bytes, actual FPP: 0.000003 ✓
  Compensated: 65536 bytes, actual FPP: 0.000003 ✓ meets target

=== Summary ===
Naive implementation failures (exceeded target FPP): 4/15
Compensated implementation failures: 1/15
test bloom_filter::tests::test_compensation_benefit ... ok

Are there any user-facing changes?

@jimexist jimexist requested a review from Copilot January 17, 2026 01:41
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 17, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the SBBF (Split Block Bloom Filter) sizing calculation to improve cache efficiency and better conform to the Parquet format specification. The implementation introduces a compensation table based on the referenced academic paper to account for overhead from the Poisson distribution of elements across blocks.

Changes:

  • Replaced simple theoretical calculation with a compensation table-based approach
  • Added linear interpolation helper function for table lookups
  • Updated test expectations to reflect new sizing calculations

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jimexist jimexist changed the title adjust sbbf sizing for better cache efficiency parquet: adjust bloom filter sbbf sizing for better cache efficiency Jan 17, 2026
@jimexist jimexist changed the title parquet: adjust bloom filter sbbf sizing for better cache efficiency parquet: adjust bloom filter (sbbf) sizing for better cache efficiency Jan 17, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jimexist jimexist changed the title parquet: adjust bloom filter (sbbf) sizing for better cache efficiency parquet: adjust bloom filter (sbbf) sizing for better fpp and cache efficiency Jan 17, 2026
@jimexist jimexist requested a review from Copilot January 17, 2026 13:55
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bloom filter size table does not match implementations

1 participant