feat: add bloom filter when write sst #370
FWIW parquet has a form of bloom filter support, and I, and likely others, would be very interested in collaborating to get support added to arrow-rs/DataFusion if this was something you were amenable to. Just thought I'd mention it, as it has come up in a few IOx discussions lately 😄
This is very cool work @jiacai2050 👍 Along with @tustvold I would love to help collaborate on getting Bloom filter support into parquet (and then also datafusion) -- my first contribution to the process is writing up a ticket apache/arrow-rs#3023
@tustvold @alamb Thanks for your tips. Great write-up. Bloom filter support in CeresDB is at an early stage; we have just drafted an initial design for adding bloom filters, which differs from what parquet-format specifies. Simply put, the bloom filter is row-group based, and we intend to encode them in We will verify this design in our case ASAP; if it works as expected, we would be happy to contribute it to the parquet crates.
Sounds like a good plan -- thank you @jiacai2050 ! I think using a custom format in the metadata sounds like a good idea for CeresDB initially (it will be much faster than working it through the rest of the ecosystem). We would love to hear about your experience in implementing it
LGTM
🎉
* add bloom filter when write sst
* fix compile
* remove unwrap, add more error types
* fix test
* make pending batch local var
* Address comments
* fix partition record batch
* fix unittest
* add break
* add more test cases
Which issue does this PR close?
Part of #363
Rationale for this change
As described in #363, bloom filters are beneficial for columns with high cardinality, so this PR appends a bloom filter to the SST's meta when writing SST files.
What changes are included in this PR?
Update the SST write process to loop over the record batches twice:
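A minimal sketch of what a two-pass, row-group-based write could look like. It assumes the first pass partitions rows into row groups and the second builds one bloom filter per group to be stored in the SST meta. `BloomFilter`, `partition_rows`, and `build_row_group_filters` are hypothetical names for illustration, not CeresDB's actual types, and plain `String` keys stand in for record batches.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical fixed-size bloom filter (illustration only).
pub struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl BloomFilter {
    pub fn new(num_bits: u64, num_hashes: u32) -> Self {
        let num_bits = num_bits.max(64);
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Derive two base hashes; probe positions use double hashing (a + i * b).
    fn hash_pair<T: Hash>(value: &T) -> (u64, u64) {
        let mut h1 = DefaultHasher::new();
        value.hash(&mut h1);
        let a = h1.finish();
        let mut h2 = DefaultHasher::new();
        (a, 0x9e37_79b9_7f4a_7c15u64).hash(&mut h2);
        // Force the step to be odd so it is never zero.
        (a, h2.finish() | 1)
    }

    fn bit_index(&self, a: u64, b: u64, i: u32) -> u64 {
        a.wrapping_add(b.wrapping_mul(i as u64)) % self.num_bits
    }

    pub fn insert<T: Hash>(&mut self, value: &T) {
        let (a, b) = Self::hash_pair(value);
        for i in 0..self.num_hashes {
            let idx = self.bit_index(a, b, i);
            self.bits[(idx / 64) as usize] |= 1u64 << (idx % 64);
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    pub fn might_contain<T: Hash>(&self, value: &T) -> bool {
        let (a, b) = Self::hash_pair(value);
        (0..self.num_hashes).all(|i| {
            let idx = self.bit_index(a, b, i);
            self.bits[(idx / 64) as usize] & (1u64 << (idx % 64)) != 0
        })
    }
}

/// Assumed pass 1: split the rows into row groups of at most `rows_per_group`.
pub fn partition_rows(rows: &[String], rows_per_group: usize) -> Vec<&[String]> {
    rows.chunks(rows_per_group.max(1)).collect()
}

/// Assumed pass 2: build one filter per row group, to be kept in the SST meta.
pub fn build_row_group_filters(groups: &[&[String]], bits_per_group: u64) -> Vec<BloomFilter> {
    groups
        .iter()
        .map(|group| {
            let mut filter = BloomFilter::new(bits_per_group, 4);
            for key in group.iter() {
                filter.insert(key);
            }
            filter
        })
        .collect()
}
```

With filters kept per row group, a reader could later skip any row group whose filter returns `false` from `might_contain` for the queried key. A `true` result still requires reading the group, since bloom filters allow false positives.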
Are there any user-facing changes?
No
How does this change test
Add UT: test_partition_record_batch
Once pruning by bloom filter is supported, we can add more test cases.