feat: add bloom filter when write sst #370

Merged: 10 commits, Nov 8, 2022


@jiacai2050 jiacai2050 commented Nov 4, 2022

Which issue does this PR close?

Part of #363

Rationale for this change

As described in #363, bloom filters are beneficial for columns with high cardinality, so this PR appends a bloom filter to the SST's metadata when writing SST files.

What changes are included in this PR?

Update the SST write process to loop over the record batches twice:

  • first to build bloom filter
  • second to write to underlying storage.

Note: for now, a bloom filter is built for every column. Some columns may not be suitable for one; we can exclude them in the future to reduce the metadata size.
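The two-pass flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Bloom` type, the `write_sst` function, the filter sizing (1024 bits, 3 hashes), and the `Vec<String>` that stands in for the underlying storage are all hypothetical, and the sketch keeps one filter per column for the whole file for brevity.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical minimal bloom filter (not the type used in this PR).
struct Bloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl Bloom {
    fn new(num_bits: u64, num_hashes: u32) -> Self {
        let words = ((num_bits + 63) / 64) as usize;
        Bloom { bits: vec![0; words], num_bits, num_hashes }
    }

    /// i-th probe position via double hashing: h1 + i * h2 (mod num_bits).
    fn position(&self, value: &str, i: u32) -> u64 {
        let mut h1 = DefaultHasher::new();
        value.hash(&mut h1);
        let mut h2 = DefaultHasher::new();
        (value, 0x9e37_79b9u32).hash(&mut h2);
        let step = h2.finish() | 1; // force an odd, non-zero step
        h1.finish().wrapping_add((i as u64).wrapping_mul(step)) % self.num_bits
    }

    fn insert(&mut self, value: &str) {
        for i in 0..self.num_hashes {
            let p = self.position(value, i);
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    /// May return false positives, never false negatives.
    fn may_contain(&self, value: &str) -> bool {
        (0..self.num_hashes).all(|i| {
            let p = self.position(value, i);
            (self.bits[(p / 64) as usize] & (1u64 << (p % 64))) != 0
        })
    }
}

/// A batch is a list of rows; each row holds one string value per column.
type Batch = Vec<Vec<String>>;

/// Hypothetical writer: returns the per-column filters (the "meta")
/// and the written rows (a Vec stands in for real storage).
fn write_sst(batches: &[Batch], num_columns: usize) -> (Vec<Bloom>, Vec<String>) {
    // Pass 1: build one bloom filter per column.
    let mut filters: Vec<Bloom> = (0..num_columns).map(|_| Bloom::new(1024, 3)).collect();
    for batch in batches {
        for row in batch {
            for (col, value) in row.iter().enumerate() {
                filters[col].insert(value);
            }
        }
    }

    // Pass 2: write the same batches to the (stand-in) storage.
    let mut storage = Vec::new();
    for batch in batches {
        for row in batch {
            storage.push(row.join(","));
        }
    }
    (filters, storage)
}
```

Building the filters before writing means the complete metadata is known by the time the file is finalized, at the cost of iterating over the buffered batches a second time.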

Are there any user-facing changes?

No

How is this change tested?

Added a unit test: test_partition_record_batch

Once pruning by bloom filter is supported, we can add more test cases.


tustvold commented Nov 4, 2022

FWIW parquet has a form of bloom filter support, and I, and likely others, would be very interested in collaborating to get support added to arrow-rs/DataFusion if this was something you were amenable to. Just thought I'd mention it, as it has come up in a few IOx discussions lately 😄


alamb commented Nov 5, 2022

This is very cool work @jiacai2050 👍

Along with @tustvold I would love to help collaborate on getting Bloom filter support into parquet (and then also datafusion) -- my first contribution to the process is writing up a ticket apache/arrow-rs#3023

jiacai2050 (PR author) commented

@tustvold @alamb Thanks for your tips. Great write-up.

Bloom filter support in CeresDB is at an early stage. We have just drafted an initial design for adding bloom filters, and it differs from what parquet-format specifies.

Simply put, the bloom filters are row-group based, and we intend to encode them in key_value_metadata instead of in each row group's meta, so we can read them all out in one IO.

We will verify this design against our use case ASAP. If it works as expected, we would be happy to contribute it to the parquet crates.
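As a rough illustration of the idea in this comment, the sketch below packs the filters of all row groups into a single string value stored under one metadata key, so one lookup returns every filter. It is only a sketch under stated assumptions: the key name `example.bloom_filters`, the length-prefixed hex layout, and the `HashMap` standing in for the file's key/value metadata are all hypothetical and are not the encoding CeresDB uses. Hex is used only because Parquet key/value metadata values are strings.

```rust
use std::collections::HashMap;

/// Hypothetical metadata key (an assumption, not CeresDB's actual key).
const BLOOM_KEY: &str = "example.bloom_filters";

/// Encode one raw filter per row group as:
/// 8 hex digits of byte length, followed by the bytes in hex.
fn encode_filters(row_group_filters: &[Vec<u8>]) -> String {
    let mut out = String::new();
    for filter in row_group_filters {
        out.push_str(&format!("{:08x}", filter.len()));
        for byte in filter {
            out.push_str(&format!("{:02x}", byte));
        }
    }
    out
}

/// Decode the layout produced by `encode_filters`; None on malformed input.
fn decode_filters(encoded: &str) -> Option<Vec<Vec<u8>>> {
    let mut filters = Vec::new();
    let mut rest = encoded;
    while !rest.is_empty() {
        let len = usize::from_str_radix(rest.get(..8)?, 16).ok()?;
        rest = &rest[8..];
        let hex_len = len.checked_mul(2)?;
        let body = rest.get(..hex_len)?;
        let mut bytes = Vec::with_capacity(len);
        for i in 0..len {
            let pair = body.get(i * 2..i * 2 + 2)?;
            bytes.push(u8::from_str_radix(pair, 16).ok()?);
        }
        filters.push(bytes);
        rest = &rest[hex_len..];
    }
    Some(filters)
}

/// Store all row-group filters under a single metadata entry.
fn attach(meta: &mut HashMap<String, String>, filters: &[Vec<u8>]) {
    meta.insert(BLOOM_KEY.to_string(), encode_filters(filters));
}

/// Read every row-group filter back with one metadata lookup.
fn load(meta: &HashMap<String, String>) -> Option<Vec<Vec<u8>>> {
    decode_filters(meta.get(BLOOM_KEY)?)
}
```

The index of each decoded filter corresponds to the row group's position in the file, so a reader can decide which row groups to skip after reading only the metadata.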


alamb commented Nov 5, 2022

> Simply put, the bloom filters are row-group based, and we intend to encode them in key_value_metadata instead of in each row group's meta, so we can read them all out in one IO.

Sounds like a good plan -- thank you @jiacai2050 ! I think using a custom format in the metadata sounds like a good idea for CeresDB initially (it will be much faster than working it through the rest of the ecosystem). We would love to hear about your experience in implementing it

@jiacai2050 jiacai2050 marked this pull request as ready for review November 7, 2022 08:01
Resolved review threads on:

  • common_types/src/datum.rs
  • analytic_engine/src/sst/file.rs
  • proto/protos/sst.proto
  • analytic_engine/src/sst/parquet/builder.rs


@ShiKaiWi ShiKaiWi left a comment

LGTM

@ShiKaiWi ShiKaiWi merged commit 9d9d422 into apache:main Nov 8, 2022
@jiacai2050 jiacai2050 deleted the feat-add-bf branch November 8, 2022 09:20

alamb commented Nov 8, 2022

🎉

chunshao90 pushed a commit to chunshao90/ceresdb that referenced this pull request May 15, 2023
* add bloom filter when write sst

* fix compile

* remove unwrap, add more error types

* fix test

* make pending batch local var

* Address comments

* fix partition record batch

* fix unittest

* add break

* add more test cases