
Add ArrayReader::skip_records API #2197

Closed
tustvold opened this issue Jul 27, 2022 · 4 comments
Labels: enhancement (Any new improvement worthy of a entry in the changelog), parquet (Changes to the parquet crate)

@tustvold (Contributor)

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The skip records API added to the ArrayReader trait as part of #1998 does not provide a way to combine multiple selections into the same batch. This is unfortunate, as columnar query engines will often want consistently large RecordBatches so that any per-batch dispatch overheads can be amortised over many rows. Whilst a query engine could concatenate batches together after the fact, e.g. DataFusion's CoalesceBatchesExec, it would be more efficient to do this directly on read and eliminate an additional copy.

Ultimately, doing this is supported by the underlying machinery, i.e. RecordReader; it just isn't exposed by ArrayReader.

Describe the solution you'd like

Much like RecordReader, we need to separate reading records from consuming the resulting data, i.e. replace ArrayReader::next_batch with ArrayReader::read_records and ArrayReader::consume_batch.
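To make the proposed split concrete, here is a minimal sketch, assuming illustrative names and signatures (a toy `ArrayData` and `MockReader`, not the actual parquet crate API): the reader buffers rows via `read_records`/`skip_records` and emits everything buffered at once via `consume_batch`, so one output batch can span multiple row selections.

```rust
// Hypothetical sketch of the proposed split; all types and signatures
// here are illustrative assumptions, not the real parquet crate API.

/// Stand-in for a decoded Arrow array.
#[derive(Debug, PartialEq)]
struct ArrayData(Vec<i32>);

trait ArrayReader {
    /// Buffer up to `batch_size` records; may be called repeatedly so a
    /// single output batch can span multiple row selections.
    fn read_records(&mut self, batch_size: usize) -> usize;
    /// Skip `num_records` without decoding them into the buffer.
    fn skip_records(&mut self, num_records: usize) -> usize;
    /// Consume everything buffered so far as one batch.
    fn consume_batch(&mut self) -> ArrayData;
}

/// Toy reader over an in-memory column, for illustration only.
struct MockReader {
    values: Vec<i32>,
    pos: usize,
    buffered: Vec<i32>,
}

impl ArrayReader for MockReader {
    fn read_records(&mut self, batch_size: usize) -> usize {
        let n = batch_size.min(self.values.len() - self.pos);
        self.buffered
            .extend_from_slice(&self.values[self.pos..self.pos + n]);
        self.pos += n;
        n
    }

    fn skip_records(&mut self, num_records: usize) -> usize {
        let n = num_records.min(self.values.len() - self.pos);
        self.pos += n;
        n
    }

    fn consume_batch(&mut self) -> ArrayData {
        ArrayData(std::mem::take(&mut self.buffered))
    }
}

fn main() {
    let mut reader = MockReader {
        values: (0..10).collect(),
        pos: 0,
        buffered: Vec::new(),
    };
    // Two selections (rows 0..3 and 6..9) accumulate into one batch.
    reader.read_records(3);
    reader.skip_records(3);
    reader.read_records(3);
    println!("{:?}", reader.consume_batch());
}
```

The key difference from a `next_batch`-style API is that decoding and batch emission are decoupled, so the caller decides when a batch is "full".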

Describe alternatives you've considered

We could not do this; however, if we are going to make this change, we should probably do it before we make the record skipping API public (#1792).

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Jul 27, 2022
@tustvold (Contributor, Author)

Thoughts @Ted-Jiang ?

@Ted-Jiang (Member)

Yes, I agree this needs improvement before making the API public.

> Much like RecordReader we need to separate read_records from consuming the resulting data, i.e. replace ArrayReader::next_batch with ArrayReader::read_records and ArrayReader::consume_batch.

I think you mean: we can call read_records multiple times until there are enough values in the buffer, then call consume_batch. This avoids small batches (currently, if selection_len is less than batch_size, a batch with only selection_len rows is returned).

How about putting this combining logic in impl Iterator for ParquetRecordBatchReader? If we need to call read_records multiple times, that should depend on the selections, so why not add a loop in the Iterator to feed enough rows into the result batch 🤔

@tustvold (Contributor, Author)

> why not add a loop check in Iterator to feed enough rows in result batch

Agreed, ParquetRecordBatchReader will need its logic modified to drive these new methods.
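A minimal sketch of that driving loop, using simplified stand-in types (a `RowSelector` queue and row ids in place of the real ParquetRecordBatchReader internals, which are assumptions for illustration): the Iterator keeps consuming selections, skipping or reading, until batch_size rows are buffered, then emits one coalesced batch.

```rust
// Illustrative sketch only; RowSelector/BatchDriver are simplified
// stand-ins, not the real ParquetRecordBatchReader internals.
use std::collections::VecDeque;

#[derive(Clone, Copy)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

struct BatchDriver {
    selections: VecDeque<RowSelector>,
    batch_size: usize,
    next_row: usize, // cursor into the underlying column data
}

impl Iterator for BatchDriver {
    /// Stand-in for a RecordBatch: the row ids included in the batch.
    type Item = Vec<usize>;

    fn next(&mut self) -> Option<Vec<usize>> {
        let mut buffered = Vec::new();
        // Loop across selections so one output batch can combine several.
        while buffered.len() < self.batch_size {
            let Some(mut sel) = self.selections.pop_front() else { break };
            if sel.skip {
                self.next_row += sel.row_count; // would be skip_records
                continue;
            }
            let take = (self.batch_size - buffered.len()).min(sel.row_count);
            buffered.extend(self.next_row..self.next_row + take); // read_records
            self.next_row += take;
            if take < sel.row_count {
                sel.row_count -= take;
                self.selections.push_front(sel); // remainder feeds the next batch
            }
        }
        (!buffered.is_empty()).then(|| buffered) // would be consume_batch
    }
}

fn main() {
    let selections = VecDeque::from(vec![
        RowSelector { row_count: 3, skip: false },
        RowSelector { row_count: 3, skip: true },
        RowSelector { row_count: 3, skip: false },
    ]);
    let driver = BatchDriver { selections, batch_size: 4, next_row: 0 };
    for batch in driver {
        println!("{:?}", batch);
    }
}
```

With batch_size = 4, the two read selections (3 rows each, separated by a 3-row skip) coalesce into a 4-row batch followed by a 2-row remainder, rather than two 3-row batches.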

@Ted-Jiang (Member) commented Jul 29, 2022

@tustvold Are you working on this? Maybe I can implement it tomorrow 😊


3 participants