
Add ArrayReader::skip_records API #2197

Closed
tustvold opened this issue Jul 27, 2022 · 4 comments
Labels: enhancement (Any new improvement worthy of a entry in the changelog), parquet (Changes to the parquet crate)

@tustvold (Contributor)

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The skip records API added to the ArrayReader trait as part of #1998 does not provide a way to combine multiple selections into the same batch. This is unfortunate, as columnar query engines will often want consistently large RecordBatches so that any per-batch dispatch overheads can be amortised over many rows. Whilst a query engine could concatenate batches together after the fact, e.g. DataFusion's CoalesceBatchesExec, it would be more efficient to do this directly on read and eliminate an additional copy.

Ultimately, doing this is supported by the underlying machinery, i.e. RecordReader; it just isn't exposed by ArrayReader.

Describe the solution you'd like

Much like RecordReader, we need to separate reading records from consuming the resulting data, i.e. replace ArrayReader::next_batch with ArrayReader::read_records and ArrayReader::consume_batch.
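To make the proposed split concrete, here is a minimal sketch, assuming illustrative names and signatures (a toy `ArrayData` and `MockReader`, not the actual parquet crate API): the reader buffers rows via `read_records`/`skip_records` and emits everything buffered at once via `consume_batch`, so one output batch can span multiple row selections.

```rust
// Hypothetical sketch of the proposed split; all types and signatures
// here are illustrative assumptions, not the real parquet crate API.

/// Stand-in for a decoded Arrow array.
#[derive(Debug, PartialEq)]
struct ArrayData(Vec<i32>);

trait ArrayReader {
    /// Buffer up to `batch_size` records; may be called repeatedly so a
    /// single output batch can span multiple row selections.
    fn read_records(&mut self, batch_size: usize) -> usize;
    /// Skip `num_records` without decoding them into the buffer.
    fn skip_records(&mut self, num_records: usize) -> usize;
    /// Consume everything buffered so far as one batch.
    fn consume_batch(&mut self) -> ArrayData;
}

/// Toy reader over an in-memory column, for illustration only.
struct MockReader {
    values: Vec<i32>,
    pos: usize,
    buffered: Vec<i32>,
}

impl ArrayReader for MockReader {
    fn read_records(&mut self, batch_size: usize) -> usize {
        let n = batch_size.min(self.values.len() - self.pos);
        self.buffered
            .extend_from_slice(&self.values[self.pos..self.pos + n]);
        self.pos += n;
        n
    }

    fn skip_records(&mut self, num_records: usize) -> usize {
        let n = num_records.min(self.values.len() - self.pos);
        self.pos += n;
        n
    }

    fn consume_batch(&mut self) -> ArrayData {
        ArrayData(std::mem::take(&mut self.buffered))
    }
}

fn main() {
    let mut reader = MockReader {
        values: (0..10).collect(),
        pos: 0,
        buffered: Vec::new(),
    };
    // Two selections (rows 0..3 and 6..9) accumulate into one batch.
    reader.read_records(3);
    reader.skip_records(3);
    reader.read_records(3);
    println!("{:?}", reader.consume_batch());
}
```

The key difference from a `next_batch`-style API is that decoding and batch emission are decoupled, so the caller decides when a batch is "full".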

Describe alternatives you've considered

We could not do this; however, if we are going to make this change, we should probably do it before we make the record skipping API public (#1792).

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Jul 27, 2022
@tustvold (Contributor, Author)

Thoughts @Ted-Jiang ?

@Ted-Jiang (Member)

Yes, I agree this needs improvement before making the API public.

> Much like RecordReader we need to separate read_records from consuming the resulting data, i.e. replace ArrayReader::next_batch with ArrayReader::read_records and ArrayReader::consume_batch.

I think you mean: we can call read_records multiple times until there are enough values in the buffer, then call consume_batch. This avoids small batches (currently, if selection_len is less than batch_size, a batch with only selection_len rows is returned).

How about putting this combining logic in impl Iterator for ParquetRecordBatchReader? If we need to call read_records multiple times, that should depend on the selections, so why not add a loop in the Iterator to feed enough rows into the result batch 🤔

@tustvold (Contributor, Author)

> why not add a loop check in Iterator to feed enough rows in result batch

Agreed, ParquetRecordBatchReader will need its logic modified to drive these new methods.
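A minimal sketch of that driving loop, using simplified stand-in types (a `RowSelector` queue and row ids in place of the real ParquetRecordBatchReader internals, which are assumptions for illustration): the Iterator keeps consuming selections, skipping or reading, until batch_size rows are buffered, then emits one coalesced batch.

```rust
// Illustrative sketch only; RowSelector/BatchDriver are simplified
// stand-ins, not the real ParquetRecordBatchReader internals.
use std::collections::VecDeque;

#[derive(Clone, Copy)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

struct BatchDriver {
    selections: VecDeque<RowSelector>,
    batch_size: usize,
    next_row: usize, // cursor into the underlying column data
}

impl Iterator for BatchDriver {
    /// Stand-in for a RecordBatch: the row ids included in the batch.
    type Item = Vec<usize>;

    fn next(&mut self) -> Option<Vec<usize>> {
        let mut buffered = Vec::new();
        // Loop across selections so one output batch can combine several.
        while buffered.len() < self.batch_size {
            let Some(mut sel) = self.selections.pop_front() else { break };
            if sel.skip {
                self.next_row += sel.row_count; // would be skip_records
                continue;
            }
            let take = (self.batch_size - buffered.len()).min(sel.row_count);
            buffered.extend(self.next_row..self.next_row + take); // read_records
            self.next_row += take;
            if take < sel.row_count {
                sel.row_count -= take;
                self.selections.push_front(sel); // remainder feeds the next batch
            }
        }
        (!buffered.is_empty()).then(|| buffered) // would be consume_batch
    }
}

fn main() {
    let selections = VecDeque::from(vec![
        RowSelector { row_count: 3, skip: false },
        RowSelector { row_count: 3, skip: true },
        RowSelector { row_count: 3, skip: false },
    ]);
    let driver = BatchDriver { selections, batch_size: 4, next_row: 0 };
    for batch in driver {
        println!("{:?}", batch);
    }
}
```

With batch_size = 4, the two read selections (3 rows each, separated by a 3-row skip) coalesce into a 4-row batch followed by a 2-row remainder, rather than two 3-row batches.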

@Ted-Jiang (Member) commented Jul 29, 2022

@tustvold Are you working on this? Maybe I can implement it tomorrow 😊


3 participants