Stub out Skip Records API (#1792) #1998

Merged
merged 5 commits into from Jul 7, 2022

Conversation

tustvold
Contributor

@tustvold tustvold commented Jul 3, 2022

Which issue does this PR close?

Part of #1792

Rationale for this change

Stubs out an API for providing skip records functionality within parquet. I think this will work to support #1792, #1191 and potentially other functionality down the line.

Let me know what you think @Ted-Jiang @sunchao

What changes are included in this PR?

Stubs out APIs for adding row skipping logic to the parquet implementation

Are there any user-facing changes?

No 🎉

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 3, 2022
@codecov-commenter

codecov-commenter commented Jul 3, 2022

Codecov Report

Merging #1998 (c81b77d) into master (c757829) will decrease coverage by 0.15%.
The diff coverage is 62.29%.

❗ Current head c81b77d differs from pull request most recent head 2a572d7. Consider uploading reports for the commit 2a572d7 to get more accurate results

@@            Coverage Diff             @@
##           master    #1998      +/-   ##
==========================================
- Coverage   83.58%   83.42%   -0.16%     
==========================================
  Files         222      222              
  Lines       57522    57906     +384     
==========================================
+ Hits        48078    48309     +231     
- Misses       9444     9597     +153     
Impacted Files Coverage Δ
parquet/src/arrow/array_reader/byte_array.rs 84.47% <0.00%> (-1.24%) ⬇️
...et/src/arrow/array_reader/byte_array_dictionary.rs 82.26% <0.00%> (-1.66%) ⬇️
...uet/src/arrow/array_reader/complex_object_array.rs 93.20% <0.00%> (-1.07%) ⬇️
parquet/src/arrow/array_reader/empty_array.rs 45.45% <0.00%> (-10.11%) ⬇️
parquet/src/arrow/array_reader/list_array.rs 92.69% <0.00%> (-0.72%) ⬇️
parquet/src/arrow/array_reader/map_array.rs 58.82% <0.00%> (-8.98%) ⬇️
parquet/src/arrow/array_reader/mod.rs 88.23% <ø> (ø)
parquet/src/arrow/array_reader/null_array.rs 81.48% <0.00%> (-6.52%) ⬇️
parquet/src/arrow/array_reader/primitive_array.rs 88.63% <0.00%> (-1.02%) ⬇️
parquet/src/arrow/array_reader/struct_array.rs 78.99% <0.00%> (-9.69%) ⬇️
... and 23 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c757829...2a572d7. Read the comment docs.

@Ted-Jiang
Member

Cool! 👍 @tustvold Are you the Flash 😄? I will try to go through this and give you my opinion today.

pub(crate) fn with_row_selection(
self,
selection: impl Into<Vec<RowSelection>>,
) -> Self {
Member

@Ted-Jiang Ted-Jiang Jul 4, 2022

Could we add total_row_count to check that this selection is valid (for example, that it is contiguous)?

Contributor Author

Is it actually an issue if it isn't, e.g. if I only want the first 100 rows?

Member

Yes, got it. That should be checked on the user side.


/// Gets metadata about the next page, returns an error if no
/// column index information
fn peek_next_page(&self) -> Result<Option<PageMetadata>>;
Member

👍 really need this abstraction!

}

impl Iterator for ParquetRecordBatchReader {
type Item = ArrowResult<RecordBatch>;

fn next(&mut self) -> Option<Self::Item> {
- match self.array_reader.next_batch(self.batch_size) {
+ let to_read = match self.selection.as_mut() {
Member

@Ted-Jiang Ted-Jiang Jul 4, 2022

👍 Passing the mask here, rather than to each column, is more reasonable 😂

Member

@Ted-Jiang Ted-Jiang left a comment

👍 I think this abstraction is great! Thanks for your effort! ❤️

Left some comments, most are
Maybe after this PR merges, I will continue to work on the page index.

/// API for reading pages from a column chunk.
/// This offers a iterator like API to get the next page.
pub trait PageReader: Iterator<Item = Result<Page>> + Send {
/// Gets the next page in the column chunk associated with this reader.
/// Returns `None` if there are no pages left.
fn get_next_page(&mut self) -> Result<Option<Page>>;

/// Gets metadata about the next page, returns an error if no
/// column index information
Member

@Ted-Jiang Ted-Jiang Jul 4, 2022

Do we only need the offset index here, without the min/max index? 🤔

Contributor Author

I think so
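
As an illustration of why per-page row counts alone could be enough for skipping, here is a minimal sketch. It is not code from this PR: `SketchPageMetadata` and `plan_skip` are hypothetical names, and the sketch assumes the row count of each upcoming page is known in advance.

```rust
/// Hypothetical stand-in for per-page metadata; only the row count is used.
struct SketchPageMetadata {
    num_rows: usize,
}

/// Plans a skip of `to_skip` rows over the upcoming pages.
///
/// Returns `(whole_pages, leftover)`: the number of leading pages that can be
/// skipped without decoding, and the rows still to skip afterwards (inside the
/// next page, or past the end if the pages run out).
fn plan_skip(pages: &[SketchPageMetadata], mut to_skip: usize) -> (usize, usize) {
    let mut whole_pages = 0;
    for page in pages {
        if page.num_rows > to_skip {
            // This page is only partially skipped, so it must be decoded.
            break;
        }
        to_skip -= page.num_rows;
        whole_pages += 1;
    }
    (whole_pages, to_skip)
}
```

With pages of 100, 50 and 200 rows, skipping 170 rows would drop the first two pages entirely and leave 20 rows to skip inside the third. No min/max statistics are consulted anywhere in this plan.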

parquet/src/arrow/record_reader/mod.rs Outdated

self.consume_def_levels();
self.consume_rep_levels();
self.consume_record_data();
Member

Is this for the case where a page has already been partially read via read_records, leaving some unread data in the buffer?

Member

Sorry, I don't get this point: why not call column_reader.skip_records(num_records) directly?
Could you give me a hint?

Contributor Author

RecordReader is a bit of an odd cookie, let me try to explain what it is doing.

In the absence of repetition levels, it can simply read batch_size levels, and the corresponding number of values.

However, if repetition levels are present, it will likely need to read more than batch_size levels in order to read batch_size actual records (rows).

To achieve this it reads to its internal buffer and then splits off the data corresponding to batch_size rows, leaving the excess behind.

It is this excess of data that has been read to its buffers but not yielded to the caller yet, which we must consume here
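
To make the buffering described above concrete, here is a minimal sketch. It is not the actual RecordReader: `levels_for_records` and `split_off_records` are hypothetical helpers, and the sketch assumes the usual convention that a repetition level of 0 marks the start of a new record.

```rust
/// Given buffered repetition levels, returns `(records, end)`: how many
/// complete records were found (at most `batch_size`) and how many levels
/// they span. A record is only known to be complete once the next level of 0
/// is seen, or the column has ended.
fn levels_for_records(
    rep_levels: &[i16],
    batch_size: usize,
    end_of_column: bool,
) -> (usize, usize) {
    if batch_size == 0 {
        return (0, 0);
    }
    let mut records = 0;
    let mut end = 0;
    for (idx, &level) in rep_levels.iter().enumerate().skip(1) {
        if level == 0 {
            // A new record starts at `idx`, so the previous one is complete.
            records += 1;
            end = idx;
            if records == batch_size {
                return (records, end);
            }
        }
    }
    if end_of_column && !rep_levels.is_empty() {
        // The final record is terminated by the end of the column.
        records += 1;
        end = rep_levels.len();
    }
    (records, end)
}

/// Splits the levels for up to `batch_size` records off the front of `buffer`,
/// leaving the excess behind in `buffer`.
fn split_off_records(
    buffer: &mut Vec<i16>,
    batch_size: usize,
    end_of_column: bool,
) -> (usize, Vec<i16>) {
    let (records, end) = levels_for_records(buffer, batch_size, end_of_column);
    let excess = buffer.split_off(end);
    let yielded = std::mem::replace(buffer, excess);
    (records, yielded)
}
```

In this sketch, whatever remains in `buffer` after a call plays the role of the excess that a skip has to consume before anything can be delegated further down.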

Member

@Ted-Jiang Ted-Jiang Jul 5, 2022

👍 Nice write-up! It saved me some time 😄
So, I get it now. A few more specific questions:
This is part of the skip: we need to read the repetition and definition levels to skip some records in the page (which may or may not have been read already).

        let (buffered_records, buffered_values) = self.count_records(num_records);
        self.num_records += buffered_records;
        self.num_values += buffered_values;

        self.consume_def_levels();
        self.consume_rep_levels();
        self.consume_record_data();
        self.consume_bitmap();
        self.reset();

        // Records still to skip once the buffered ones have been consumed
        let remaining = num_records - buffered_records;

This is also part of the skip: if remaining > 0, I think we start skipping at a new page.

        if remaining == 0 {
            return Ok(buffered_records);
        }

        let skipped = match self.column_reader.as_mut() {
            Some(column_reader) => column_reader.skip_records(remaining)?,
            None => 0,
        };

Contributor Author

This is part of the skip: we need to read the repetition and definition levels to skip some records in the page (which may or may not have been read already).

Yes, this is just to consume the data that has been read to the internal buffers of RecordReader if any

This is also part of the skip: if remaining > 0, I think we start skipping at a new page.

Not necessarily, the only thing RecordReader needs to handle is skipping any data that has already been read from ColumnReader into its own buffers. It can then delegate to ColumnReader to skip the remaining rows, with no requirement that this is done at a page boundary - ColumnReader must be able to handle any case.
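
The division of responsibility described here can be sketched as follows. This is a simplified model, not the PR's code: `SketchRecordReader` and `SketchColumnReader` are hypothetical types that only track record counts, with no levels, values or pages.

```rust
/// Hypothetical stand-in for the underlying column reader: it only tracks
/// how many records are left in the column chunk.
struct SketchColumnReader {
    remaining_records: usize,
}

impl SketchColumnReader {
    /// Skips up to `num_records`, returning how many were actually skipped.
    fn skip_records(&mut self, num_records: usize) -> usize {
        let skipped = num_records.min(self.remaining_records);
        self.remaining_records -= skipped;
        skipped
    }
}

/// Hypothetical stand-in for the record reader: `buffered_records` counts
/// records already read into its buffers but not yet yielded.
struct SketchRecordReader {
    buffered_records: usize,
    column_reader: Option<SketchColumnReader>,
}

impl SketchRecordReader {
    /// Skips up to `num_records`, returning how many were actually skipped.
    fn skip_records(&mut self, num_records: usize) -> usize {
        // Step 1: consume whatever is already buffered, up to the request.
        let from_buffer = num_records.min(self.buffered_records);
        self.buffered_records -= from_buffer;

        // Step 2: delegate the rest; no page boundary is assumed here.
        let remaining = num_records - from_buffer;
        if remaining == 0 {
            return from_buffer;
        }
        let skipped = match self.column_reader.as_mut() {
            Some(reader) => reader.skip_records(remaining),
            None => 0,
        };
        from_buffer + skipped
    }
}
```

The record reader in this model never inspects page boundaries; where the delegated skip lands within a page is entirely the column reader's concern.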

@tustvold tustvold marked this pull request as ready for review July 5, 2022 12:46
tustvold and others added 2 commits July 5, 2022 09:23
Contributor

@alamb alamb left a comment

The API looks good to me -- I had some questions and I think it would be nicer to return NotImplemented errors rather than panic in certain cases but I think this PR could also be merged as is to unblock further dev work

@@ -210,6 +214,10 @@ impl<I: OffsetSizeTrait + ScalarValue> ColumnValueDecoder

decoder.read(out, range.end - range.start, self.dict.as_ref())
}

fn skip_values(&mut self, _num_values: usize) -> Result<usize> {
todo!()
Contributor

I think adding a ticket reference here like
unimplemented!("See https://github.com/apache/arrow-rs/.....") would help future readers

Bonus points for returning ArrowError::Unimplemented

This comment applies to everything below as well
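
A minimal sketch of what returning an error instead of panicking could look like. The error type, trait and message below are hypothetical and defined only for this example; the real crates have their own error types, and the ticket to reference is not specified in this thread.

```rust
use std::fmt;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum SketchError {
    NotYetImplemented(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::NotYetImplemented(msg) => write!(f, "not yet implemented: {}", msg),
        }
    }
}

impl std::error::Error for SketchError {}

/// Hypothetical trait mirroring the shape of the stubbed method.
trait SketchValueDecoder {
    fn skip_values(&mut self, num_values: usize) -> Result<usize, SketchError>;
}

struct StubDecoder;

impl SketchValueDecoder for StubDecoder {
    fn skip_values(&mut self, _num_values: usize) -> Result<usize, SketchError> {
        // Return an error the caller can handle rather than panicking via todo!().
        // The message would ideally name the tracking ticket.
        Err(SketchError::NotYetImplemented(
            "skip_values is not yet implemented".to_string(),
        ))
    }
}
```

Callers can then propagate the failure with `?` and the process keeps running, whereas `todo!()` would abort the whole read with a panic.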


/// Scan rows from the parquet file according to the provided `selection`
///
/// TODO: Make public once row selection fully implemented
Contributor

perhaps worth a ticket?

/// [`RowSelection`] allows selecting or skipping a provided number of rows
/// when scanning the parquet file
#[derive(Debug, Clone, Copy)]
pub(crate) struct RowSelection {
Contributor

You probably already have thought about this, but I would expect that in certain scenarios, non contiguous rows / skips would be desired

Like "fetch the first 100 rows, skip the next 200, and then fetch the remaining"

Would this interface handle that case?

Contributor Author

@tustvold tustvold Jul 6, 2022

See with_row_selection which takes a Vec to allow for this use-case
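
For illustration, the "first 100, skip 200, fetch the rest" case can be expressed as a list of runs. This is a standalone sketch: `SketchSelection` and `selected_ranges` are hypothetical and only modelled on the idea of the crate-private `RowSelection`, and the total of 1000 rows is an assumed figure.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a row selection: a run of rows that is either
/// selected or skipped.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SketchSelection {
    row_count: usize,
    skip: bool,
}

impl SketchSelection {
    /// Select the next `row_count` rows.
    fn select(row_count: usize) -> Self {
        Self { row_count, skip: false }
    }

    /// Skip the next `row_count` rows.
    fn skip(row_count: usize) -> Self {
        Self { row_count, skip: true }
    }
}

/// Expands a list of selections into the half-open row ranges to read.
fn selected_ranges(selections: &[SketchSelection]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut offset = 0;
    for s in selections {
        if !s.skip && s.row_count > 0 {
            ranges.push(offset..offset + s.row_count);
        }
        offset += s.row_count;
    }
    ranges
}
```

For a 1000-row file, `[select(100), skip(200), select(700)]` expands to the ranges `0..100` and `300..1000`. A list that stops early, such as `[select(100)]`, reads only the first 100 rows, which matches the earlier point that a selection need not cover every row.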

@@ -555,6 +555,14 @@ impl<T: Read + Send> PageReader for SerializedPageReader<T> {
// We are at the end of this column chunk and no more page left. Return None.
Ok(None)
}

fn peek_next_page(&self) -> Result<Option<PageMetadata>> {
todo!()
Contributor

ditto returning "not yet implemented" would probably be nicer

@@ -146,15 +146,15 @@ impl LevelsBufferSlice for DefinitionLevelBuffer {
}
}

- pub struct DefinitionLevelDecoder {
+ pub struct DefinitionLevelBufferDecoder {
Contributor

Is this rename a public API change as well? It does not appear in the docs

https://docs.rs/parquet/17.0.0/parquet/?search=DefinitionLevelDecoder

Contributor Author

No, it is crate-local.

Labels
arrow-flight (Changes to the arrow-flight crate), parquet (Changes to the parquet crate)

4 participants