
Allow Overriding AsyncFileReader used by ParquetExec #3051

Merged: 25 commits into apache:master, Aug 9, 2022

Conversation

@Cheappie (Contributor) commented Aug 6, 2022

First steps towards overriding AsyncFileReader used by ParquetExec
#2992

@github-actions bot added the core (Core datafusion crate) label on Aug 6, 2022
@codecov-commenter commented Aug 6, 2022

Codecov Report

Merging #3051 (70bc233) into master (098f0b0) will increase coverage by 0.01%.
The diff coverage is 92.45%.

❗ Current head 70bc233 differs from pull request most recent head e18d26f. Consider uploading reports for the commit e18d26f to get more accurate results

@@            Coverage Diff             @@
##           master    #3051      +/-   ##
==========================================
+ Coverage   85.93%   85.95%   +0.01%     
==========================================
  Files         289      290       +1     
  Lines       52118    52243     +125     
==========================================
+ Hits        44788    44903     +115     
- Misses       7330     7340      +10     
Impacted Files Coverage Δ
datafusion/core/src/datasource/listing/mod.rs 55.55% <ø> (ø)
...afusion/core/src/physical_plan/file_format/avro.rs 0.00% <ø> (ø)
...tafusion/core/src/physical_plan/file_format/mod.rs 96.95% <50.00%> (-0.42%) ⬇️
.../core/src/physical_plan/file_format/file_stream.rs 88.88% <69.23%> (-2.10%) ⬇️
datafusion/core/src/datasource/listing/helpers.rs 95.01% <93.75%> (+0.05%) ⬆️
...sion/core/src/physical_plan/file_format/parquet.rs 95.34% <93.93%> (-0.12%) ⬇️
datafusion/core/tests/custom_parquet_reader.rs 95.50% <95.50%> (ø)
datafusion/core/src/datasource/file_format/mod.rs 91.30% <100.00%> (+0.39%) ⬆️
...afusion/core/src/datasource/file_format/parquet.rs 86.26% <100.00%> (ø)
...tafusion/core/src/physical_plan/file_format/csv.rs 94.47% <100.00%> (ø)
... and 3 more


@tustvold (Contributor) left a comment

I think this is a good step forward, left some minor nits

Review thread on datafusion/core/src/physical_plan/file_format/parquet.rs (outdated, resolved)

}
}

struct ThinFileReader {
@tustvold (Contributor) commented Aug 8, 2022

We should probably fix the need for this upstream. In the meantime perhaps we could call this BoxedAsyncFileReader to make clear what it is for.

Might also make it a tuple struct to hammer home the fact that it is just a newtype wrapper. A doc comment would also help.
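For readers following along, a minimal sketch of the kind of newtype wrapper being suggested, assuming the AsyncFileReader signatures from parquet 19 (get_bytes, get_byte_ranges and get_metadata returning BoxFutures); the code that actually landed in the PR may differ in its details:

use std::ops::Range;
use std::sync::Arc;

use bytes::Bytes;
use futures::future::BoxFuture;
use parquet::arrow::async_reader::AsyncFileReader;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

/// Newtype wrapper so a boxed, dynamically dispatched reader can be handed to
/// APIs that expect a concrete type implementing `AsyncFileReader`.
struct BoxedAsyncFileReader(Box<dyn AsyncFileReader + Send>);

impl AsyncFileReader for BoxedAsyncFileReader {
    fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, Result<Bytes>> {
        // Forward to the wrapped reader
        self.0.get_bytes(range)
    }

    fn get_byte_ranges(&mut self, ranges: Vec<Range<usize>>) -> BoxFuture<'_, Result<Vec<Bytes>>>
    where
        Self: Send,
    {
        self.0.get_byte_ranges(ranges)
    }

    fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
        self.0.get_metadata()
    }
}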


Sure. By the way, do you know any better way to solve this than, for example, making upstream accept Box<dyn AsyncFileReader + Send> in the constructor?

@tustvold (Contributor) commented Aug 8, 2022

Within parquet we could impl AsyncFileReader for Box<dyn AsyncFileReader>, but tbh I wonder if we should just type-erase by default 🤔
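A self-contained illustration of the type-erasure idea, using a simplified local trait as a stand-in (the real blanket impl for Box<dyn AsyncFileReader> could only live inside the parquet crate because of the orphan rules):

use std::ops::Range;

trait FileReader {
    fn read_range(&mut self, range: Range<usize>) -> Vec<u8>;
}

// Blanket impl: once `Box<dyn FileReader>` itself implements the trait,
// generic APIs accept boxed readers directly and no newtype wrapper is needed.
impl<R: FileReader + ?Sized> FileReader for Box<R> {
    fn read_range(&mut self, range: Range<usize>) -> Vec<u8> {
        (**self).read_range(range)
    }
}

struct InMemory(Vec<u8>);

impl FileReader for InMemory {
    fn read_range(&mut self, range: Range<usize>) -> Vec<u8> {
        self.0[range].to_vec()
    }
}

// Stands in for a generic constructor such as the parquet stream builder.
fn consume<R: FileReader>(mut reader: R) -> Vec<u8> {
    reader.read_range(0..2)
}

fn main() {
    let boxed: Box<dyn FileReader> = Box::new(InMemory(vec![1, 2, 3, 4]));
    assert_eq!(consume(boxed), vec![1, 2]);
}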

Additional review threads (outdated, resolved) on datafusion/core/src/physical_plan/file_format/parquet.rs (five threads) and datafusion/core/src/datasource/listing/mod.rs (one thread).
@tustvold (Contributor) left a comment

Looks good to me, thank you 👍

@Cheappie changed the title from "[WIP] Allow Overriding AsyncFileReader used by ParquetExec" to "Allow Overriding AsyncFileReader used by ParquetExec" on Aug 8, 2022
@Cheappie (Author) commented Aug 8, 2022

> Looks good to me, thank you 👍

Thank you @tustvold for the code review, it helped me a lot to make the code nicer.

@tustvold (Contributor) commented Aug 8, 2022

Failing clippy lint, then this can go in

@Cheappie (Author) commented Aug 8, 2022

I have added fixes; the PR might pass the clippy lints now. But it might also fail, because the compiler reports that there is something wrong with the AsyncFileReader trait:

warning: the trait `AsyncFileReader` cannot be made into an object
warning: this was previously accepted by the compiler but is being phased out; it will become a hard error in a future release!
116 |     fn get_byte_ranges(
    |        ^^^^^^^^^^^^^^^ the trait cannot be made into an object because method `get_byte_ranges` references the `Self` type in its `where` clause
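To spell out what that message means, here is a simplified, self-contained stand-in for the offending shape (not the real parquet definition): the where Self: Send bound on the method is the part that "references the Self type in its where clause", and it is what gets in the way of using the trait as a trait object.

use std::future::Future;
use std::ops::Range;
use std::pin::Pin;

trait AsyncRangeReader {
    fn get_byte_ranges(
        &mut self,
        ranges: Vec<Range<usize>>,
    ) -> Pin<Box<dyn Future<Output = Vec<Vec<u8>>> + Send + '_>>
    where
        Self: Send; // this bound is what the compiler objects to when forming a trait object
}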

@tustvold (Contributor) commented Aug 8, 2022

Oops, let me fix that upstream. In the meantime, can you add an #[allow] annotation?

Edit: Upstream fix apache/arrow-rs#2366

@Cheappie (Author) commented Aug 8, 2022

I have added allows in a few places, both where the issue originates and at some higher levels, but the compiler still won't be silenced.

I have also tried this allow at the top of the file:
#![allow(where_clauses_object_safety)]

By the way, the compiler points to the parquet crate, parquet-19.0.0/src/arrow/async_reader.rs:116:8, as the origin of the error.

@tustvold (Contributor) commented Aug 8, 2022

Let me have a play and see if I can find some way to placate clippy

@tustvold (Contributor) commented Aug 8, 2022

So sticking #![allow(where_clauses_object_safety)] at the top of datafusion/core/src/lib.rs is the only way I've found to make this work...

I'll have a chat with @alamb and see what would be the circumstances for a patch release
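For reference, the shape of that workaround as a crate-root inner attribute (the comment wording below is ours; the attribute and file are the ones named above):

// datafusion/core/src/lib.rs
// Crate-level allow: per-item allows elsewhere were not enough to silence the
// future-compatibility warning triggered by the dyn AsyncFileReader usages.
#![allow(where_clauses_object_safety)]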

@tustvold (Contributor) commented Aug 8, 2022

Hacky workaround in Cheappie#1

@alamb if you could give this a once-over when you have a moment, this is the least terrible way I have found to make this work...

@alamb (Contributor) left a comment

LGTM

My biggest recommendation is to add an end-to-end example (like https://github.com/apache/arrow-datafusion/blob/master/datafusion-examples/examples/custom_datasource.rs#L160) for two reasons:

  1. Possibly help others use this feature
  2. (More importantly) ensure that the plumbing (especially the user-defined extensions) continues to work. I think this code is likely to undergo significant churn over the next few months as we push more predicates down closer to parquet, and without an end-to-end test we may end up breaking something inadvertently.

cc @thinkharderdev and @yjshen

@@ -401,6 +404,33 @@ fn create_dict_array(
))
}

/// A single file or part of a file that should be read, along with its schema, statistics
Contributor

Maybe I am missing something but I don't see "schema" and "statistics" on this struct

Perhaps we should refer to "extensions" instead?

Contributor Author

This description has been copied from PartitionedFile. At first I had a similar impression that something was wrong, but on a second read it actually doesn't say that schema and statistics are part of the struct, but rather that these should be read from the file along with the data.

@@ -401,6 +404,33 @@ fn create_dict_array(
))
}

/// A single file or part of a file that should be read, along with its schema, statistics
pub struct FileMeta {
Contributor

This struct makes a lot of sense to me

fn create_reader(
&self,
partition_index: usize,
file_meta: FileMeta,
Contributor

the file meta contains user defined extensions too, right? So users can pass information down into their custom readers?

Contributor Author

Yes, exactly that. But AFAIK they will be able to achieve that only through a custom TableProvider, because extensions are passed through PartitionedFile into FileMeta, whereas the higher-level abstractions like ObjectStore only produce ObjectMeta in listing operations.

Contributor

makes sense
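To make the flow described above concrete, here is a hedged sketch using simplified local stand-ins (these are not the real DataFusion structs, which carry more fields such as object store metadata and byte ranges): a custom TableProvider attaches an Any payload to its PartitionedFiles, the execution plan carries it into FileMeta, and a custom reader factory downcasts it back.

use std::any::Any;
use std::sync::Arc;

// Simplified stand-ins for illustration only.
struct PartitionedFile {
    path: String,
    extensions: Option<Arc<dyn Any + Send + Sync>>,
}

struct FileMeta {
    path: String,
    extensions: Option<Arc<dyn Any + Send + Sync>>,
}

// Hypothetical payload a custom TableProvider could attach.
struct CacheHint {
    prefer_local_cache: bool,
}

// Stand-in for the reader factory hook: it recovers the payload by downcasting.
fn create_reader(file_meta: &FileMeta) {
    match file_meta
        .extensions
        .as_deref()
        .and_then(|ext| ext.downcast_ref::<CacheHint>())
    {
        Some(hint) => println!(
            "opening {} (prefer_local_cache = {})",
            file_meta.path, hint.prefer_local_cache
        ),
        None => println!("opening {} with default reader settings", file_meta.path),
    }
}

fn main() {
    // A custom TableProvider is where extensions would be populated; the
    // listing-based path only fills files from object store metadata.
    let file = PartitionedFile {
        path: "data/part-0.parquet".to_string(),
        extensions: Some(Arc::new(CacheHint { prefer_local_cache: true })),
    };

    // The execution plan carries extensions from PartitionedFile into FileMeta.
    let meta = FileMeta {
        path: file.path.clone(),
        extensions: file.extensions.clone(),
    };
    create_reader(&meta);
}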


///
/// BoxedAsyncFileReader has been created to satisfy type requirements of
/// parquet stream builder constructor.
Contributor

Is this necessary after apache/arrow-rs#2366? If not perhaps we can add a reference here so we can eventually clean it up?

Contributor

apache/arrow-rs#2368 is the one that will remove the need for this

@alamb (Contributor) commented Aug 9, 2022

Here is another example of a test that ensures user defined plan nodes don't get messed up: https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/user_defined_plan.rs (this has saved me several times from breaking them downstream)

@Cheappie (Author) commented Aug 9, 2022

I have added a test that checks whether data access ops are routed to the parquet file reader factory; it seems to work as expected.

@alamb (Contributor) commented Aug 9, 2022

> I have added a test that checks whether data access ops are routed to the parquet file reader factory, is it sufficient?

The only other thing I would suggest is making it a "cargo integration test" by putting it in datafusion/core/tests/custom_parquet_reader.rs or something. The reason is that being in datafusion/core/src/datasource/file_format/parquet.rs allows it access to crate-local things (so, for example, the test will not catch it if the necessary traits are accidentally not publicly exposed, which has happened when we have moved code around in the past 🤦).

}

#[derive(Debug)]
struct InMemoryParquetFileReaderFactory(Arc<dyn ObjectStore>);
Contributor

❤️

Contributor

this test looks great

Contributor Author

I have moved that test to the integration tests, but for users' convenience I had to expose two things: ParquetFileMetrics and fn fetch_parquet_metadata(...). I have added docs describing that these are subject to change in the near future and are exposed for low-level integrations. Should we worry about exposing impl details, or is having such docs in place, noting that these things might change, enough?

Contributor

Thanks @Cheappie -- I think the comments are good enough

In fact it is perhaps good to see what was required to be made pub to use these new APIs. Hopefully it makes downstream integrations easier

@alamb (Contributor) commented Aug 9, 2022

Looks like there are some minor merge conflicts and then I think this PR is ready to go. Thanks again @tustvold and @Cheappie

@alamb (Contributor) commented Aug 9, 2022

I took the liberty of updating the RAT in e18d26f to fix CI https://github.com/apache/arrow-datafusion/runs/7750069190?check_suite_focus=true

@alamb (Contributor) commented Aug 9, 2022

Looking into the Clippy failure....

@Cheappie (Author) commented Aug 9, 2022

> I took the liberty of updating the RAT in e18d26f to fix CI https://github.com/apache/arrow-datafusion/runs/7750069190?check_suite_focus=true

thanks

> Looking into the Clippy failure....

@alamb I forgot to add #![allow(where_clauses_object_safety)] at the top of the custom_parquet_reader.rs file; would you like to do that, or would you prefer me to add that line? It is the same problem with AsyncFileReader as we have in other places, which needs the upstream change already addressed by @tustvold in apache/arrow-rs#2372.

@Cheappie (Author) commented Aug 9, 2022

this time clippy should be happy, fingers crossed

@alamb (Contributor) commented Aug 9, 2022

> this time clippy should be happy, fingers crossed

Thank you for sticking with this @Cheappie

@alamb merged commit 9956f80 into apache:master on Aug 9, 2022
@ursabot commented Aug 9, 2022

Benchmark runs are scheduled for baseline = 01202d6 and contender = 9956f80. 9956f80 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels: core (Core datafusion crate)
6 participants