Accept parquet schemas without explicitly required Map keys #5630

jupiter
Contributor

@jupiter jupiter commented Apr 11, 2024

Which issue does this PR close?

Closes #5606.

Rationale for this change

The check is superfluous and restricts reading of a huge number of files produced without Map keys explicitly marked as required.

What changes are included in this PR?

Narrows the check so that it errors only when the Map key has been explicitly set to an invalid value.
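
As a rough illustration of the difference, here is a minimal sketch of a strict check next to a relaxed one. It is not the code from this PR: the `Repetition` enum is a local stand-in for the parquet crate's type, both function names and the error strings are hypothetical, and treating `Repeated` as the only invalid value is an assumption about what "explicitly set to an invalid value" means.

```rust
/// Local stand-in for the parquet crate's repetition type, so that the
/// sketch compiles on its own (hypothetical copy, not the real enum).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Strict variant: any key that is not explicitly `Required` is rejected.
/// This is the behaviour the linked issue complains about.
#[allow(dead_code)]
fn check_map_key_strict(rep: Repetition) -> Result<(), String> {
    if rep != Repetition::Required {
        return Err("Map keys must be required".to_string());
    }
    Ok(())
}

/// Relaxed variant (assumption): only a `Repeated` key is rejected,
/// because a writer would have had to set that value on purpose.
#[allow(dead_code)]
fn check_map_key_relaxed(rep: Repetition) -> Result<(), String> {
    match rep {
        Repetition::Repeated => Err("Map keys cannot be repeated".to_string()),
        Repetition::Required | Repetition::Optional => Ok(()),
    }
}
```

Under these assumptions, a key written as `Optional` fails the strict check but passes the relaxed one, while a `Repeated` key is rejected by both.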

Are there any user-facing changes?

None found

@github-actions github-actions bot added the parquet label Apr 11, 2024
@tustvold
Contributor

tustvold commented Apr 11, 2024

Can we add a test for this? I suspect the dremel shredding will not work correctly with just this change.

Perhaps you could get such a file added to the parquet-testing repo so that an integration test can be written.

@jupiter
Contributor Author

jupiter commented Apr 12, 2024

I'm now ensuring that the key is always non-nullable in Arrow, and I've added a test in the module that follows the same pattern as the other schema conversions.

I can provide a test file, but I'm not sure how integration tests are written, and I don't see any reference to the test files other than in the Readme. Should I just add the file to 'parquet-testing/data' and add an entry in the Readme?
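
To make the "key is always non-nullable in Arrow" point concrete, here is a small self-contained sketch. It does not use the real arrow or parquet types: `Repetition`, `ArrowField`, `field_from_repetition`, and `map_key_field` are all hypothetical names introduced for illustration, and deriving nullability from `Optional` is a simplifying assumption.

```rust
/// Local stand-in for the parquet repetition type (hypothetical copy).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Simplified stand-in for an Arrow field: just a name and a nullable flag.
#[derive(Debug, Clone, PartialEq)]
struct ArrowField {
    name: String,
    nullable: bool,
}

/// Generic conversion (assumption): an optional Parquet field becomes a
/// nullable Arrow field.
#[allow(dead_code)]
fn field_from_repetition(name: &str, rep: Repetition) -> ArrowField {
    ArrowField {
        name: name.to_string(),
        nullable: rep == Repetition::Optional,
    }
}

/// Map keys are special-cased: whatever repetition the file declares,
/// the resulting Arrow key field is forced to be non-nullable.
#[allow(dead_code)]
fn map_key_field(name: &str, rep: Repetition) -> ArrowField {
    let mut field = field_from_repetition(name, rep);
    field.nullable = false;
    field
}
```

With this shape, a key declared as `Optional` in the file still maps to a non-nullable Arrow field, while an `Optional` value field stays nullable.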

@tustvold
Contributor

https://github.com/apache/parquet-testing/pull/45/files would be a good example

@jupiter
Contributor Author

jupiter commented Apr 15, 2024

Also see apache/parquet-testing#47 (comment)

@jupiter jupiter marked this pull request as ready for review April 15, 2024 11:19
@tustvold
Contributor

I would like to see if we can get apache/parquet-testing#47 in first so that we can get a full end-to-end test here. If things drag out, though, we can move ahead as is.

@tustvold
Contributor

apache/parquet-testing#47 was merged last week, so I think it should just be a case of updating this PR with an integration test based off that

@jupiter
Contributor Author

jupiter commented Apr 26, 2024

How does that look? I've added a test in what I think is an appropriate place, with similar tests. Let me know if there's something specific we should add to the test.

Contributor

@Jefffrey Jefffrey left a comment

You'll need to bump the relevant submodule too in order to get that new test file

Contributor

@Jefffrey Jefffrey left a comment

I think you just need to fix clippy, and then this should be good to go?

parquet/src/arrow/schema/complex.rs (outdated, resolved)
@tustvold
Contributor

Thank you for sticking with this 👍

@tustvold tustvold merged commit 6348dc3 into apache:master Apr 30, 2024
16 checks passed
Labels
parquet Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

Cannot read Parquet files that do not specify Map keys as required
3 participants