Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add full data validation for ArrayData::try_new() #921

Merged

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Nov 4, 2021

Which issue does this PR close?

Closes #817

Rationale for this change

Improve the security arrow-rs by validating untrusted user input, and resolves several outstanding potential security issues by doing deep checks of the data passed to ArrayData::try_new()

What changes are included in this PR?

  1. Add full validation to ArrayData::try_new() based on the checks in the C++ implementation, kindly pointed out (and contributed) by @pitrou

Are there any user-facing changes?

  • ArrayData::try_new() may return an Err now on some (invalid) inputs rather than succeeding and will take more time.
  • There is a new pub ArrayData::validate_full() that performs full ArrayData validation

Testing

Follow On:

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 4, 2021
@alamb alamb force-pushed the alamb/full_array_data_construction_validation branch 4 times, most recently from fda948d to f968803 Compare November 8, 2021 19:19
@codecov-commenter
Copy link

codecov-commenter commented Nov 8, 2021

Codecov Report

Merging #921 (2e6f3af) into master (9fb2a5f) will increase coverage by 0.01%.
The diff coverage is 86.78%.

❗ Current head 2e6f3af differs from pull request most recent head 4436472. Consider uploading reports for the commit 4436472 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #921      +/-   ##
==========================================
+ Coverage   82.29%   82.31%   +0.01%     
==========================================
  Files         168      168              
  Lines       48761    49031     +270     
==========================================
+ Hits        40129    40358     +229     
- Misses       8632     8673      +41     
Impacted Files Coverage Δ
arrow/src/array/data.rs 81.16% <85.60%> (+1.37%) ⬆️
arrow/src/array/array_binary.rs 93.55% <100.00%> (+0.13%) ⬆️
arrow/src/array/array_dictionary.rs 88.75% <100.00%> (+0.28%) ⬆️
arrow/src/array/array_string.rs 97.08% <100.00%> (-0.83%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.46%) ⬇️
arrow/src/array/transform/mod.rs 85.10% <0.00%> (-0.14%) ⬇️
parquet/src/encodings/encoding.rs 93.71% <0.00%> (+0.19%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9fb2a5f...4436472. Read the comment docs.

@alamb alamb force-pushed the alamb/full_array_data_construction_validation branch 2 times, most recently from 30710a1 to 74b74dd Compare November 23, 2021 21:37
@alamb
Copy link
Contributor Author

alamb commented Nov 23, 2021

Update: I have a bunch of tests written, so that is good 🎉

I think there is some more testing needed here as when I tried to apply validate_full to every ArrayData construction I found some corner cases (like what if an offset array is marked as null? the spec I think says this is ok, but I think the C++ errors, which seems sensible to me)

@alamb alamb force-pushed the alamb/full_array_data_construction_validation branch from c416ad1 to 9d97916 Compare November 30, 2021 19:58
@alamb alamb changed the title (WIP) Add full data validation for ArrayData::try_new() Add full data validation for ArrayData::try_new() Nov 30, 2021
@alamb
Copy link
Contributor Author

alamb commented Nov 30, 2021

This PR (that has full ArrayData validate) is now ready for review . I hope it allows us to close the currently open RUSTSEC entries for arrow-rs: https://github.com/rustsec/advisory-db/tree/main/crates/arrow

cc @jhorstmann @bjchambers @Jimexist @jorgecarleitao

I know it is a large ask to review this PR, so thank you if you have time to do so.

As for testing, if you have a project that uses arrow 6.x.0 perhaps you would be willing to test it using the method in apache/datafusion#1386 which would be very helpful.

arrow/src/array/data.rs Outdated Show resolved Hide resolved
alamb and others added 4 commits December 1, 2021 16:58
@alamb alamb merged commit 5806d29 into apache:master Dec 4, 2021
@alamb
Copy link
Contributor Author

alamb commented Dec 4, 2021

Thanks again everyone for the help. I'll backport this PR and hopefully we can get it shipped in 6.4.0

@alamb alamb deleted the alamb/full_array_data_construction_validation branch December 4, 2021 11:44
alamb added a commit that referenced this pull request Dec 4, 2021
* Add full data validation for ArrayData::try_new()

* Only look at offset+len indexes

Co-authored-by: Jörn Horstmann <git@jhorstmann.net>

* fix test

* fmt

* test for array indexes

Co-authored-by: Jörn Horstmann <git@jhorstmann.net>
@alamb alamb added the cherry-picked PR that was backported to active release (will be included in maintenance release) label Dec 4, 2021
alamb added a commit that referenced this pull request Dec 5, 2021
* Add full data validation for ArrayData::try_new()

* Only look at offset+len indexes

Co-authored-by: Jörn Horstmann <git@jhorstmann.net>

* fix test

* fmt

* test for array indexes

Co-authored-by: Jörn Horstmann <git@jhorstmann.net>
alamb added a commit that referenced this pull request Dec 6, 2021
* Add full data validation for ArrayData::try_new() (#921)

* Add full data validation for ArrayData::try_new()

* Only look at offset+len indexes

Co-authored-by: Jörn Horstmann <git@jhorstmann.net>

* fix test

* fmt

* test for array indexes

Co-authored-by: Jörn Horstmann <git@jhorstmann.net>

* Fix: clippy

Co-authored-by: Jörn Horstmann <git@jhorstmann.net>
pitrou added a commit to apache/arrow that referenced this pull request Jan 4, 2022
…ike arrays

# Rationale
The question of "what are the values of the offsets for non-valid entries in arrays" came up in arrow-rs: apache/arrow-rs#1071 and the existing [docs](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) seem to be somewhat vague on this issue.

I looked at three implementations of arrow, and they all seem to assume / validate the offsets are monotonic:
* C++ implementation (I think) also also ensures the offsets are monotonic without first checking the validity array https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L568-L592
* arrow-rs after apache/arrow-rs#921 (based on the C++) will refuse to create arrays where the array offsets are non monotonic
* arrow2 also ensures that offsets are always monotonic.
https://github.com/jorgecarleitao/arrow2/blob/37a9c758826a92d98dc91e992b2a49ce9724095d/src/array/specification.rs#L102-L119

# Changes
Thus I propose updating the format docs to make the monotonic offsets explicit.

# Background
I think @jorgecarleitao's description on  apache/arrow-rs#1071 (comment), explains the reason why having monotonic offsets is a good idea

> I think that in general the property we seek is: discarding the validity cannot result in UB when accessing the values. This justifies the values buffer of a primitive array is always initialized, and the offsets being valid and in-bounds even in null cases.
>
> The rational for this is that sometimes it is faster to skip validity accesses and only iterate over the values (and clone the validity). I do not recall the benchmark result, but this may explain why string comparison ignores validity and & the bitmaps instead.

Closes #12019 from alamb/alamb/clarify_offsets

Lead-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate cherry-picked PR that was backported to active release (will be included in maintenance release)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validate arguments to ArrayData::try_new()
4 participants