This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added support for JSON ser/de records layout #1275

Merged
merged 21 commits into from Oct 30, 2022


AnIrishDuck
Contributor

@AnIrishDuck AnIrishDuck commented Oct 13, 2022

Prior discussion in #1178

Serialization was easy: individual streaming iterators already produce a record at a time, so we just need to salami slice them to transpose the results.
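
As a rough illustration of the transposition described above, here is a std-only sketch. It does not use arrow2's serializer types: `columns_to_records` is a hypothetical helper, and each column is assumed to be an iterator that yields one already-serialized JSON value per row. A record is built by taking one value from every column.

```rust
/// Builds JSON-like records by pulling one value from each column iterator per row.
/// Hypothetical helper for illustration; not part of arrow2.
fn columns_to_records<'a>(
    names: &[&str],
    mut columns: Vec<Box<dyn Iterator<Item = String> + 'a>>,
) -> Vec<String> {
    let mut records = Vec::new();
    loop {
        let mut fields = Vec::with_capacity(columns.len());
        for (name, column) in names.iter().zip(columns.iter_mut()) {
            match column.next() {
                Some(value) => fields.push(format!("\"{}\":{}", name, value)),
                // Stop as soon as any column is exhausted.
                None => return records,
            }
        }
        // Guard against looping forever when there are no columns at all.
        if fields.is_empty() {
            return records;
        }
        records.push(format!("{{{}}}", fields.join(",")));
    }
}
```

Because every column is consumed one value at a time, no intermediate row-major copy of the data is needed in this sketch.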

Deserialization was more complex:

  • Define Preallocate for arrays that have a backing that can be opportunistically resized.
    • Add expand to MutableListArray. This is the workhorse that makes later arbitrarily-nested deserialization possible.
    • Declare an impl MutableArray for Box<dyn MutableArray>. This is mostly trivial delegation, so it may be worth investigating the delegate crate for boilerplate reduction. We need this to support arbitrary recursion at the type level. This allows us to define a MutableListArray<O, Box<dyn MutableArray>>.
    • Finally, implement schema inference and deserialization. Here be dragons:
      • Schema inference basically recurses down to the first element for each list, and assumes that element and array shape for all future records. There are obvious pitfalls here.
      • For deserialization, we want to avoid an expensive transpose. So we instead allocate mutable arrays for each record in the schema, and expand any nested list arrays recursively after filling the recursive sub-level. This requires converting a bunch of deserialize_xxx methods into deserialize_xxx_into methods that extend an existing array instead of returning a new array. We can generally use adapters like fill_array_from to avoid code duplication in the original case where we just need a plain list returned.
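
The role of the delegating `impl MutableArray for Box<dyn MutableArray>` can be sketched with toy types. Everything below is an assumption made for illustration: `Growable`, `Ints`, `List` and `build_nested` are hypothetical stand-ins, not arrow2's `MutableArray` or `MutableListArray`. The point is only that once the boxed trait object implements the trait, a generic list container can hold a child whose nesting depth is decided at runtime.

```rust
/// Toy stand-in for a growable array trait (not arrow2's `MutableArray`).
trait Growable {
    fn len(&self) -> usize;
    fn push_null(&mut self);
}

/// Leaf array of optional integers.
struct Ints(Vec<Option<i64>>);

impl Growable for Ints {
    fn len(&self) -> usize { self.0.len() }
    fn push_null(&mut self) { self.0.push(None) }
}

/// Delegating impl: forwards every method to the boxed value.
impl Growable for Box<dyn Growable> {
    fn len(&self) -> usize { (**self).len() }
    fn push_null(&mut self) { (**self).push_null() }
}

/// Toy list array: `offsets` delimit slots in the child `values`.
struct List<M: Growable> {
    offsets: Vec<usize>,
    values: M,
}

impl<M: Growable> List<M> {
    fn new(values: M) -> Self { Self { offsets: vec![0], values } }
}

impl<M: Growable> Growable for List<M> {
    fn len(&self) -> usize { self.offsets.len() - 1 }
    // A null list is modelled here as an empty slot.
    fn push_null(&mut self) { self.offsets.push(self.values.len()) }
}

/// Builds a list nested `depth` levels deep, with the depth known only at runtime.
fn build_nested(depth: usize) -> Box<dyn Growable> {
    if depth == 0 {
        Box::new(Ints(Vec::new()))
    } else {
        // `List<Box<dyn Growable>>` is only valid thanks to the delegating impl.
        Box::new(List::new(build_nested(depth - 1)))
    }
}
```

Without the delegating impl, `List<Box<dyn Growable>>` would not satisfy its `M: Growable` bound, and each nesting depth would need its own concrete type.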

It's worth noting that this approach may be useful in general to avoid some of the allocations that the json deserializer currently performs (creating Vec of row references).
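
A minimal sketch of the `deserialize_xxx_into` pattern, using plain `Vec`s rather than arrow2 arrays. The names `parse_ints_into` and `fill_from` are hypothetical; `fill_from` only mirrors the idea behind an adapter like `fill_array_from` and is not its actual signature.

```rust
/// Extends `target` with parsed rows instead of allocating a new vector.
/// Hypothetical analogue of a `deserialize_xxx_into` method.
fn parse_ints_into(rows: &[&str], target: &mut Vec<Option<i64>>) {
    target.reserve(rows.len());
    for row in rows {
        // Values that fail to parse become nulls.
        target.push(row.trim().parse::<i64>().ok());
    }
}

/// Adapter that recovers the original "return a new array" behaviour
/// from an `_into` function, so the parsing logic is written only once.
fn fill_from<T>(fill: impl Fn(&[&str], &mut Vec<T>), rows: &[&str]) -> Vec<T> {
    let mut out = Vec::new();
    fill(rows, &mut out);
    out
}
```

Callers that already own a buffer use `parse_ints_into` directly, while callers that just want a fresh vector can go through `fill_from(parse_ints_into, rows)`.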

@codecov

codecov bot commented Oct 20, 2022

Codecov Report

Base: 83.04% // Head: 83.13% // Increases project coverage by +0.08% 🎉

Coverage data is based on head (06757db) compared to base (27e109d).
Patch coverage: 77.59% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1275      +/-   ##
==========================================
+ Coverage   83.04%   83.13%   +0.08%     
==========================================
  Files         363      363              
  Lines       38442    38884     +442     
==========================================
+ Hits        31926    32326     +400     
- Misses       6516     6558      +42     
Impacted Files Coverage Δ
src/array/mod.rs 64.93% <50.00%> (-3.62%) ⬇️
src/io/json/read/infer_schema.rs 93.15% <74.28%> (-2.33%) ⬇️
src/io/json/read/deserialize.rs 74.52% <74.35%> (+1.96%) ⬆️
src/array/list/mutable.rs 76.14% <86.04%> (+3.98%) ⬆️
src/io/json/write/serialize.rs 92.33% <88.88%> (-0.40%) ⬇️
src/array/fixed_size_list/mutable.rs 67.93% <100.00%> (+18.72%) ⬆️
src/io/json/write/mod.rs 98.59% <100.00%> (+5.73%) ⬆️
src/compute/cast/mod.rs 93.54% <0.00%> (-0.51%) ⬇️
src/io/ipc/read/schema.rs 95.29% <0.00%> (-0.30%) ⬇️
src/io/ipc/read/file.rs 96.87% <0.00%> (+0.44%) ⬆️
... and 10 more


@AnIrishDuck
Contributor Author

Thoughts on using match-downcast to get rid of the if-else ugliness? Clippy just reminded me that it is indeed, quite gross.

@ritchie46
Collaborator

> Thoughts on using match-downcast to get rid of the if-else ugliness? Clippy just reminded me that it is indeed, quite gross.

I don't think that's worth another crate pulled in.
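
For context, the kind of if-else downcast chain under discussion looks roughly like the following. This is a generic sketch over `std::any::Any` with a hypothetical `describe` function, not the code from this PR.

```rust
use std::any::Any;

/// Describes a dynamically typed value with an if-else downcast chain.
fn describe(value: &dyn Any) -> String {
    if let Some(v) = value.downcast_ref::<i64>() {
        format!("int {}", v)
    } else if let Some(v) = value.downcast_ref::<f64>() {
        format!("float {}", v)
    } else if let Some(v) = value.downcast_ref::<String>() {
        format!("string {}", v)
    } else {
        // No known type matched.
        "unknown".to_string()
    }
}
```

The chain is verbose but needs nothing beyond the standard library, which is the trade-off being weighed here.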

Owner

@jorgecarleitao jorgecarleitao left a comment

This looks really great.

My only major comment is to keep things private - i.e. keep all changes needed for the json private to this crate, since they are only used in the context of the crate.

Great PR!!

src/array/binary/mutable.rs (outdated, resolved)
src/array/list/mutable.rs (outdated, resolved)
src/array/mod.rs (resolved)
src/io/json/read/deserialize.rs (outdated, resolved)
tests/it/io/json/read.rs (outdated, resolved)
};

// No idea why assert_eq! doesn't work here, but this does.
assert_eq!(format!("{:?}", expected), format!("{:?}", actual));
Owner

I suspect this is because line 144 sets the nullability of the inner field to false, while the deserialized result has it set to true.

Contributor Author

Yup, this was it. Kinda an interesting philosophical question on whether they are actually equal... value-wise, they are identical, even though one could hold nullable values. While I'm not 100% sure I agree with the conclusion, I understand and respect the reasoning.

Contributor Author

The last thought I'll leave here is that the error when they are not equal is pretty confusing:

thread 'io::json::read::read_json_records' panicked at 'assertion failed: `(left == right)`
  left: `ListArray[[[1.1, 2, 3], [2, 3], [4, 5, 6]], [[3, 2, 1], [3, 2], [6, 5, 4]]]`,
 right: `ListArray[[[1.1, 2, 3], [2, 3], [4, 5, 6]], [[3, 2, 1], [3, 2], [6, 5, 4]]]`', tests/it/io/json/read.rs:114:9

Not sure if the solution is to somehow mark that one is nullable and one is not.

Owner

I agree - I was also bitten by this many times. The challenge is that the ListArray debug output can become pretty complex if we expose the inner field in it :(

Contributor Author

Yup, that makes sense. Especially with deeply nested types.
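
The pitfall discussed in this thread can be reproduced with a toy type. `ToyList` below is hypothetical and is not arrow2's `ListArray`; it only assumes, as described above, that equality takes the inner field's nullability into account while the debug output prints values only.

```rust
use std::fmt;

/// Toy list array: the inner field's nullability is part of the type
/// description, not of the values (hypothetical, not arrow2's `ListArray`).
#[derive(PartialEq)]
struct ToyList {
    inner_nullable: bool,
    values: Vec<Vec<f64>>,
}

/// Debug prints only the values and hides the inner field.
impl fmt::Debug for ToyList {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "ToyList{:?}", self.values)
    }
}
```

Two `ToyList` values that differ only in `inner_nullable` compare as unequal, yet their `{:?}` output is identical, so a failing `assert_eq!` shows the same text on both sides.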

tests/it/io/json/write.rs (outdated, resolved)
src/array/list/mutable.rs (outdated, resolved)
src/array/list/mutable.rs (outdated, resolved)
src/array/list/mutable.rs (outdated, resolved)
AnIrishDuck and others added 3 commits October 21, 2022 12:13
Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
src/array/mod.rs (outdated, resolved)
@jorgecarleitao jorgecarleitao changed the title Support for pandas records ser/de Added support for JSON ser/de records layout Oct 30, 2022
@jorgecarleitao jorgecarleitao merged commit cd985d4 into jorgecarleitao:main Oct 30, 2022
@jorgecarleitao jorgecarleitao added the feature A new feature label Oct 30, 2022
@jorgecarleitao
Owner

Thank you so much! Awesome feature and PR! 🙇

ritchie46 pushed a commit to ritchie46/arrow2 that referenced this pull request Nov 6, 2022