Add support for streaming a large JSON array #526

Closed
17dec wants to merge 1 commit

Conversation

@17dec commented Mar 19, 2019

My recent excursion into EOF error reporting made me realize that my hack (which happens to rely on is_eof()) to incrementally parse a large JSON array isn't the most robust approach, so here's my initial attempt at implementing a proper streaming array parser.

This implements issue #404 by adding an ArrayDeserializer API that mimics the StreamDeserializer. I did investigate extending the StreamDeserializer API itself, but a new struct seemed more natural.

@dtolnay (Member) left a comment

Thanks. Can you come up with some other possible ways to expose this? I am concerned that ArrayDeserializer is not good for handling arrays with mixed element types. StreamDeserializer is designed as a convenience for the common case of a uniform element type, but mixed streams are supported without it:

let mut de = serde_json::Deserializer::from_reader(...);
let t1 = T1::deserialize(&mut de)?;
let t2 = T2::deserialize(&mut de)?;

I would like to have some story for supporting non-uniform streaming arrays before accepting something like this.
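
For reference, a minimal runnable version of that snippet (the Point type and the input string are invented for illustration):

use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Point {
    x: i32,
    y: i32,
}

fn main() -> serde_json::Result<()> {
    // Two JSON values of different types, back to back in one stream.
    let mut de = serde_json::Deserializer::from_str(r#"{"x":1,"y":2} "hello""#);
    let point = Point::deserialize(&mut de)?;
    let label = String::deserialize(&mut de)?;
    println!("{:?} {}", point, label);
    Ok(())
}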

@17dec (Author) commented Mar 20, 2019

Thanks for the feedback! I hadn't considered that use case. Unfortunately, that may complicate the API a little bit. Here's what I have in mind:

  • Modify ArrayDeserializer to provide a next() method that is generic over its return type, instead of implementing Iterator (see the toy sketch below).
  • Rename Deserializer::into_array_iter() to into_array(), since it no longer returns an Iterator.
  • Add an ArrayDeserializer::iter() (or into_iter() if I can't get borrowing to work) to still provide a convenient Iterator API for uniform element types. This step is optional, and I'm not sure the convenience is even that important.

How does this sound?
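
To make point (1) concrete, here is a toy model of a type-generic next(), built on serde_json::Value rather than the real streaming parser; the struct and its behavior only illustrate the shape of the proposed API, they are not the actual patch:

use serde::de::DeserializeOwned;
use serde_json::Value;

struct ArrayDeserializer {
    items: std::vec::IntoIter<Value>,
}

impl ArrayDeserializer {
    // next() picks its output type per call, so a single array can hold
    // elements of different types -- which is also why this cannot
    // implement Iterator (Iterator requires one fixed Item type).
    fn next<T: DeserializeOwned>(&mut self) -> Option<serde_json::Result<T>> {
        self.items.next().map(serde_json::from_value::<T>)
    }
}

fn main() -> serde_json::Result<()> {
    let parsed: Value = serde_json::from_str(r#"[1, "two", true]"#)?;
    let items = match parsed {
        Value::Array(items) => items,
        _ => panic!("expected a JSON array"),
    };
    let mut de = ArrayDeserializer { items: items.into_iter() };

    let a: u64 = de.next().unwrap()?;
    let b: String = de.next().unwrap()?;
    let c: bool = de.next().unwrap()?;
    println!("{} {} {}", a, b, c);
    Ok(())
}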

(Commit message of the updated patch:)

This mimics the StreamDeserializer API and implements issue serde-rs#404. Unlike the StreamDeserializer, the ArrayDeserializer struct itself does not keep track of the type of the array's elements; instead, next() itself is generic to support deserialization of arrays with values of different types. Unfortunately, this means we can't implement the Iterator trait.

@17dec (Author) commented Mar 25, 2019

Just pushed a new patch that implements points (1) and (2) of my proposal. I've not implemented the separate Iterator object (3); this API is not likely to be used very often, so keeping the overall API small and simple seems preferable to the bit of extra convenience.

@alexwennerberg commented

Really interested in this PR! Is there anything preventing it from being merged at the moment?

@dtolnay (Member) commented Jun 25, 2019

I wanted to brainstorm some alternatives but haven't been able to make time yet. Mainly I wonder how this API would be different if we wanted to support streaming nested arrays inside of arrays, arrays inside of objects, objects inside of objects, etc. Almost like a deserialization version of serde_json::ser::Formatter.

Maybe this could be experimented with in its own crate outside of serde_json, then considered for a PR?

@17dec (Author) commented Jun 25, 2019

I've considered that use case, and think my proposed API can be extended quite naturally to support it. Fortunately, JSON only has two nested types, so it ought to be pretty simple (albeit a bit crude):

  • Deserializer::into_array() -> ArrayDeserializer
  • Deserializer::into_map() -> ObjectDeserializer
  • ArrayDeserializer::next_array() -> ArrayDeserializer
  • ArrayDeserializer::next_map() -> ObjectDeserializer
  • ObjectDeserializer::next_array() -> ArrayDeserializer
  • ObjectDeserializer::next_map() -> ObjectDeserializer

To make this work, ObjectDeserializer and ArrayDeserializer would need to have a lifetime so that one can't call next() on the originating struct while dealing with a nested array/object.

But I won't have time to work this out in the foreseeable future.
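
For illustration only, here is a minimal, library-agnostic sketch of that borrowing scheme (all type and method names are invented): the nested handle mutably borrows its parent, so the borrow checker rejects any call on the parent until the nested handle is dropped.

struct ArrayHandle {
    depth: usize,
}

struct NestedArray<'a> {
    parent: &'a mut ArrayHandle,
}

impl ArrayHandle {
    fn next_value(&mut self) -> Option<u32> {
        Some(self.depth as u32)
    }

    // The returned handle mutably borrows `self`, so next_value() cannot be
    // called on the parent while the nested handle is alive.
    fn next_array(&mut self) -> NestedArray<'_> {
        self.depth += 1;
        NestedArray { parent: self }
    }
}

impl NestedArray<'_> {
    fn next_value(&mut self) -> Option<u32> {
        Some(self.parent.depth as u32)
    }
}

fn main() {
    let mut outer = ArrayHandle { depth: 0 };
    let mut inner = outer.next_array();
    // outer.next_value(); // error: `outer` is still mutably borrowed
    let _ = inner.next_value();
    drop(inner);
    let _ = outer.next_value(); // fine again once the nested handle is gone
}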

@dtolnay added the wip label on Jul 1, 2019
@OscarRL commented Aug 7, 2019

> I've considered that use case, and think my proposed API can be extended quite naturally to support it. Fortunately, JSON only has two nested types, so it ought to be pretty simple (albeit a bit crude):
>
>   • Deserializer::into_array() -> ArrayDeserializer
>   • Deserializer::into_map() -> ObjectDeserializer
>   • ArrayDeserializer::next_array() -> ArrayDeserializer
>   • ArrayDeserializer::next_map() -> ObjectDeserializer
>   • ObjectDeserializer::next_array() -> ArrayDeserializer
>   • ObjectDeserializer::next_map() -> ObjectDeserializer
>
> To make this work, ObjectDeserializer and ArrayDeserializer would need to have a lifetime so that one can't call next() on the originating struct while dealing with a nested array/object.
>
> But I won't have time to work this out in the foreseeable future.

This seems like the best approach. In our ingestion pipeline we handle JSON documents like the ones discussed here, but we also have cases with a single very big JSON document.

The big JSON array proposed here:

[
  { .. element.. },
  ... GB of elements...
]

An alternative: a big JSON object used to transfer multiple datasets in the same file:

{
   "data1": [.... GB of elements...],
   "data2": [.... more GB of elements...]
}

And JSON Lines, which can be parsed easily by reading lines into a byte buffer and feeding them to the library (see the sketch after this example):

[element]
[element]
... GB of elements...
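
For the JSON Lines shape, the existing StreamDeserializer already covers this without buffering the whole file; a minimal sketch (the file name and element type are placeholders):

use std::fs::File;
use std::io::BufReader;

use serde_json::Value;

fn main() -> std::io::Result<()> {
    let reader = BufReader::new(File::open("data.jsonl")?);
    // into_iter() yields one item per whitespace/newline-separated JSON
    // value, so peak memory is bounded by the size of a single element.
    for element in serde_json::Deserializer::from_reader(reader).into_iter::<Value>() {
        let element = element.expect("invalid JSON element");
        println!("{}", element);
    }
    Ok(())
}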

At the moment the library always expects the data to fit in memory, or it forces the programmer to build a custom deserializer and pull the data out of the nested elements using the visitor pattern (see the sketch below).
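
For the single giant array, a rough sketch of that visitor-pattern workaround on the current public API (the element type and the callback are placeholders); it streams the elements of a top-level JSON array without ever building the whole Vec:

use std::fmt;
use std::marker::PhantomData;

use serde::de::{Deserializer as _, SeqAccess, Visitor};
use serde::Deserialize;

// Deserializes a JSON array, invoking the callback for each element as soon
// as it has been parsed, instead of collecting everything into memory.
struct ForEach<T, F: FnMut(T)> {
    callback: F,
    marker: PhantomData<T>,
}

impl<'de, T, F> Visitor<'de> for ForEach<T, F>
where
    T: Deserialize<'de>,
    F: FnMut(T),
{
    type Value = ();

    fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        formatter.write_str("a JSON array")
    }

    fn visit_seq<A: SeqAccess<'de>>(mut self, mut seq: A) -> Result<(), A::Error> {
        // Elements are pulled from the underlying input one at a time, so
        // peak memory stays bounded by the size of a single element.
        while let Some(element) = seq.next_element::<T>()? {
            (self.callback)(element);
        }
        Ok(())
    }
}

fn main() -> serde_json::Result<()> {
    let mut de = serde_json::Deserializer::from_str(r#"[1, 2, 3, 4]"#);
    de.deserialize_seq(ForEach::<u64, _> {
        callback: |n| println!("got {}", n),
        marker: PhantomData,
    })?;
    Ok(())
}

This is exactly the "custom deserializer" burden described above: control stays inside the visitor instead of an iterator-like handle the caller can drive, which is what this PR tries to provide.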

@nwoltman commented

Is it possible to implement the code in this PR outside of serde_json? I need the ArrayDeserializer for a project I'm working on but it uses methods that are marked as private, so it seems I can't use the code in my own crate.

(Sorry if there's an easy solution to this. It's only my second day using Rust.)

@dtolnay (Member) commented Mar 15, 2020

Closing because it doesn't look like there is still design work ongoing and I don't plan to accept the current design. I would recommend pursuing a design that takes into account #526 (review) and #526 (comment), whether in an external crate that builds on serde_json or in a fork. The examples in #526 (comment) would be a good test case to work toward getting working.

@dtolnay closed this on Mar 15, 2020
@dtolnay removed the wip label on Mar 15, 2020
@serde-rs locked and limited conversation to collaborators on Mar 9, 2021