JSON object stream loading support #520

Open
eugene-bright opened this issue Apr 5, 2022 · 7 comments

@eugene-bright

eugene-bright commented Apr 5, 2022

What did you do?

Parsing AWS S3 bucket content that contains an aggregated stream of log objects.
The JSON objects are written continuously, without any delimiters, e.g.:

{"msg": "first messge"}{"msg": "second message"}

What did you expect to happen?

I know that it's not part of the JSON spec, but I'd expect something like this:

>>> list(ujson.loadstream("{}{}"))
[{}, {}]

What actually happened?

When I parse such a stream, an error arises:

>>> ujson.loads("{}{}")
ValueError: Trailing data

What versions are you using?

  • OS: Debian GNU/Linux bookworm/sid
  • Python: Python 3.8.12
  • UltraJSON: 5.1.0
@bwoodsend
Collaborator

That doesn't sound too hard. I'd definitely want this to be a separate function like ujson.loadstream(), as you suggested, while ujson.loads("{}{}") remains an error.

When you say stream, though, that seems to hint that this multi-JSON might be coming in chunks, and that those chunks may start or stop midway through an object. Is that likely to happen? It would certainly be a lot messier. Would you expect ujson.loadstream("{}{") to raise an error, or to yield the first object along with an indicator that the characters input[2:] are still to be processed?
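
For reference, the stdlib's json.JSONDecoder.raw_decode already implements roughly that second behaviour, returning the first object together with the index where parsing stopped:

>>> import json
>>> json.JSONDecoder().raw_decode("{}{")
({}, 2)

Everything from index 2 onwards is then the caller's problem, which maps onto the "yield the first object plus the unconsumed tail" option.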

@eugene-bright
Author

eugene-bright commented Apr 5, 2022

Thank you for the reply, @bwoodsend.
In my particular case I work with a single file-like object or string. If someone needs to combine chunks, they can write a wrapper over TextIOBase.
It would be a good idea to have an iterator that yields objects one by one, to save memory and allow parsing of very long streams.
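
For illustration, a pure-Python sketch of such an iterator over a text file-like object, built on the stdlib's json.JSONDecoder.raw_decode (the function name and buffering details are just illustrative, not a proposed ujson API):

import json

def iter_json_stream(fp, chunk_size=65536):
    # Yield objects one by one from a text file-like object containing
    # concatenated JSON, buffering across chunk boundaries.
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = fp.read(chunk_size)
        buf = (buf + chunk).lstrip()
        while buf:
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                # Presumably a partial object; wait for more input.
                # (Genuinely invalid JSON is only detected at EOF.)
                break
            yield obj
            buf = buf[end:].lstrip()
        if not chunk:
            if buf:
                raise ValueError("Truncated or invalid JSON at end of stream")
            return

>>> import io
>>> list(iter_json_stream(io.StringIO('{"msg": "first message"}{"msg": "second message"}')))
[{'msg': 'first message'}, {'msg': 'second message'}]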

@kibiz0r

kibiz0r commented Sep 8, 2022

FWIW @eugene-bright: I ended up using this for a similar situation: https://github.com/rickardp/splitstream

@eugene-bright
Author

Thanks for sharing, @kibiz0r.
I worked around my case with a custom JSONDecoder implementation.
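
Guessing at the shape of that workaround, the stdlib decoder's raw_decode makes a string-based version only a few lines (purely illustrative):

import json

def loadstream(s):
    # Yield each top-level object from a string of concatenated JSON.
    decoder = json.JSONDecoder()
    idx, n = 0, len(s)
    while idx < n:
        # raw_decode does not skip whitespace, so do it by hand.
        while idx < n and s[idx].isspace():
            idx += 1
        if idx >= n:
            break
        obj, idx = decoder.raw_decode(s, idx)
        yield obj

>>> list(loadstream("{}{}"))
[{}, {}]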

@bwoodsend
Collaborator

It does make me wonder why on earth AWS doesn't just write them as

[{"msg": "first messge"}, {"msg": "second message"}]

@eugene-bright
Author

With the proper Firehose configuration it could be possible, I believe. But...

@JustAnotherArchivist
Collaborator

@bwoodsend The downside is that with such a notation you always need to load the entire log into memory for decoding (without special trickery) rather than looping over the entries; that's the same reason JSONL exists. Although a separator (such as LF in JSONL, or, rarely, RS for record-separator-delimited JSON) being omitted makes it a pain again, in my opinion. Concatenated JSON has its advantages as well, though: in particular, you can pretty-print JSON and it will still work with concatenation, which isn't the case with JSONL, for example.
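
A quick illustration of that last point, using the stdlib decoder only for demonstration (the exact offset is incidental):

>>> import json
>>> doc = '{\n  "msg": "first"\n}\n{\n  "msg": "second"\n}\n'
>>> json.JSONDecoder().raw_decode(doc)  # concatenated: first object parses fine
({'msg': 'first'}, 20)
>>> json.loads(doc.splitlines()[0])  # naive line-by-line JSONL reading breaks
Traceback (most recent call last):
  ...
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)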
