JSON object stream loading support #520

Open
eugene-bright opened this issue Apr 5, 2022 · 7 comments

@eugene-bright

eugene-bright commented Apr 5, 2022

What did you do?

Parsing AWS S3 bucket content that contains an aggregated stream of log objects.
The JSON objects are written continuously, without any delimiters, e.g.:

{"msg": "first messge"}{"msg": "second message"}

What did you expect to happen?

I know that it's not part of the JSON spec, but I'd expect something like this:

>>> list(ujson.loadstream("{}{}"))
[{}, {}]

What actually happened?

When I parse such a stream, an error arises:

>>> ujson.loads("{}{}")
ValueError: Trailing data

What versions are you using?

  • OS: Debian GNU/Linux bookworm/sid
  • Python: Python 3.8.12
  • UltraJSON: 5.1.0
@bwoodsend
Collaborator

That doesn't sound too hard. I'd definitely want this to be a separate function like ujson.loadstream(), as you suggested, while ujson.loads("{}{}") remains an error.

When you say stream, though, that seems to hint that this multi-JSON might be coming in chunks, and that those chunks may start or stop midway through an object. Is that likely to happen? It would certainly be a lot messier. Would you expect ujson.loadstream("{}{") to raise an error, or to yield the first object along with an indicator that the characters input[2:] are still to be processed?
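
For reference, the stdlib's json.JSONDecoder.raw_decode already implements roughly that second behaviour, returning the first object together with the index where parsing stopped:

>>> import json
>>> json.JSONDecoder().raw_decode("{}{")
({}, 2)

Everything from index 2 onwards is then the caller's problem, which maps onto the "yield the first object plus the unconsumed tail" option.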

@eugene-bright
Author

eugene-bright commented Apr 5, 2022

Thank you for the reply, @bwoodsend.
In my particular case I work with a single file-like object or string. If someone needs to combine chunks, they can write a wrapper over TextIOBase.
It would be a good idea to have an iterator that yields objects one by one, to save memory and allow parsing of very long streams.
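
For illustration, a pure-Python sketch of such an iterator over a text file-like object, built on the stdlib's json.JSONDecoder.raw_decode (the function name and buffering details are just illustrative, not a proposed ujson API):

import json

def iter_json_stream(fp, chunk_size=65536):
    # Yield objects one by one from a text file-like object containing
    # concatenated JSON, buffering across chunk boundaries.
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = fp.read(chunk_size)
        buf = (buf + chunk).lstrip()
        while buf:
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                # Presumably a partial object; wait for more input.
                # (Genuinely invalid JSON is only detected at EOF.)
                break
            yield obj
            buf = buf[end:].lstrip()
        if not chunk:
            if buf:
                raise ValueError("Truncated or invalid JSON at end of stream")
            return

>>> import io
>>> list(iter_json_stream(io.StringIO('{"msg": "first message"}{"msg": "second message"}')))
[{'msg': 'first message'}, {'msg': 'second message'}]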

@kibiz0r

kibiz0r commented Sep 8, 2022

FWIW @eugene-bright: I ended up using this for a similar situation: https://github.com/rickardp/splitstream

@eugene-bright
Author

Thanks for sharing, @kibiz0r.
I worked around my case with a custom JSONDecoder implementation.
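
Guessing at the shape of that workaround, the stdlib decoder's raw_decode makes a string-based version only a few lines (purely illustrative):

import json

def loadstream(s):
    # Yield each top-level object from a string of concatenated JSON.
    decoder = json.JSONDecoder()
    idx, n = 0, len(s)
    while idx < n:
        # raw_decode does not skip whitespace, so do it by hand.
        while idx < n and s[idx].isspace():
            idx += 1
        if idx >= n:
            break
        obj, idx = decoder.raw_decode(s, idx)
        yield obj

>>> list(loadstream("{}{}"))
[{}, {}]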

@bwoodsend
Collaborator

It does make me wonder why on earth AWS doesn't just write them as

[{"msg": "first messge"}, {"msg": "second message"}]

@eugene-bright
Author

With the proper Firehose configuration it could be possible, I believe. But...

@JustAnotherArchivist
Collaborator

@bwoodsend The downside is that with such a notation you always need to load the entire log into memory for decoding (without special trickery) rather than looping over the entries; that's the same reason JSONL exists. Although a separator (such as LF in JSONL, or, rarely, RS for record-separator-delimited JSON) being omitted makes it a pain again, in my opinion. Concatenated JSON has its advantages as well, though: in particular, you can pretty-print JSON and it will still work with concatenation, which isn't the case with JSONL, for example.
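
A quick illustration of that last point, using the stdlib decoder only for demonstration (the exact offset is incidental):

>>> import json
>>> doc = '{\n  "msg": "first"\n}\n{\n  "msg": "second"\n}\n'
>>> json.JSONDecoder().raw_decode(doc)  # concatenated: first object parses fine
({'msg': 'first'}, 20)
>>> json.loads(doc.splitlines()[0])  # naive line-by-line JSONL reading breaks
Traceback (most recent call last):
  ...
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)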
