The simdjson library is highly optimized. Through clever optimizations, it avoids most bounds checks.
There are a few limitations. For example, we require a few bytes of padding at the end of the input (#174). We also refuse to parse a single JSON document that exceeds 4 GB (#128).
To get around this, we have an outstanding PR #1665 which undoes these clever optimizations and adds regular bounds checking. It lowers the performance somewhat, but also allows you to lift the padding requirement.
A more daring approach would be to not go back to conventional bounds checking and, instead, push forward with our clever bounds-check-free approach. Instead of doing bounds checks all over the place... examine the document when we get started and adjust the structural index so that, at a strategic location, you get a bogus error. This bogus error brings you into a distinct mode where you finish the processing with more careful code. Then you'd get the no-padding support for free (given a large enough input).
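Here is a minimal sketch of that idea (not simdjson's real code; `PAD`, the function, and the index layout are all illustrative). We pick a cutoff in the structural index so that the unchecked fast loop never reads within `PAD` bytes of the end of the buffer; a bounds-checked loop finishes the last few structurals:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical padding that the unchecked fast path is allowed to assume.
constexpr size_t PAD = 32;

// Count commas among the structural characters of `json`.
// `structural` holds the (sorted) byte positions of structural characters.
size_t count_commas(const std::string& json, const std::vector<size_t>& structural) {
  // Find the first structural position too close to the end of the buffer;
  // this plays the role of the "strategic location" for the planted error.
  size_t cutoff = structural.size();
  for (size_t i = 0; i < structural.size(); ++i) {
    if (structural[i] + PAD > json.size()) { cutoff = i; break; }
  }
  size_t commas = 0;
  size_t i = 0;
  // Fast path: no bounds check on json[...] -- safe because the cutoff
  // guarantees we stay at least PAD bytes away from the end.
  for (; i < cutoff; ++i) {
    if (json[structural[i]] == ',') ++commas;
  }
  // Careful path: every access is explicitly bounds checked.
  for (; i < structural.size(); ++i) {
    if (structural[i] < json.size() && json.at(structural[i]) == ',') ++commas;
  }
  return commas;
}
```

For a large input, almost all structurals are handled by the unchecked loop; only a constant-size tail pays for the checks.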
This "bogus error" approach is also how I would try to handle the "stage 1 in chunks". You give me a 6 GB JSON document. I index it in chunks of 1 MB. I change the index so that somewhere before the end of the chunk, I encounter a bogus error. Then I know to load a new index.
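As a toy illustration of the chunked scheme (again, not simdjson's stage 1; the names and the chunk size are made up), the driver indexes one chunk at a time, and the point where the consumer would hit the planted error is exactly where the next chunk gets indexed:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Index one chunk [begin, end) of the input: record positions of structural characters.
std::vector<size_t> index_chunk(const std::string& json, size_t begin, size_t end) {
  std::vector<size_t> structural;
  for (size_t i = begin; i < end && i < json.size(); ++i) {
    char c = json[i];
    if (c == '{' || c == '}' || c == '[' || c == ']' || c == ',' || c == ':')
      structural.push_back(i);
  }
  return structural;
}

// Walk the whole document chunk by chunk. In the real scheme, the consumer
// would hit a planted error at the end of each chunk's index, which is the
// signal to index the next chunk; this sketch just loops.
size_t count_structurals_chunked(const std::string& json, size_t chunk_size) {
  size_t total = 0;
  for (size_t begin = 0; begin < json.size(); begin += chunk_size)
    total += index_chunk(json, begin, begin + chunk_size).size();
  return total;
}
```

With a real 6 GB document you would use something like 1 MB chunks; the per-chunk index stays small and the 4 GB limit on the index no longer applies.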
This would be a bit challenging, for sure. And it would require that we maintain a slow path with bound checking at times. The latter could be achieved with templates, maybe.
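The template idea could look something like this sketch (names are illustrative): one source function, with a compile-time flag selecting whether accesses are checked, so the compiler emits a fast instantiation and a careful one from the same code:

```cpp
#include <cstddef>
#include <string>

// One code base, two instantiations: read_byte<false> is the unchecked fast
// path, read_byte<true> is the bounds-checked slow path.
template <bool bounds_checked>
char read_byte(const std::string& buf, size_t i) {
  if constexpr (bounds_checked) {
    if (i >= buf.size()) return '\0';  // careful path: checked access
  }
  return buf[i];  // fast path: relies on the caller staying in bounds
}
```

The same pattern scales up to whole parsing routines: `parse<true>` and `parse<false>` share one body, so the slow path cannot drift out of sync with the fast one.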
This is definitely a great approach for DOM. The big difficulty is On Demand: because the user drives the parsing (which is inlined), I can't see a way to "switch modes" without embedding if (slowmode) statements everywhere.
@jkeiser This is marked as research, so I do not know whether it could work or how practical it would be, but consider this.
So everything works as we do now... then you hit an error. At this point, we have extra error-handling code that checks whether it is a bogus error; if it is, we resume parsing with a different (slower?) code path. The user never sees the bogus error: it is intercepted by the system before it gets to the user... There is a slight delay while the system "repairs" itself.
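The control flow being proposed might look like this sketch, where the fast and careful routines are stubs and every name (`BOGUS_EOF`, `parse_fast`, `parse_careful`) is hypothetical; the point is only that the interception happens inside the library, so the caller never observes the planted error:

```cpp
#include <cstddef>
#include <string>

// Hypothetical error codes: BOGUS_EOF is the planted error, never user-visible.
enum class Error { SUCCESS, BOGUS_EOF, UNCLOSED_STRING };

// Stub for the unchecked fast path: it stops at a sentinel planted near the
// end of the input and reports the bogus error, recording where it stopped.
Error parse_fast(const std::string& json, size_t& pos) {
  pos = json.size() > 4 ? json.size() - 4 : 0;
  return Error::BOGUS_EOF;
}

// Stub for the careful path: finishes the remaining bytes with bounds checks.
Error parse_careful(const std::string& json, size_t pos) {
  (void)json; (void)pos;
  return Error::SUCCESS;
}

// The wrapper the user actually calls: it "repairs" itself by intercepting
// the bogus error and resuming with the careful routine.
Error parse(const std::string& json) {
  size_t pos = 0;
  Error err = parse_fast(json, pos);
  if (err == Error::BOGUS_EOF)
    return parse_careful(json, pos);  // user never sees BOGUS_EOF
  return err;
}
```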
(Note: I do invite objections... that's the point of a 'research' idea: it can be wrong.)
The slow path may not require different code: we could simply copy the input to a temporary buffer. So you just temporarily switch the input buffer. This would be done only if the input was not already padded.
Of course, copying the last segment to a temporary padded buffer would not be free, but it would be a constant-time cost that could easily be amortized over big inputs. And the temporary buffer could always be small.
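A minimal sketch of that fallback (with a made-up `PAD` constant and function name): copy only the final segment of an unpadded input into a small scratch buffer that carries the padding the fast path assumes, and parse the tail from the copy:

```cpp
#include <cstddef>
#include <string>

// Hypothetical padding the fast path is allowed to read past the end.
constexpr size_t PAD = 32;

// Copy the last `tail_len` bytes of `json` into a padded scratch buffer.
// Only this small buffer needs padding, so the cost is constant and does
// not grow with the size of the document.
std::string pad_tail(const std::string& json, size_t tail_len) {
  size_t begin = json.size() > tail_len ? json.size() - tail_len : 0;
  std::string scratch(json.begin() + begin, json.end());
  scratch.append(PAD, '\0');  // the padding bytes the fast path may touch
  return scratch;
}
```

The bulk of the document is still parsed in place; only the tail is redirected to the scratch buffer, and only when the caller's buffer lacks padding.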