Experiment with bogus-error approach for no-overhead bound-check-like behavior #1686

lemire · 2021-08-04T15:09:01Z

The simdjson library is highly optimized. Through clever optimizations, it avoids most bound checks.

There are a few limitations. For example, we require a few bytes of padding at the end of the input (#174). We also refuse to parse a single JSON document that exceeds 4 GB (#128).

To get around this, we have an outstanding PR #1665 which undoes these clever optimizations, adds regular bound checking, and lower the performance somewhat, but also allows you to lift the padding requirement.

A more daring approach would not to not go back to conventional bound checking and, instead, push forward with our clever bound-free approach. Instead of doing all of these bound checks all over the place... examine the document when we get started, adjust the structural index so that at a strategic location you get a bogus error. This bogus error brings you into a distinct mode where you finish the processing with more careful code. Then you'd get the no-padding for free (given a large enough input).

This "bogus error" approach is also how I would try to handle the "stage 1 in chunks". You give me a 6 GB JSON document. I index it in chunks of 1 MB. I change the index so that somewhere before the end of the chunk, I encounter a bogus error. Then I know to load a new index.

This would be a bit challenging, for sure. And it would require that we maintain a slow path with bound checking at times. The latter could be achieved with templates, maybe.

cc @jkeiser

jkeiser · 2021-08-05T13:22:01Z

This is definitely a great approach for DOM. The big difficulty is On Demand: because the user drives the parsing (which is unlined), I can't see a way to "switch modes" without embedding if (slowmode) statements everywhere.

lemire · 2021-08-05T13:38:30Z

@jkeiser This is marked as research, so I do not know whether it could work or how practical it would be, but consider this.

So everything works as we do now... then you hit an error. At this point, we have extra error handling code that checks whether it is a bogus error, if it is then we resume parsing with a different code path (slow?). The user never sees the bogus error, it is intercepted by the system before it gets to the user... There is a slight delay while the system "repairs" itself.

(Note: I do invite objections... that's the point of a 'research' idea: it can be wrong.)

lemire · 2021-08-05T14:49:50Z

The slow path may not require different code: we could simply copy the input to a temporary buffer. So you just temporarily switch the input buffer. This would be done only if the input was not already padded.

lemire · 2021-08-05T14:51:35Z

Of course, copying the last segment to a temporary padded buffer would not be free but it would be a constant-time cost that could easily amortized over big inputs. And the temporary buffer might be always small.

lemire added the research Exploration of the unknown label Aug 4, 2021

lemire added this to the 2.0 milestone Aug 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Experiment with bogus-error approach for no-overhead bound-check-like behavior #1686

Experiment with bogus-error approach for no-overhead bound-check-like behavior #1686

lemire commented Aug 4, 2021

jkeiser commented Aug 5, 2021

lemire commented Aug 5, 2021

lemire commented Aug 5, 2021

lemire commented Aug 5, 2021

Experiment with bogus-error approach for no-overhead bound-check-like behavior #1686

Experiment with bogus-error approach for no-overhead bound-check-like behavior #1686

Comments

lemire commented Aug 4, 2021

jkeiser commented Aug 5, 2021

lemire commented Aug 5, 2021

lemire commented Aug 5, 2021

lemire commented Aug 5, 2021