Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment with bogus-error approach for no-overhead bound-check-like behavior #1686

Open
lemire opened this issue Aug 4, 2021 · 4 comments
Labels
research Exploration of the unknown

Comments

@lemire
Copy link
Member

lemire commented Aug 4, 2021

The simdjson library is highly optimized. Through clever optimizations, it avoids most bound checks.

There are a few limitations. For example, we require a few bytes of padding at the end of the input (#174). We also refuse to parse a single JSON document that exceeds 4 GB (#128).

To get around this, we have an outstanding PR #1665 which undoes these clever optimizations, adds regular bound checking, and lower the performance somewhat, but also allows you to lift the padding requirement.

A more daring approach would not to not go back to conventional bound checking and, instead, push forward with our clever bound-free approach. Instead of doing all of these bound checks all over the place... examine the document when we get started, adjust the structural index so that at a strategic location you get a bogus error. This bogus error brings you into a distinct mode where you finish the processing with more careful code. Then you'd get the no-padding for free (given a large enough input).

This "bogus error" approach is also how I would try to handle the "stage 1 in chunks". You give me a 6 GB JSON document. I index it in chunks of 1 MB. I change the index so that somewhere before the end of the chunk, I encounter a bogus error. Then I know to load a new index.

This would be a bit challenging, for sure. And it would require that we maintain a slow path with bound checking at times. The latter could be achieved with templates, maybe.

cc @jkeiser

@lemire lemire added the research Exploration of the unknown label Aug 4, 2021
@lemire lemire added this to the 2.0 milestone Aug 4, 2021
@jkeiser
Copy link
Member

jkeiser commented Aug 5, 2021

This is definitely a great approach for DOM. The big difficulty is On Demand: because the user drives the parsing (which is unlined), I can't see a way to "switch modes" without embedding if (slowmode) statements everywhere.

@lemire
Copy link
Member Author

lemire commented Aug 5, 2021

@jkeiser This is marked as research, so I do not know whether it could work or how practical it would be, but consider this.

So everything works as we do now... then you hit an error. At this point, we have extra error handling code that checks whether it is a bogus error, if it is then we resume parsing with a different code path (slow?). The user never sees the bogus error, it is intercepted by the system before it gets to the user... There is a slight delay while the system "repairs" itself.

(Note: I do invite objections... that's the point of a 'research' idea: it can be wrong.)

@lemire
Copy link
Member Author

lemire commented Aug 5, 2021

The slow path may not require different code: we could simply copy the input to a temporary buffer. So you just temporarily switch the input buffer. This would be done only if the input was not already padded.

@lemire
Copy link
Member Author

lemire commented Aug 5, 2021

Of course, copying the last segment to a temporary padded buffer would not be free but it would be a constant-time cost that could easily amortized over big inputs. And the temporary buffer might be always small.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
research Exploration of the unknown
Projects
None yet
Development

No branches or pull requests

2 participants