Document precisely what we validate for skipped values #1640

jkeiser · 2021-06-25T19:51:01Z

We always validate whatever you use, as well as any objects and arrays that are part of the path to it. The idea is, you should never get the wrong value just because something you skipped has an error in it; but there are many errors that won't affect that.

A (partial?) list for documentation of what we do and don't validate for skipped values:

Strings:

We ALWAYS validate the boundaries of strings (that strings are properly closed, taking into account \"). This means that even if a string has some invalid stuff inside, we validate enough that they cannot affect anything else.
We ALWAYS validate string characters (newlines and control characters being disallowed).
We only validate the content of escape characters if you unescape a string (like using get_string() or unescaped_key()). i.e. \p is not allowed, and \u must be followed by hex digits.

Numbers/Booleans/Null:

For the purposes of structure, we ALWAYS treat any sequence of non-whitespace (except { } [ ] : and ,) is a single number/boolean/null. This means we'll treat that -0ab-10 trrrue as two values without a comma between them, whether you actually use the values or not.
We only validate the content of numbers, booleans and null if you actually use them.

Arrays/Objects:

We ALWAYS validate that objects/arrays are closed before continuing to iterate, even if the objects/arrays are skipped.
We only check for missing or extra , or : when you iterate or index an array or object.
We only check that keys are strings when you iterate or index an array or object.
We only validate remaining ,/: in an array or object if you fully iterate it.
We only validate that the closing ] or } matches the opening one if you fully iterate the object/array.

Document:

We ALWAYS validate that the JSON is not empty.
We NEVER check if there is extra JSON after your document.

The text was updated successfully, but these errors were encountered:

jkeiser · 2021-06-25T20:04:12Z

A couple of things here seem like we may want to change in terms of featureset:

We should definitely check if the document has extra crap after the end if you iterate through the whole thing.
We may want to check at least some structure for skipped arrays and objects, because if the user forgot an open or close brace, it could affect how later stuff is interpreted. Matching types for open and close braces come to mind; commas and semicolons between values could also be a candidate. Not sure if objects need to check sequence of ':' ',' or if simply checking <value> ([,:] <value>)* will catch all the cases we can catch.
There will be some cases where a missing brace cannot be detected structurally without validating the remaining structure of the document, even if you stop early! When the user does not fully consume a top-level object or array, we might want to continue on. (Alternatively, there may be some benefit to checking structure in stage 1, which could also simplify iteration and send less information to the frontend.)

lemire · 2021-06-25T20:59:03Z

We should definitely check if the document has extra crap after the end if you iterate through the whole thing.

I remember at least one user who would disagree with that.

lemire · 2021-06-25T21:01:52Z

Alternatively, there may be some benefit to checking structure in stage 1, which could also simplify iteration and send less information to the frontend.

I would be careful about adding too much weight to stage 1. We want to serve well the user that only wants to extract a tiny bit of information in a document that they "know" is correct.

lemire · 2021-06-25T21:02:58Z

@jkeiser What about a 'validate()' method on the document? Now that we have document rewinds... that could be useful. This way, people who don't need it do not have to pay a price for it.

jkeiser · 2021-06-26T01:20:35Z

I actually think we should have a (templated?) version of on demand that validates everything as you go. I think validate() is a good idea too.

My change suggestions here are to try and get closer to the ideal "any errors in unused json that would cause you to get the wrong value later on will be detected."

jkeiser · 2021-06-26T01:30:59Z

Perhaps more succinctly: "simdjson will detect any JSON errors that would change the output of your program."

jkeiser · 2021-06-26T01:32:51Z

I think the default user will want some assurance it is safe. I can imagine a flag that goes the other way, too: "assume the document is valid."

jkeiser changed the title ~~Document precisely what we do and don't validate~~ Document precisely what we validate for skipped values Jun 25, 2021

lemire added this to the 2.0 milestone Jul 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Document precisely what we validate for skipped values #1640

Document precisely what we validate for skipped values #1640

jkeiser commented Jun 25, 2021 •

edited

jkeiser commented Jun 25, 2021 •

edited

lemire commented Jun 25, 2021

lemire commented Jun 25, 2021

lemire commented Jun 25, 2021

jkeiser commented Jun 26, 2021

jkeiser commented Jun 26, 2021

jkeiser commented Jun 26, 2021

Document precisely what we validate for skipped values #1640

Document precisely what we validate for skipped values #1640

Comments

jkeiser commented Jun 25, 2021 • edited

jkeiser commented Jun 25, 2021 • edited

lemire commented Jun 25, 2021

lemire commented Jun 25, 2021

lemire commented Jun 25, 2021

jkeiser commented Jun 26, 2021

jkeiser commented Jun 26, 2021

jkeiser commented Jun 26, 2021

jkeiser commented Jun 25, 2021 •

edited

jkeiser commented Jun 25, 2021 •

edited