Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document precisely what we validate for skipped values #1640

Open
jkeiser opened this issue Jun 25, 2021 · 7 comments
Open

Document precisely what we validate for skipped values #1640

jkeiser opened this issue Jun 25, 2021 · 7 comments

Comments

@jkeiser
Copy link
Member

jkeiser commented Jun 25, 2021

We always validate whatever you use, as well as any objects and arrays that are part of the path to it. The idea is, you should never get the wrong value just because something you skipped has an error in it; but there are many errors that won't affect that.

A (partial?) list for documentation of what we do and don't validate for skipped values:

Strings:

  • We ALWAYS validate the boundaries of strings (that strings are properly closed, taking into account \"). This means that even if a string has some invalid stuff inside, we validate enough that they cannot affect anything else.
  • We ALWAYS validate string characters (newlines and control characters being disallowed).
  • We only validate the content of escape characters if you unescape a string (like using get_string() or unescaped_key()). i.e. \p is not allowed, and \u must be followed by hex digits.

Numbers/Booleans/Null:

  • For the purposes of structure, we ALWAYS treat any sequence of non-whitespace (except { } [ ] : and ,) is a single number/boolean/null. This means we'll treat that -0ab-10 trrrue as two values without a comma between them, whether you actually use the values or not.
  • We only validate the content of numbers, booleans and null if you actually use them.

Arrays/Objects:

  • We ALWAYS validate that objects/arrays are closed before continuing to iterate, even if the objects/arrays are skipped.
  • We only check for missing or extra , or : when you iterate or index an array or object.
  • We only check that keys are strings when you iterate or index an array or object.
  • We only validate remaining ,/: in an array or object if you fully iterate it.
  • We only validate that the closing ] or } matches the opening one if you fully iterate the object/array.

Document:

  • We ALWAYS validate that the JSON is not empty.
  • We NEVER check if there is extra JSON after your document.
@jkeiser jkeiser changed the title Document precisely what we do and don't validate Document precisely what we validate for skipped values Jun 25, 2021
@jkeiser
Copy link
Member Author

jkeiser commented Jun 25, 2021

A couple of things here seem like we may want to change in terms of featureset:

  • We should definitely check if the document has extra crap after the end if you iterate through the whole thing.
  • We may want to check at least some structure for skipped arrays and objects, because if the user forgot an open or close brace, it could affect how later stuff is interpreted. Matching types for open and close braces come to mind; commas and semicolons between values could also be a candidate. Not sure if objects need to check sequence of ':' ',' or if simply checking <value> ([,:] <value>)* will catch all the cases we can catch.
  • There will be some cases where a missing brace cannot be detected structurally without validating the remaining structure of the document, even if you stop early! When the user does not fully consume a top-level object or array, we might want to continue on. (Alternatively, there may be some benefit to checking structure in stage 1, which could also simplify iteration and send less information to the frontend.)

@lemire
Copy link
Member

lemire commented Jun 25, 2021

We should definitely check if the document has extra crap after the end if you iterate through the whole thing.

I remember at least one user who would disagree with that.

@lemire
Copy link
Member

lemire commented Jun 25, 2021

Alternatively, there may be some benefit to checking structure in stage 1, which could also simplify iteration and send less information to the frontend.

I would be careful about adding too much weight to stage 1. We want to serve well the user that only wants to extract a tiny bit of information in a document that they "know" is correct.

@lemire
Copy link
Member

lemire commented Jun 25, 2021

@jkeiser What about a 'validate()' method on the document? Now that we have document rewinds... that could be useful. This way, people who don't need it do not have to pay a price for it.

@jkeiser
Copy link
Member Author

jkeiser commented Jun 26, 2021

I actually think we should have a (templated?) version of on demand that validates everything as you go. I think validate() is a good idea too.

My change suggestions here are to try and get closer to the ideal "any errors in unused json that would cause you to get the wrong value later on will be detected."

@jkeiser
Copy link
Member Author

jkeiser commented Jun 26, 2021

Perhaps more succinctly: "simdjson will detect any JSON errors that would change the output of your program."

@jkeiser
Copy link
Member Author

jkeiser commented Jun 26, 2021

I think the default user will want some assurance it is safe. I can imagine a flag that goes the other way, too: "assume the document is valid."

@lemire lemire added this to the 2.0 milestone Jul 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants