Clarify JSON5 for non-ECMAScript developers #147

Closed
d-frey opened this issue Jun 3, 2017 · 19 comments

@d-frey

d-frey commented Jun 3, 2017

It would be great if you could clarify a few points on the web-page which are not obvious.

  • Can multi-line comments be nested? (I assume they may not be nested)
  • It says numbers may be prefixed by +. Does this also apply to NaN and Infinity? (I assume yes, +NaN and +Infinity are valid)
  • Numbers may have leading or trailing decimal separator. Is it allowed to have both leading and trailing decimal separator? Meaning: Is . a valid number? (I assume it is not)
  • Strings can now be single-quoted. What escaped characters are allowed in single-quoted strings? \'? Is \" allowed in a single-quoted string? And is \' now allowed in a double-quoted string? Any other extensions? EDIT: JSON allows an escape slash: \/, does ECMAScript? What about \v (not in JSON, but in ECMAScript).
  • Object keys are, in JSON, strings. Can a multi-line string be used as a key in JSON5?
  • Unquoted object keys - is there any chance you can limit this to an explicitly specified character set without relying on some unicode support? Our library (https://github.com/taocpp/json) tries to be dependency-free and having to deal with unicode character properties seems to go against the simplicity of JSON (and hopefully JSON5). In fact, I'd consider alnums + _ + $ (without a leading cipher) to be good enough, if you want throw in more characters explicitly like -, ., ... but Unicode is way too much IMHO.
@jordanbtucker
Member

jordanbtucker commented Jun 3, 2017

It would be great if you could clarify a few points on the web-page which are not obvious.

  • Can multi-line comments be nested? (I assume they may not be nested)
    • No.
  • It says numbers may be prefixed by +. Does this also apply to NaN and Infinity? (I assume yes, +NaN and +Infinity are valid)
    • Yes.
  • Numbers may have leading or trailing decimal separator. Is it allowed to have both leading and trailing decimal separator? Meaning: Is . a valid number? (I assume it is not)
    • No.
  • Strings can now be single-quoted. What escaped characters are allowed in single-quoted strings? \'? Is \" allowed in a single-quoted string? And is \' now allowed in a double-quoted string? Any other extensions? EDIT: JSON allows an escape slash: \/, does ECMAScript? What about \v (not in JSON, but in ECMAScript).
    • In strings, the following characters have special meaning when preceded by \: b, f, n, r, t, u, a literal line feed (U+000A), and a literal carriage return (U+000D). When the following characters are preceded by \, the \ is dropped and the character stands for itself: ', ", \, /. (Meaning "\'" is the same as "'".) This applies to both single- and double-quoted strings.
  • Object keys are, in JSON, strings. Can a multi-line string be used as a key in JSON5?
    • Yes.
  • Unquoted object keys - is there any chance you can limit this to an explicitly specified character set without relying on some unicode support? Our library (https://github.com/taocpp/json) tries to be dependency-free and having to deal with unicode character properties seems to go against the simplicity of JSON (and hopefully JSON5). In fact, I'd consider alnums + _ + $ (without a leading cipher) to be good enough, if you want throw in more characters explicitly like -, ., ... but Unicode is way too much IMHO.
    • Currently JSON5 only supports unquoted property names that follow the regex pattern /[$_A-Za-z][$_A-Za-z0-9]*/. To better align with ECMAScript 5, support for certain alphanumeric Unicode characters has been proposed. (- and . would not be included.) For an example of this implementation, see jsonext, which aligns with ECMAScript 6. Its build.js file generates regular expressions that match the Unicode characters allowed in property names.
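
The unquoted-key rule quoted above can be sketched in Python as follows. This is only an illustration of the regex given in the reply; the function name `is_unquoted_key` is hypothetical and not part of any JSON5 library.

```python
import re

# Pattern quoted in the reply above. fullmatch() anchors it to the whole key,
# so nothing may precede or follow the identifier.
UNQUOTED_KEY = re.compile(r"[$_A-Za-z][$_A-Za-z0-9]*")


def is_unquoted_key(text: str) -> bool:
    """Return True if text could be written as an unquoted property name."""
    return UNQUOTED_KEY.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["$foo_1", "_private", "1abc", "a-b", "a.b"]:
        print(candidate, is_unquoted_key(candidate))
```

Because the character classes are spelled out explicitly, the check needs no Unicode property tables; anything outside this set has to be written as a quoted string.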

@d-frey
Author

d-frey commented Jun 3, 2017

Thanks, this seems quite reasonable and I think my implementation in our library is now complete.

I can only hope that the allowed identifiers will stay as they are, I think it is quite sufficient and anything beyond /[$_A-Za-z][$_A-Za-z0-9]*/ can surely go into an ordinary string :)

@d-frey d-frey closed this as completed Jun 3, 2017
@d-frey
Author

d-frey commented Jun 8, 2017

One more question: I noticed that ECMAScript allows multiple zeroes (and JSON5's Grammar refers to it). Is this a JSON5 extension or not? Also, would you accept a closed grammar for JSON5 if I write one? (I basically have it already, I just need to convert it from PEGTL-format to an ABNF-like syntax) I think it would make sense to explicitly define the complete and unambiguous JSON5 grammar.

@jordanbtucker
Member

Short answer: No.

Long answer:

In ECMAScript 5, a leading zero indicates an octal number if it is immediately followed by one or more numeric digits. In the literal 00, the first 0 is treated as an indicator that the number is octal (base 8) and the second 0 is treated as the actual octal number, which results in 0.

If you try this in ECMAScript 5 in strict mode, you'll get a syntax error indicating that octal literals are not allowed. And regardless of whether you're in strict mode, you'll get a syntax error if you try things like 00.0 or 00e0.

Like strict ES5, neither JSON nor JSON5 allows octal literals, so 00 is an invalid token. Interestingly, in ES5, 08 should be a syntax error (even when not in strict mode); however, many parsers just interpret it as the number 8, willfully ignoring the ES5 spec.

@jordanbtucker
Member

Here's another discrepancy between JSON5 and ES5 you may not be aware of. There is no such thing as a negative numeric literal in ES5 like there is in JSON and JSON5.

In ES5 -1 is treated as the two tokens - and 1 with the first being a unary negation operator and the second being a numeric literal. This means that - 1 is also a valid expression (note the space between - and 1). This also means that - - - 1 is a valid expression the same as -(-(-(1))). (Note that ---1 is not valid because -- is a different operator that cannot be applied to literals.)

In JSON and JSON5, there are no unary operators, so - 1 would throw a syntax error. The - must immediately precede the 1 with no characters in between. The same goes for + 1 in JSON5.
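
The number rules discussed so far (sign directly attached, no leading zeros, a leading or trailing decimal point but not both, signed NaN and Infinity) can be summarized as a regular expression. The following is a minimal sketch under those assumptions only; hexadecimal literals are left out for brevity, and the name `is_json5_number` is hypothetical.

```python
import re

JSON5_DECIMAL = re.compile(
    r"""
    [+-]?                                    # optional sign, directly attached
    (?:
        NaN | Infinity
      | (?: (?:0|[1-9][0-9]*) (?:\.[0-9]*)?  # integer part, optional fraction ("1." is fine)
          | \.[0-9]+                         # leading decimal point (".5")
        )
        (?:[eE][+-]?[0-9]+)?                 # optional exponent
    )
    """,
    re.VERBOSE,
)


def is_json5_number(text: str) -> bool:
    """Return True if text is a single decimal, NaN, or Infinity token."""
    return JSON5_DECIMAL.fullmatch(text) is not None


if __name__ == "__main__":
    for token in ["-1", "+1", "- 1", "00", "08", ".", "1.", ".5", "+NaN"]:
        print(repr(token), is_json5_number(token))
```

Since the sign is part of the token rather than an operator, any whitespace between the sign and the digits makes the match fail, which mirrors the syntax error described above.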

@d-frey
Author

d-frey commented Jun 8, 2017

OK, thanks.

What about the free-standing grammar for JSON5? I think JSON5 would be great if it would stand on its own, referring to ECMAScript (5) might seem natural to you (and others that work with it), but it is completely alien to me (and possibly others that also don't use it).

Making it independent (and only taking care of being a (sub-set) of ECMAScript in the background) will likely turn it into a more concise and accessible standard. One which I already like, and I can only repeat my offer to help it grow to its full potential :) But ultimately, it is your decision.

@jordanbtucker
Member

jordanbtucker commented Jun 8, 2017

Okay, so I've gotten myself in a tough spot now. A draft of the JSON5 spec exists at https://github.com/json5/json5-spec, but the information I've been giving you is based on the reference implementation of JSON5, which doesn't completely follow the spec.

It is a goal of mine to align the reference implementation to the spec, but I haven't found the time. I think effort would be better used by aligning the reference implementation to the spec rather than writing a new spec based on the implementation. When complete, this would be JSON5 version 1.0 and the spec and reference implementation would likely be frozen at that point (with occasional bug fixes).

Here is the only discrepancy I can think of between the reference implementation and the spec, but there may be others. According to the spec, strings allow any character to be escaped unless it has a special meaning. The characters that have special meaning are: b, f, n, r, t, v, x, u, 0, 1 through 9 (which are errors), \r (carriage return), and \n (line feed). Any other character is treated as the character itself without the preceding \. (For example, '\a' is the same as 'a' just like "\'" is the same as "'".) Taking after JSON, the JSON5 reference implementation only allows a subset of escapes and v and x aren't even included among them.

There is also some work to be done on the spec as listed at json5/json5-spec#1

@jordanbtucker
Member

jordanbtucker commented Jun 8, 2017

Here's another discrepancy. The whitespace allowed in the spec does not match the whitespace allowed in the reference implementation. Namely \u2028 and \u2029 are valid whitespace characters in the spec, but they are not in the reference implementation.

If you take the implementation of jsonext and comment out all of the features added in ES6, you get a JSON5 implementation that follows the official JSON5 spec (with Unicode support). Those ES6 features are binary and octal literals (0o777 and 0b1010), template strings (`abc`), and Unicode code point escapes ('\u{20BB7}').

@d-frey
Author

d-frey commented Jun 8, 2017

I still have a hard time understanding exactly what is intended and what is not. For example: you refer to escape sequences from ES5, and only the edit clarifies that v and x are excluded. What about 0-9?

Anyway, I went ahead and wrote a first version of an ABNF for JSON5. Does this look reasonable to you? (It's actually based on and extends the JSON ABNF from RFC 7159.)

;--------------------------------------------------------
; Proposed grammar for JSON5 (http://json5.org/)
; Questions? mailto:d.frey@gmx.de

eol = %x0D.0A / %x0A / %x0D    ; Accept any line ending (CRLF, LF, CR)

; TODO: These probably need to be more complex, not just up to %x10FFFF
p-char = %x20-10FFFF          ; Printable character

p-char-non-star = %x20-29 / %x2B-10FFFF
                              ; Printable character except *
p-char-non-slash = %x20-2E / %x30-10FFFF
                              ; Printable character except /

; TODO: Allow sl-comment as the last line without eol?
sl-comment = %x2F.2F *( %x09 / p-char ) eol

ml-comment = %x2F.2A *( p-char-non-star / ( %x2A p-char-non-slash ) / %x09 / eol ) %x2A.2F

comment = sl-comment / ml-comment

; TODO: Add %xA0 (NBSP) and/or %xFEFF (BOM)?
; TODO: Shouldn't a BOM only be allowed at the start of the input?
ws = *(
          %x20 /              ; Space
          %x09 /              ; Horizontal tab
          eol /               ; Line ending
          sl-comment /        ; Single-line comment
          ml-comment          ; Multi-line comment
      )

;--------------------------------------------------------

begin-array     = ws %x5B ws  ; [ left square bracket
begin-object    = ws %x7B ws  ; { left curly bracket
end-array       = ws %x5D ws  ; ] right square bracket
end-object      = ws %x7D ws  ; } right curly bracket
name-separator  = ws %x3A ws  ; : colon
value-separator = ws %x2C ws  ; , comma

value-sep-opt   = [ value-separator ]

;--------------------------------------------------------

null  = %x6E.75.6C.6C         ; null
true  = %x74.72.75.65         ; true
false = %x66.61.6C.73.65      ; false

;--------------------------------------------------------

number = [ plus / minus ] ( nan / inf / hex / dec )

nan = %x4E.61.4E              ; NaN

inf = %x49.6E.66.69.6E.69.74.79
                              ; Infinity

hex = zero x 1*HEXDIG         ; 0xXXX...

dec = ( int [ frac0 ] / frac1 ) [ exp ]

decimal-point = %x2E          ; .

digit1-9 = %x31-39            ; 1-9

e = %x65 / %x45               ; e E

x = %x78 / %x58               ; x X

exp = e [ plus / minus ] 1*DIGIT

frac0 = decimal-point *DIGIT

frac1 = decimal-point 1*DIGIT

int = zero / ( digit1-9 *DIGIT )

plus = %x2B                   ; +

minus = %x2D                  ; -

zero = %x30                   ; 0

;--------------------------------------------------------

string = s-string / d-string

d-string = d-quotation-mark *( char / s-quotation-mark ) d-quotation-mark

s-string = s-quotation-mark *( char / d-quotation-mark ) s-quotation-mark

char = unescaped /
       escape (
           eol /              ; escaped newline
           %x62 /             ; b    backspace       U+0008
           %x66 /             ; f    form feed       U+000C
           %x6E /             ; n    line feed       U+000A
           %x72 /             ; r    carriage return U+000D
           %x74 /             ; t    tab             U+0009
           %x76 /             ; v    vtab            U+000B
           %x78 2HEXDIG /     ; xXX                  U+00XX
           %x75 4HEXDIG /     ; uXXXX                U+XXXX
           other-escape )     ; no special meaning

escape = %x5C                 ; \

d-quotation-mark = %x22       ; "

s-quotation-mark = %x27       ; '

unescaped = %x20-21 / %x23-26 / %x28-5B / %x5D-10FFFF

; TODO: Exclude 0-9?
other-escape = %x20-61 / %x63-65 / %x67-6D / %x6F-71 / %x73 / %x77 / %x79-10FFFF

;--------------------------------------------------------

; TODO: Is [,] allowed? No.

array = begin-array [ value *( value-separator value ) value-sep-opt ] end-array

;--------------------------------------------------------

; TODO: Is {,} allowed? No.

object = begin-object [ member *( value-separator member ) value-sep-opt ] end-object

member = key name-separator value

key = string / identifier

begin-identifier = ALPHA / %x5F / %x24

continue-identifier = begin-identifier / DIGIT

identifier = begin-identifier *continue-identifier

;--------------------------------------------------------

value = null / true / false / number / string / array / object

JSON5-text = ws value ws

@jordanbtucker
Member

That ABNF looks like a good starting point. NBSP is valid whitespace in JSON5. Section 7.1 explains why the BOM is allowed after the start of the document. 1-9 are not valid escape characters but 0 is.

Here are the character escapes in strings and how they should be handled. Each "sequence" refers to the character(s) immediately following the \ (U+005C) character.

| Sequence | Result | Notes |
|---|---|---|
| b (U+0062) | U+0008 | Backspace |
| t (U+0074) | U+0009 | Horizontal tab |
| n (U+006E) | U+000A | Line feed |
| v (U+0076) | U+000B | Vertical tab |
| f (U+0066) | U+000C | Form feed |
| r (U+0072) | U+000D | Carriage return |
| 0 (U+0030) | U+0000 | Nul character |
| 1 through 9 (U+0031 through U+0039) | Error | Octal escapes are not supported |
| x (U+0078) followed by two hex digits | The character with the code point of the hex number | For example, \x61 becomes a. It is a syntax error if the x is not followed by two hex digits. |
| u (U+0075) followed by four hex digits | The character with the code point of the hex number | For example, \u0061 becomes a. It is a syntax error if the u is not followed by four hex digits. |
| U+000A | Nothing | Escaped line feed results in an empty string |
| U+000D U+000A | Nothing | Escaped carriage return followed by a line feed results in an empty string |
| U+000D | Nothing | Escaped carriage return not followed by a line feed results in an empty string |
| U+2028* | Nothing | Escaped line separator results in an empty string |
| U+2029* | Nothing | Escaped paragraph separator results in an empty string |
| Any other character | The character itself | For example, \a becomes a, \\ becomes \, and \" becomes " |

*Whether escaped line and paragraph separators should be allowed as line continuations is still up for discussion. See #70, which discusses these characters but does not touch on whether they should be treated as line continuations when escaped in strings.
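
To make the escape rules concrete, here is a minimal Python sketch of a decoder for the text between the quotes of a JSON5 string. It follows the rules as listed in this comment, including the still-undecided U+2028 and U+2029 line continuations. The names `decode_escapes`, `SIMPLE`, and `HEX` are hypothetical, and the sketch does not validate unescaped characters such as raw line breaks or the closing quote.

```python
# Single-character escapes and the characters they produce.
SIMPLE = {"b": "\b", "t": "\t", "n": "\n", "v": "\v", "f": "\f", "r": "\r", "0": "\0"}
HEX = set("0123456789abcdefABCDEF")


def decode_escapes(body: str) -> str:
    """Decode backslash escapes in the content of a JSON5 string."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch != "\\":
            out.append(ch)
            continue
        if i >= len(body):
            raise ValueError("dangling backslash at end of string")
        esc = body[i]
        i += 1
        if esc in SIMPLE:
            out.append(SIMPLE[esc])
        elif esc in "123456789":
            raise ValueError("octal escapes are not supported")
        elif esc in "xu":
            width = 2 if esc == "x" else 4
            digits = body[i:i + width]
            if len(digits) != width or not set(digits) <= HEX:
                raise ValueError(f"\\{esc} must be followed by {width} hex digits")
            out.append(chr(int(digits, 16)))
            i += width
        elif esc == "\r":
            # Escaped CR, optionally followed by LF: a line continuation.
            if body[i:i + 1] == "\n":
                i += 1
        elif esc in "\n\u2028\u2029":
            pass  # Line continuation: contributes nothing to the result.
        else:
            out.append(esc)  # Any other character stands for itself.
    return "".join(out)


if __name__ == "__main__":
    print(repr(decode_escapes("hello,\\\n world")))
    print(repr(decode_escapes(r"\x61\u0061\a")))
```

Handling the CR case before the other line terminators keeps an escaped CRLF pair from leaving a stray line feed in the output.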

@d-frey
Author

d-frey commented Jun 8, 2017

Updated:

;--------------------------------------------------------
; Proposed grammar for JSON5 (http://json5.org/)
; Questions? mailto:d.frey@gmx.de
;--------------------------------------------------------

eol = %x0D.0A / %x0A / %x0D   ; End-of-line (CRLF, LF, CR)

;--------------------------------------------------------

p-char = %x20-10FFFF          ; Printable character

p-char-non-star = %x20-29 / %x2B-10FFFF
                              ; Printable character except *
p-char-non-slash = %x20-2E / %x30-10FFFF
                              ; Printable character except /

; TODO: Allow sl-comment as the last line without eol?

sl-comment = begin-sl-comment *( p-char / ows ) eol

ml-comment = begin-ml-comment *( p-char-non-star / ( %x2A p-char-non-slash ) / ows / eol ) end-ml-comment

comment = sl-comment / ml-comment

;--------------------------------------------------------

begin-sl-comment = %x2F.2F    ; //
begin-ml-comment = %x2F.2A    ; /*
end-ml-comment   = %x2A.2F    ; */

;--------------------------------------------------------

ws = *(
          %x20 /              ; Space
          ows /               ; Other space-like characters
          eol /               ; Line ending
          sl-comment /        ; Single-line comment
          ml-comment          ; Multi-line comment
      )

ows = %x09 /                  ; Horizontal tab
      %xA0 /                  ; NBSP
      %xFEFF                  ; BOM

;--------------------------------------------------------

begin-array     = ws %x5B ws  ; [ left square bracket
begin-object    = ws %x7B ws  ; { left curly bracket
end-array       = ws %x5D ws  ; ] right square bracket
end-object      = ws %x7D ws  ; } right curly bracket
name-separator  = ws %x3A ws  ; : colon
value-separator = ws %x2C ws  ; , comma

value-sep-opt   = [ value-separator ]

;--------------------------------------------------------

null  = %x6E.75.6C.6C         ; null
true  = %x74.72.75.65         ; true
false = %x66.61.6C.73.65      ; false

;--------------------------------------------------------

number = [ plus / minus ] ( nan / inf / hex / dec )

nan = %x4E.61.4E              ; NaN

inf = %x49.6E.66.69.6E.69.74.79
                              ; Infinity

hex = zero x 1*HEXDIG         ; 0xXXX...

dec = ( int [ frac0 ] / frac1 ) [ exp ]

decimal-point = %x2E          ; .

digit1-9 = %x31-39            ; 1-9

e = %x65 / %x45               ; e E

x = %x78 / %x58               ; x X

exp = e [ plus / minus ] 1*DIGIT

frac0 = decimal-point *DIGIT

frac1 = decimal-point 1*DIGIT

int = zero / ( digit1-9 *DIGIT )

plus = %x2B                   ; +

minus = %x2D                  ; -

zero = %x30                   ; 0

;--------------------------------------------------------

string = s-string / d-string

d-string = d-quotation-mark *( char / s-quotation-mark ) d-quotation-mark

s-string = s-quotation-mark *( char / d-quotation-mark ) s-quotation-mark

char = unescaped /
       escape (
           %x30 /             ; 0    nul             U+0000
           %x62 /             ; b    backspace       U+0008
           %x66 /             ; f    form feed       U+000C
           %x6E /             ; n    line feed       U+000A
           %x72 /             ; r    carriage return U+000D
           %x74 /             ; t    tab             U+0009
           %x76 /             ; v    vtab            U+000B
           %x78 2HEXDIG /     ; xXX                  U+00XX
           %x75 4HEXDIG /     ; uXXXX                U+XXXX

           eol /              ; end-of-line -> empty string
           %x2028 /           ; line separator -> empty string
           %x2029 /           ; paragraph separator -> empty string
                              ; TODO: Remove U+2028 and U+2029? See #70

           other-escape )     ; the character itself

escape = %x5C                 ; \

d-quotation-mark = %x22       ; "

s-quotation-mark = %x27       ; '

unescaped = %x20-21 / %x23-26 / %x28-5B / %x5D-10FFFF

other-escape = %x20-2F / %x3A-61 / %x63-65 / %x67-6D / %x6F-71 / %x73 / %x77 / %x79-10FFFF

;--------------------------------------------------------

array = begin-array [ value *( value-separator value ) value-sep-opt ] end-array

;--------------------------------------------------------

object = begin-object [ member *( value-separator member ) value-sep-opt ] end-object

member = key name-separator value

key = string / identifier

begin-identifier = ALPHA / %x5F / %x24
                              ; ALPHA / "_" / "$"

continue-identifier = begin-identifier / DIGIT

identifier = begin-identifier *continue-identifier

;--------------------------------------------------------

value = null / true / false / number / string / array / object

JSON5-text = ws value ws

@d-frey
Author

d-frey commented Jun 8, 2017

Remarks about the grammar:

  • Allows escaped U+2028 and U+2029 for now, they are easy to remove if they are not wanted.
  • Disallows trailing commas on empty arrays/objects. OK? ([,], {,})
  • Restricts identifiers to (ALPHA/"_"/"$") *(ALNUM/"_"/"$").
  • Adding \0 and \xXX as escape-sequences makes sense thinking about bytes, but JSON strings are required to be valid unicode strings - and those may not contain embedded nul-bytes IIUC. What is the intended semantics of those? Or does JSON5 allow "binary" strings, meaning any byte-combination is allowed? And if that is the case, what is the semantics of escaped surrogate pairs?
  • Should the grammar be committed somewhere?

@jordanbtucker
Member

jordanbtucker commented Jun 9, 2017

  • Disallows trailing commas on empty arrays/objects. OK? ([,], {,})

    • This is correct. A comma can only appear after an object member or array element.
  • Adding \0 and \xXX as escape-sequences makes sense thinking about bytes, but JSON strings are required to be valid unicode strings - and those may not contain embedded nul-bytes IIUC. What is the intended semantics of those? Or does JSON5 allow "binary" strings, meaning any byte-combination is allowed? And if that is the case, what is the semantics of escaped surrogate pairs?

    • The only Unicode code points that are invalid are values U+D800 through U+DFFF and values larger than U+10FFFF. This means that "\0", "\x00", "\u0000", "\xFF", and "\u00FF" are all valid Unicode strings in ES5 and JSON5 (and the \u versions are valid in JSON). So it's technically possible to store binary data in a string this way.

      Although it's possible to store binary data as a string of character escapes, it's recommended to encode the data as Base64 and store the result as a string.

      The string '\uD800' doesn't represent a valid Unicode string because it's only half of a surrogate pair. JSON5 doesn't define how to handle this situation. I would encourage implementations to either throw an error or replace the character with U+FFFD.

      Historically, strings and byte arrays were equivalent. The string 'abc' was stored in memory and on disk as the byte array 61 62 63. Likewise, '\x81\x82\x83' was stored as 81 82 83. So it made sense to represent byte arrays as strings. However, once Unicode became popular, strings started being stored in memory as UTF-16, so 'abc' became 00 61 00 62 00 63 and when stored on disk or sent over the network, they were usually represented as UTF-8, so '\x81\x82\x83' became C2 81 C2 82 C2 83.

  • Should the grammar be committed somewhere?
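
The byte sequences mentioned in the reply above can be reproduced with Python's built-in codecs. This is just a quick check of those figures plus the Base64 recommendation; the helper name `hex_bytes` is hypothetical, and UTF-16 is shown in big-endian order to match the bytes quoted.

```python
import base64


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return data.hex(" ").upper()


text = "abc"
binary_like = "\x81\x82\x83"  # code points U+0081, U+0082, U+0083

print(hex_bytes(text.encode("latin-1")))           # 61 62 63
print(hex_bytes(text.encode("utf-16-be")))         # 00 61 00 62 00 63
print(hex_bytes(binary_like.encode("latin-1")))    # 81 82 83
print(hex_bytes(binary_like.encode("utf-8")))      # C2 81 C2 82 C2 83

# The recommended alternative: carry the raw bytes as Base64 text.
raw = bytes([0x81, 0x82, 0x83])
encoded = base64.b64encode(raw).decode("ascii")
assert base64.b64decode(encoded) == raw
```

The UTF-8 line shows why escapes are a fragile way to carry binary data: each code point above U+007F turns into two bytes once the string is serialized, whereas the Base64 text round-trips to the original three bytes.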

@d-frey
Author

d-frey commented Jun 9, 2017

OK, so I'll just add the raw bytes for \0 and \xXX - let's see how that will turn out :)

Next remarks:

  • The grammar doesn't allow it (yet), but I guess a single-line comment is allowed to be at the end of the input without an explicit eol? (complete input example: "foo" // a string - no newline anywhere)
  • Are negative (or explicitly positive) hexadecimal values allowed? -0x1234 or +0x1234 instead of just plain 0x1234?

@jordanbtucker
Member

  • The grammar doesn't allow it (yet), but I guess a single-line comment is allowed to be at the end of the input without an explicit eol? (complete input example: "foo" // a string - no newline anywhere)
    • Yes.
  • Are negative (or explicitly positive) hexadecimal values allowed? -0x1234 or +0x1234 instead of just plain 0x1234?

@d-frey
Author

d-frey commented Jun 9, 2017

I did some final changes and polishing and created a PR as requested. Now I'll have to do my homework and fix our library's JSON5 parser :)

@d-frey
Author

d-frey commented Jun 11, 2017

I think I now implemented everything in our library, see https://github.com/taocpp/json

Have you found the time to review the changes? (grammar-wise wrt the JSON grammar from RFC 7159, not our library)

@jordanbtucker
Member

I recommend running your library against the JSON5 test suite at https://github.com/json5/json5-tests.

@d-frey
Author

d-frey commented Jun 11, 2017

Running the test-suite is not the same as reviewing a grammar as a human.

Also, that test suite does not contain JSON reference strings. Currently, your test suite only tests whether or not something parses, but not how. With a reference string, I could at least compare the result of parsing JSON5 to something from a well-known and working JSON parser. Example:

"hello,\
 world"

should be identical to this JSON:

"hello, world"

and not this:

"hello,\n world"
