Improvements in string parse #234

jlguardi · 2020-06-18T10:47:41Z

I've submitted a previous PR (#233) with some improvements for string parsing. My proposed version has this features:

Avoid OOM or Stack size exceed generated with long strings in the current version v2.1.3
Time parsing improvements. It can parse a 10MB string in just 1 second while v2.1.3 requires more than 2 mins. Longer sizes crashes current version.
Keeps all features of v2.1.3

However, this is not enough for my application (I receive strings with more than 100MB). These strings are parsed in ~6 secs and this is too much for my application. With JSON, I can parse these strings in less than a second (600-800 ms) but I can't use this parser due to I use JSON5 standard and some features aren't supported by JSON.

So I've implemented a new parser for strings (https://github.com/jlguardi/json5/tree/string_replace) which is able to parse in 700ms and I keep almost all features of v2.1.3:

Parsing time for >100MB strings in 700ms
All tests are passed correctly (but error position)
escaped sequences and special characters produce the same output
unsupported chars are detected and reported

However, I have 2 changes from current version in the error reporting:

I report the full error sequence. I mean, if it parses '"\u000g"', the current version reports an error like this:

JSON5: invalid character 'g' at 1:6

But my version reports this error:

JSON5: invalid sequence '\u000g' at 1:1

I guess this report is more descriptive than the previous one.

And the most important problem is that I can't identify the exact position of the error if I get a parsing error. So on parsing this string '"aaaa\\n\n\n\u000g"' the current v2.1.3 reports:

JSON5: invalid character 'g' at 2:12

but mine reports (by now):

JSON5: invalid sequence '\u000g' at 1:1

I could work on the error reporting to get a better location of the error but I'm quite sure it isn't possible report the real position due to escaped characters. Probably I could provide something like:

JSON5: invalid sequence '\u000g' near to 2:4

What do you suggest? What is more important, error reporting or execution time?

By now, I'll try to return the same location for error reporting but I'm not really sure if it is possible.

The text was updated successfully, but these errors were encountered:

jordanbtucker · 2020-06-18T15:55:07Z

Thanks for your work on this. Performance is something I haven't had time to work on, so this is a great help.

I like the error reporting changes. It makes things clearer. I would definitely say that accuracy is more important than performance though, especially since this is a reference parser.

jordanbtucker self-assigned this Jun 18, 2020

jordanbtucker added the enhancement 👍 label Jun 18, 2020

Gerrit0 mentioned this issue Sep 5, 2021

Switch from json5 to vscode-jsonc shikijs/shiki#222

Merged

jordanbtucker added this to the v3.0.0 milestone Feb 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvements in string parse #234

Improvements in string parse #234

jlguardi commented Jun 18, 2020

jordanbtucker commented Jun 18, 2020

Improvements in string parse #234

Improvements in string parse #234

Comments

jlguardi commented Jun 18, 2020

jordanbtucker commented Jun 18, 2020