Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements in string parse #234

Open
jlguardi opened this issue Jun 18, 2020 · 1 comment
Open

Improvements in string parse #234

jlguardi opened this issue Jun 18, 2020 · 1 comment
Assignees
Milestone

Comments

@jlguardi
Copy link

I've submitted a previous PR (#233) with some improvements for string parsing. My proposed version has this features:

  • Avoid OOM or Stack size exceed generated with long strings in the current version v2.1.3
  • Time parsing improvements. It can parse a 10MB string in just 1 second while v2.1.3 requires more than 2 mins. Longer sizes crashes current version.
  • Keeps all features of v2.1.3

However, this is not enough for my application (I receive strings with more than 100MB). These strings are parsed in ~6 secs and this is too much for my application. With JSON, I can parse these strings in less than a second (600-800 ms) but I can't use this parser due to I use JSON5 standard and some features aren't supported by JSON.

So I've implemented a new parser for strings (https://github.com/jlguardi/json5/tree/string_replace) which is able to parse in 700ms and I keep almost all features of v2.1.3:

  • Parsing time for >100MB strings in 700ms
  • All tests are passed correctly (but error position)
  • escaped sequences and special characters produce the same output
  • unsupported chars are detected and reported

However, I have 2 changes from current version in the error reporting:

  • I report the full error sequence. I mean, if it parses '"\u000g"', the current version reports an error like this:
JSON5: invalid character 'g' at 1:6

But my version reports this error:

JSON5: invalid sequence '\u000g' at 1:1

I guess this report is more descriptive than the previous one.

  • And the most important problem is that I can't identify the exact position of the error if I get a parsing error. So on parsing this string '"aaaa\\n\n\n\u000g"' the current v2.1.3 reports:
JSON5: invalid character 'g' at 2:12

but mine reports (by now):

JSON5: invalid sequence '\u000g' at 1:1

I could work on the error reporting to get a better location of the error but I'm quite sure it isn't possible report the real position due to escaped characters. Probably I could provide something like:

JSON5: invalid sequence '\u000g' near to 2:4

What do you suggest? What is more important, error reporting or execution time?

By now, I'll try to return the same location for error reporting but I'm not really sure if it is possible.

@jordanbtucker
Copy link
Member

Thanks for your work on this. Performance is something I haven't had time to work on, so this is a great help.

I like the error reporting changes. It makes things clearer. I would definitely say that accuracy is more important than performance though, especially since this is a reference parser.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants