Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal to expose Position information used for error messages in JSonReader in public API #2373

Open
jurgenvinju opened this issue Apr 12, 2023 · 8 comments

Comments

@jurgenvinju
Copy link

Thanks for maintaining this library!

Problem solved by the feature

  • Many projects use JSON notation for their DSL's syntax
  • Reusing Gson implies high quality and compatible parsing for those DSLs
  • IDE support for DSLs requires position information (offset/length, start line/column, end line/column
  • For example to create Language Servers for the LSP for the given (embedded) DSLs

Feature description

Proposal to expose the following fields of JsonReader:

private int pos = 0;
private int lineNumber = 0;
private int lineStart = 0;

With these public methods:

public int getPosition() {
  return pos;
}

public int getLineNumber() {
  return lineNumber + 1; // posix has the first line at 1
}

public int getColumnNumber() {
  return pos - lineStart; // posix has columns start at 0
}

Using these methods just before beginObject() and just after endObject() you get an accurate measurement
of the span of every object in the character stream.

Alternatives / workarounds

  • We currently use the reflection API to access these fields in the usethesource/rascal meta-programming project.
  • That seems to work fine, but it's not as nice as it could be :-)
@Marcono1234
Copy link
Collaborator

We currently use the reflection API to access these fields

How reliable are these values though? It has been a while since I had a look at that but there might be cases where JsonReader updates these values before the token has actually been consumed (probably related: #1764).

@jurgenvinju
Copy link
Author

So far so good. How about I test them for a few weeks and see what comes up?

  • The pos value can indeed be off due to "lookahead" which is later corrected by subtracting again. It matters a lot when you ask for the position. So far asking before or after beginObject and endObject seems to be completely safe. The code hints at issues when items are missing completely, such as empty arrays.
  • The lineNumber counter is extremely simple and does not seem to be in danger of being incorrect
  • lineStart is simple in similar ways.

@Marcono1234
Copy link
Collaborator

Marcono1234 commented Apr 12, 2023

How about I test them for a few weeks and see what comes up?

You don't have to test this only for this; I was just curious if you are already aware of any issues.

My main concern is that by exposing this information with a public API JsonReader would make a strong commitment that the values are correct (or at least users might expect this). Whereas currently the location information in the exception messages is probably on a best effort basis (?). A big problem with exposing this information is also that it exposes the implementation details of when and how JsonReader skips whitespace or JSON tokens.
For example a call to getPosition() would have to guarantee that all whitespace in front of the next token has been skipped. Similarly a call to getPosition() after reading an array item or a object member name must not have skipped , or : (or at least I assume that is what users expect). Maybe JsonReader already adheres to that by accident, but it imposes some restrictions on how JsonReader can parse the data.

But maybe these concerns are unjustified and supporting this in a reliable way would not be that difficult.

Note that I am not a direct member of this project; they might think differently about this.

@jurgenvinju
Copy link
Author

jurgenvinju commented Apr 13, 2023

That makes perfect sense. Your concern is with correctness and what the definition of it is.

Let's have a go at defining what a "correct" implementation of getPosition() would do, and relate that specifically to the post-conditions of every other method in JsonReader. Later we can see if this definition is indeed implemented by the class and then we also know how to test for these properties. pos^ means pos before executing a method and pos is after executing the method.

  • beginArray: input[pos] == '['
  • endArray: input[pos-1] == ']'
  • beginObject: input[pos] == '{'
  • endObject: input[pos-1] == '}'
  • hasNext: pos^ == pos (pos did not change)
  • peek(): pos^ == pos
  • String nextName(): pos = pos^ + length(returnValue)
  • String nextString(): input[pos^] == '"' ==> input[pos-1] == '"' && pos >= pos^ + length(returnValue) // unescaping makes the string possibly shorter, but not longer
  • String nextBoolean(): pos = pos^ + length(returnValue)
  • String nextNull(): pos = pos^ + 4; // 'null' has four characters
  • String nextDouble(): pos >= pos^ + length(returnValue) // different unicode escapes can still lead to shorter string doubles
  • String nextLong(): pos >= pos^ + length(returnValue) // different unicode escapes can still lead to shorter string longs
  • String nextInt(): pos >= pos^ + length(returnValue) // different unicode escapes can still lead to shorter string longs
  • void close(): pos == pos^
  • skipValue(): pos > pos^
  • toString(): pos == pos^
  • getPath(): pos == pos^

WDYT? did I get this right?

@jurgenvinju
Copy link
Author

For example a call to getPosition() would have to guarantee that all whitespace in front of the next token has been skipped.

This one follows from post-conditions such as input[pos]=='[' ; I've tried not to generalize to "all" whitespace and "any" token, in order for the post-conditions to remain as simple as possible. Nevertheless the general idea is exactly as you suggested.

@Marcono1234
Copy link
Collaborator

did I get this right?

Note sure for hasNext() and peek(). The current implementation does not eagerly peek at the next value, but only once you call these methods (or any value consuming method). Similarly all assumptions about for pos^ might only hold if you first either called hasNext() or peek() and then afterwards pos^ = getPosition() before calling methods for consuming the value.

For array and objects I think it should be input[pos^] = [ | { | ] | } and input[pos-1] = [ | { | ] | }, shouldn't it? That is, pos^ should point at the opening / closing bracket and pos should point right behind it.

For nextString() I think it should be pos >= ... + 1 to account for the closing " (which is at pos - 1, as shown in your comment).

Another point might be lenient-mode of JsonReader; I am not sure if it makes sense to make any guarantees there for the position methods or if the behavior there should just be declared undefined, because detection of unquoted strings and for example trailing , being an implicit null might lead to incorrect positions.

@Marcono1234
Copy link
Collaborator

Marcono1234 commented Apr 13, 2023

Maybe it would be easier if you looked for a JSON parser which is explicitly designed to report accurate position information, possibly one which is used by IDEs such as in IntelliJ IDEA; haven't checked though which library they are using.

Or alternatively a parser generator library such as ANTLR could be used, they already have an existing grammar for JSON: https://github.com/antlr/grammars-v4/blob/master/json/JSON.g4 (but the licensing might be a bit unclear, see antlr/grammars-v4#1607)
That would have the advantage that you get accurate locations for start and end of tokens.

@eamonnmcmanus
Copy link
Member

I agree with the points made by @Marcono1234. If we exposed this information, there might be some impact on performance of certain cases (because the position we record now doesn't quite make sense). We would also be tying our hands for future parser changes, since we'd need to be sure to preserve the reported positions.

I think the current situation, where you can get this information but only through reflection, is probably better. That makes it clear that you're going through a back door and that the reported positions might change in future versions of Gson. We probably wouldn't change them gratuitously: if you want, you can contribute a test that would fail if they did change. We still wouldn't guarantee anything, but at least we would know if one of our changes was going to alter the reported positions.

Fundamentally I think there's a conflict between a fast parser, which Gson aims to be, and a detailed one. It might indeed be better to look at another JSON library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants