Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Nokogiri/JRuby line numbering off by one when XML prolog is in the source document #2380

Closed
dabdine opened this issue Dec 8, 2021 · 10 comments

Comments

@dabdine
Copy link
Contributor

dabdine commented Dec 8, 2021

Please describe the bug

Nokogiri under JRuby does not properly report line numbers when XML prolog is included at the beginning of a document.

Help us reproduce what you're seeing

See this DRAFT PR for tests that fail under JRuby (but should succeed under CRuby -- I can't yet validate that and I created a PR to attempt to use the Nokogiri CI to validate this for me).
#2379

Expected behavior

The line numbers (as demonstrated by the tests) should not be off by one in Nokogiri under JRuby.

Environment

N/A

Additional context

Test output from the referenced PR:

Failure:
Nokogiri::XML::Node::#line#test_0003_properly numbers lines with documents containing XML multiline comments [/home/dabdine/devel/nokogiri/test/xml/test_node.rb:1290]
Minitest::Assertion: Expected: 6
  Actual: 4

Failure:
Nokogiri::XML::Node::#line#test_0001_properly numbers lines with documents containing XML prolog [/home/dabdine/devel/nokogiri/test/xml/test_node.rb:1256]
Minitest::Assertion: Expected: 3
  Actual: 2
@dabdine dabdine added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Dec 8, 2021
@flavorjones
Copy link
Member

👋 Thanks for opening this issue. Line numbering in the JRuby implementation is known to be unreliable, you can read about why at #1223 and see how it was made slightly less unreliable (or at least consistently unreliable) at #2177.

You can see by my comments in #1223 that I've exhausted my knowledge of the Java APIs to implement this correctly. If you've got the energy and/or the knowledge to implement this correctly, I would very very happily merge a PR that fixes it.

@flavorjones flavorjones added platform/jruby and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Dec 8, 2021
@dabdine
Copy link
Contributor Author

dabdine commented Dec 8, 2021

@flavorjones thanks! I did read both of those. Minimally I wanted to document this edge case. I saw that you had the XML prolog tag in some of the comments there, but it didn't stand out as a particular edge case, so figured filing a defect could be helpful for tracking purposes. Let me take a look at the line numbering logic to see if there are ways to improve that.

@flavorjones
Copy link
Member

@dabdine I'm curious if you have plans to work on a fix for this? Again, just to be clear, I'm unlikely to work on this and this feels like a good opportunity for a JRuby user to help out by contributing fixes if they can.

@dabdine
Copy link
Contributor Author

dabdine commented Dec 16, 2021

@flavorjones from my analysis, I think we'd need to rewrite the entire DOM parser using a SAX parser to track line numbers as AFAICT Xerces doesn't track this, and the Java interfaces/classes certainly do not. Even then, I believe the SAX parser will track the line number as the line with the closing tag. So, we would also need to validate that the line number is consistent for multiline XML between libxml. If the SAX parser does return the line number of the closing tag in a multiline XML document, we would need to use sibling traversal to recover the starting tag line number.

I don't quite have the time to rewrite the DOM parsing logic at the moment, which would undoubtedly be a significant change the current Java implementation.

@flavorjones
Copy link
Member

OK, you've reached the same conclusion as me, then, which is a) it will be an invasive change b) which doesn't seem worth the investment of developer time right now. Blurgh.

I don't usually like to leave an issue open if nobody's going to work on it (some context around why I feel this way is here, but I'm happy to leave this open for a bit if you think it's important. WDYT?

@dabdine
Copy link
Contributor Author

dabdine commented Dec 16, 2021

I think we close it and maybe document the failing jruby cases? Namely that it doesn't take into account XML prolog and multi line xml comments (however, single line does work), maybe referencing this issue. What do you think?

@flavorjones
Copy link
Member

@dabdine 🤔 Yes, how about I take #2379 and make those tests "pending" (in the sense that they fail, and we expect them to fail, but we wish they didn't)?

flavorjones added a commit to dabdine/nokogiri that referenced this issue Dec 16, 2021
See context in sparklemotion#2380. The fix on JRuby is expensive and is probably
not worth the investment of developer time right now.

But let's preserve those failures as a known issue by introducing the
`pending` and `pending_if` test helpers.
@dabdine
Copy link
Contributor Author

dabdine commented Dec 17, 2021

@flavorjones works for me

@flavorjones
Copy link
Member

See #2379 for my observation that JRuby/Xerces is emitting line numbers from the final formatted DOM, while CRuby/libxml2 emits line numbers from the original parsed document. I'd like to accept this difference in semantic behavior and document it.

flavorjones added a commit to dabdine/nokogiri that referenced this issue Dec 22, 2021
See context in sparklemotion#2380. The fix on JRuby is expensive and is probably
not worth the investment of developer time right now.

But let's preserve those failures as a known issue by introducing the
`pending` and `pending_if` test helpers.
@flavorjones
Copy link
Member

Merged #2379

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants