[bug] Nokogiri::HTML::Document#to_s behaves oddly under JRuby #3103
Comments
Hey @postmodern, thanks for opening this issue. The short version is: yes, the various parsers used by Nokogiri have behavioral differences, particularly around serialization. While there are some options for controlling output, the parsers are just different. And we long ago decided to write this in the README as one of the project's guiding principles:
The git history is littered with commits trying to get outputs to match, but if you feel strongly enough about it to investigate, and you find a way to fix it that isn't something gross, I'd be open to it. My caveat is that whitespace generally doesn't have semantic meaning. Can I ask why this is important or disruptive for you?
@flavorjones that whitespace shouldn't be there, and wasn't there previously, also the empty If this problem keeps getting worse, I may have to remove JRuby from the CI, because I don't want to have to constantly adjust my specs for the quirks of the underlying Java XML/HTML parser.
@postmodern I understand you're frustrated. I have been asking for help with Nokogiri's JRuby (Java) implementation for over 14 years (receipt), and I share your frustration that it doesn't work better and isn't more consistent with the C extension. But the reality is that HTML4 parsers often behave differently from each other (I did a whole talk based on this fact at Rails World last year), and there's no easy way for Nokogiri to make the parsers behave the same. If you look at the test suites for Loofah and Rails::HTML::Sanitizer you'll see a few places where I've had to do
When you say "wasn't there previously" I suspect this means "before Nokogiri switched to
Would you consider instead using an approach like
You might also consider using pattern matching (introduced in Nokogiri v1.14.0) to assert on the key parts of the DOM structure.
I'd also recommend, if you can, using the HTML5 parser in Ronin (though HTML5 support is not present in JRuby, despite my requests for assistance, which doesn't help your situation).
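One possible way to keep a spec from failing on serializer whitespace alone is to normalize inter-tag whitespace on both sides before comparing strings. The following is a minimal sketch in plain Ruby, not something Nokogiri provides: `normalize_html_whitespace` is a hypothetical helper, and the two strings are illustrative rather than real output from either implementation.

```ruby
# Hypothetical helper (not part of Nokogiri): removes whitespace that sits
# only between two tags, so indentation added by a serializer is ignored.
# Whitespace inside text content is left untouched.
def normalize_html_whitespace(html)
  html.gsub(/>\s+</, "><").strip
end

# Illustrative strings only; not actual output from the C or JRuby implementation.
compact  = "<html><body><p>hello world</p></body></html>"
indented = "<html>\n  <body>\n    <p>hello world</p>\n  </body>\n</html>\n"

puts normalize_html_whitespace(compact) == normalize_html_whitespace(indented)
# prints true
```

This only papers over indentation differences. It does not hide structural differences such as an extra empty element, and collapsing whitespace between inline elements or inside preformatted content can change how a page renders, so it is best suited to assertions where layout whitespace is known to be irrelevant.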
And it may be interesting to note that
I'm happy to continue the conversation, but I'm going to mark the issue closed since there's nothing we can easily do in Nokogiri to address the issues you're seeing; and because of that difficulty, making the behavior of the JRuby impl exactly match the C impl is an anti-goal for the project.
I understand that Nokogiri uses a different XML/HTML parsing library for its JRuby bindings than libxml2. I noticed recently that Nokogiri::HTML::Document#to_s produces strange output under JRuby. Note that the actual output contains some unusual whitespace/indentation that shouldn't be there, and an empty <head></head> element.
Steps To Reproduce
Expected Results
Actual Results
Unit Test
Environment