[bug] Nokogiri::HTML::Document#to_s behaves oddly under JRuby #3103

postmodern · 2024-01-18T20:10:38Z

I understand that nokogiri uses a different XML/HTML parsing library for it's JRuby bindings than libxml2. I noticed recently that Nokogiri::HTML::Document#to_s produces strange output under JRuby. Note the actual output contains some unusual whitespace/indentation that shouldn't be there, and an empty <head></head> element.

Steps To Reproduce

#!/usr/bin/env ruby

begin
  require 'bundler/setup'
rescue LoadError => error
  abort error.message
end

require 'nokogiri'

doc = Nokogiri::HTML(
<<~HTML
  <html>
    <body>
      <p id="foo">Foo</p>
    </body>
  </html>
HTML
)

puts doc.to_s

Expected Results

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
    <p id="foo">Foo</p>
  </body>
</html>

Actual Results

<html><head></head><body>
    <p id="foo">Foo</p>
  
</body></html>

Unit Test

#! /usr/bin/env ruby

require 'nokogiri'
require 'minitest/autorun'

class Test < Minitest::Spec
  describe "Nokogiri::HTML::Document#to_s" do
    it "should properly convert the documents nodes back into HTML" do
      html = <<~HEREDOC
        <html>
          <body>
            <p id="foo">Foo</p>
          </body>
        </html>
      HEREDOC

      EXPECTED_HTML = <<~HEREDOC
        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
        <html>
          <body>
            <p id="foo">Foo</p>
          </body>
        </html>
      HEREDOC
      
      doc = Nokogiri::HTML::Document.parse(html)
      
      assert_equal EXPECTED_HTML, doc.to_s
    end
  end
end

Environment

jruby 9.4.3.0 (3.1.4) 2023-06-07 3086960792 OpenJDK 64-Bit Server VM 17.0.9+9 on 17.0.9+9 +jit [x86_64-linux]
nokogiri 1.16.0

# Nokogiri (1.16.0)
    ---
    warnings: []
    nokogiri:
      version: 1.16.0
    ruby:
      version: 3.1.4
      platform: java
      gem_platform: universal-java-17
      description: jruby 9.4.3.0 (3.1.4) 2023-06-07 3086960792 OpenJDK 64-Bit Server VM
        17.0.9+9 on 17.0.9+9 +jit [x86_64-linux]
      engine: jruby
      jruby: 9.4.3.0
    other_libraries:
      isorelax:isorelax: '20030108'
      net.sf.saxon:Saxon-HE: 9.6.0-4
      net.sourceforge.htmlunit:neko-htmlunit: 2.63.0
      nu.validator:jing: 20200702VNU
      org.nokogiri:nekodtd: 0.1.11.noko2
      xalan:serializer: 2.7.3
      xalan:xalan: 2.7.3
      xerces:xercesImpl: 2.12.2
      xml-apis:xml-apis: 1.4.01

The text was updated successfully, but these errors were encountered:

flavorjones · 2024-01-19T15:01:52Z

Hey @postmodern, thanks for opening this issue.

The short version is: yes, the various parsers used by Nokogiri have behavioral differences, particularly around serialization. While there are some options for controlling output, the parsers are just different. And we long ago decided to write this in the README as one of the project's guiding principles:

be a thin-as-reasonable layer on top of the underlying parsers, and don't attempt to fix behavioral differences between the parsers

The git history is littered with commits trying to get outputs to match, but if you feel strongly enough about it to investigate, and you find a way to fix it that isn't something gross, I'd be open to it.

But my caveat is that whitespace generally doesn't have semantic meaning, can I ask why this is important or disruptive for you?

postmodern · 2024-01-20T02:30:36Z

@flavorjones that whitespace shouldn't be there, and wasn't there previously, also the empty <head></head> that is being implicitly added is also a new artifact. Those don't seem right. This has caused me to have to add lots of if RUBY_ENGINE == 'jruby' checks to my specs to account for Nokogiri differences under JRuby.
https://github.com/ronin-rb/ronin-support-web/blob/59a70163f780c2a2d0040e9d2776edf5f5aea381/spec/html_spec.rb#L15-L32

If this problem keeps getting worse, I may have to remove JRuby from the CI, because I don't want to have to constantly adjust my specs for the quirks of the underlying Java XML/HTML parser.

flavorjones · 2024-01-21T19:45:38Z

@postmodern I understand you're frustrated. I have been asking for help with Nokogiri's JRuby java implementation for over 14 years (receipt) and I share your frustration that it doesn't work better and isn't more consistent with the C extension.

But the reality is that HTML4 parsers often behave differently from each other (I did a whole talk based on this fact at Rails World last year), and there's no easy way for Nokogiri to make the parsers behave the same. If you look at the test suites for Loofah and Rails::HTML::Sanitizer you'll see a few places where I've had to do if jruby branching because of parser differences.

When you say "wasn't there previously" I suspect this means "before Nokogiri switched to neko-htmlunit from the no-longer-supported (and insecure!) nekohtml in #2496 / v1.14.0.

Would you consider instead using an approach like assert_dom_equal from https://github.com/rails/rails-dom-testing which asserts on the equivalence of the DOM nodes (or a subtree thereof) rather than asserting on the bytes that are serialized by the parser? This is what I have always recommended because HTML serialization has always been fraught between the parsers.

You might also consider using pattern matching to assert on the key parts of the DOM structure (introduced in Nokogiri v1.14.0).

I'd also recommend, if you can, to use the HTML5 parser in Ronin (though, HTML5 support is not present in JRuby despite my requests for assistance which doesn't help your situation).

flavorjones · 2024-01-21T19:59:53Z

And it may be interesting to note that neko-htmlunit's output on the test you reference matches the output of libgumbo (Nokogiri::HTML5's parser) which is a bit of a signal that the library maintainers have been trying to migrate its parsing and serialization behavior closer to HTML5 which is standardized by the spec (in a way that HTML4 is not specified).

flavorjones · 2024-01-21T20:01:59Z

I'm happy to continue the conversation, but I'm going to mark the issue closed since there's nothing we can easily do in Nokogiri to address the issues you're seeing; and because of that difficulty, making the behavior of the JRuby impl exactly match the C impl is an anti-goal for the project.

postmodern added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Jan 18, 2024

flavorjones added state/will-close and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Jan 19, 2024

flavorjones closed this as not planned Won't fix, can't repro, duplicate, stale Jan 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug] Nokogiri::HTML::Document#to_s behaves oddly under JRuby #3103

[bug] Nokogiri::HTML::Document#to_s behaves oddly under JRuby #3103

postmodern commented Jan 18, 2024 •

edited

flavorjones commented Jan 19, 2024

postmodern commented Jan 20, 2024 •

edited

flavorjones commented Jan 21, 2024

flavorjones commented Jan 21, 2024

flavorjones commented Jan 21, 2024

[bug] Nokogiri::HTML::Document#to_s behaves oddly under JRuby #3103

[bug] Nokogiri::HTML::Document#to_s behaves oddly under JRuby #3103

Comments

postmodern commented Jan 18, 2024 • edited

Steps To Reproduce

Expected Results

Actual Results

Unit Test

Environment

flavorjones commented Jan 19, 2024

postmodern commented Jan 20, 2024 • edited

flavorjones commented Jan 21, 2024

flavorjones commented Jan 21, 2024

flavorjones commented Jan 21, 2024

postmodern commented Jan 18, 2024 •

edited

postmodern commented Jan 20, 2024 •

edited