Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Nokogiri::HTML::Document#to_s behaves oddly under JRuby #3103

Closed
postmodern opened this issue Jan 18, 2024 · 5 comments
Closed

[bug] Nokogiri::HTML::Document#to_s behaves oddly under JRuby #3103

postmodern opened this issue Jan 18, 2024 · 5 comments

Comments

@postmodern
Copy link

postmodern commented Jan 18, 2024

I understand that nokogiri uses a different XML/HTML parsing library for it's JRuby bindings than libxml2. I noticed recently that Nokogiri::HTML::Document#to_s produces strange output under JRuby. Note the actual output contains some unusual whitespace/indentation that shouldn't be there, and an empty <head></head> element.

Steps To Reproduce

#!/usr/bin/env ruby

begin
  require 'bundler/setup'
rescue LoadError => error
  abort error.message
end

require 'nokogiri'

doc = Nokogiri::HTML(
<<~HTML
  <html>
    <body>
      <p id="foo">Foo</p>
    </body>
  </html>
HTML
)

puts doc.to_s

Expected Results

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
    <p id="foo">Foo</p>
  </body>
</html>

Actual Results

<html><head></head><body>
    <p id="foo">Foo</p>
  
</body></html>

Unit Test

#! /usr/bin/env ruby

require 'nokogiri'
require 'minitest/autorun'

class Test < Minitest::Spec
  describe "Nokogiri::HTML::Document#to_s" do
    it "should properly convert the documents nodes back into HTML" do
      html = <<~HEREDOC
        <html>
          <body>
            <p id="foo">Foo</p>
          </body>
        </html>
      HEREDOC

      EXPECTED_HTML = <<~HEREDOC
        <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
        <html>
          <body>
            <p id="foo">Foo</p>
          </body>
        </html>
      HEREDOC
      
      doc = Nokogiri::HTML::Document.parse(html)
      
      assert_equal EXPECTED_HTML, doc.to_s
    end
  end
end

Environment

  • jruby 9.4.3.0 (3.1.4) 2023-06-07 3086960792 OpenJDK 64-Bit Server VM 17.0.9+9 on 17.0.9+9 +jit [x86_64-linux]
  • nokogiri 1.16.0
# Nokogiri (1.16.0)
    ---
    warnings: []
    nokogiri:
      version: 1.16.0
    ruby:
      version: 3.1.4
      platform: java
      gem_platform: universal-java-17
      description: jruby 9.4.3.0 (3.1.4) 2023-06-07 3086960792 OpenJDK 64-Bit Server VM
        17.0.9+9 on 17.0.9+9 +jit [x86_64-linux]
      engine: jruby
      jruby: 9.4.3.0
    other_libraries:
      isorelax:isorelax: '20030108'
      net.sf.saxon:Saxon-HE: 9.6.0-4
      net.sourceforge.htmlunit:neko-htmlunit: 2.63.0
      nu.validator:jing: 20200702VNU
      org.nokogiri:nekodtd: 0.1.11.noko2
      xalan:serializer: 2.7.3
      xalan:xalan: 2.7.3
      xerces:xercesImpl: 2.12.2
      xml-apis:xml-apis: 1.4.01
@postmodern postmodern added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Jan 18, 2024
@flavorjones
Copy link
Member

Hey @postmodern, thanks for opening this issue.

The short version is: yes, the various parsers used by Nokogiri have behavioral differences, particularly around serialization. While there are some options for controlling output, the parsers are just different. And we long ago decided to write this in the README as one of the project's guiding principles:

be a thin-as-reasonable layer on top of the underlying parsers, and don't attempt to fix behavioral differences between the parsers

The git history is littered with commits trying to get outputs to match, but if you feel strongly enough about it to investigate, and you find a way to fix it that isn't something gross, I'd be open to it.

But my caveat is that whitespace generally doesn't have semantic meaning, can I ask why this is important or disruptive for you?

@flavorjones flavorjones added state/will-close and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Jan 19, 2024
@postmodern
Copy link
Author

postmodern commented Jan 20, 2024

@flavorjones that whitespace shouldn't be there, and wasn't there previously, also the empty <head></head> that is being implicitly added is also a new artifact. Those don't seem right. This has caused me to have to add lots of if RUBY_ENGINE == 'jruby' checks to my specs to account for Nokogiri differences under JRuby.
https://github.com/ronin-rb/ronin-support-web/blob/59a70163f780c2a2d0040e9d2776edf5f5aea381/spec/html_spec.rb#L15-L32

If this problem keeps getting worse, I may have to remove JRuby from the CI, because I don't want to have to constantly adjust my specs for the quirks of the underlying Java XML/HTML parser.

@flavorjones
Copy link
Member

@postmodern I understand you're frustrated. I have been asking for help with Nokogiri's JRuby java implementation for over 14 years (receipt) and I share your frustration that it doesn't work better and isn't more consistent with the C extension.

But the reality is that HTML4 parsers often behave differently from each other (I did a whole talk based on this fact at Rails World last year), and there's no easy way for Nokogiri to make the parsers behave the same. If you look at the test suites for Loofah and Rails::HTML::Sanitizer you'll see a few places where I've had to do if jruby branching because of parser differences.

When you say "wasn't there previously" I suspect this means "before Nokogiri switched to neko-htmlunit from the no-longer-supported (and insecure!) nekohtml in #2496 / v1.14.0.

Would you consider instead using an approach like assert_dom_equal from https://github.com/rails/rails-dom-testing which asserts on the equivalence of the DOM nodes (or a subtree thereof) rather than asserting on the bytes that are serialized by the parser? This is what I have always recommended because HTML serialization has always been fraught between the parsers.

You might also consider using pattern matching to assert on the key parts of the DOM structure (introduced in Nokogiri v1.14.0).

I'd also recommend, if you can, to use the HTML5 parser in Ronin (though, HTML5 support is not present in JRuby despite my requests for assistance which doesn't help your situation).

@flavorjones
Copy link
Member

And it may be interesting to note that neko-htmlunit's output on the test you reference matches the output of libgumbo (Nokogiri::HTML5's parser) which is a bit of a signal that the library maintainers have been trying to migrate its parsing and serialization behavior closer to HTML5 which is standardized by the spec (in a way that HTML4 is not specified).

@flavorjones
Copy link
Member

I'm happy to continue the conversation, but I'm going to mark the issue closed since there's nothing we can easily do in Nokogiri to address the issues you're seeing; and because of that difficulty, making the behavior of the JRuby impl exactly match the C impl is an anti-goal for the project.

@flavorjones flavorjones closed this as not planned Won't fix, can't repro, duplicate, stale Jan 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants