[bug] extra </span> inserted after Nokogiri::HTML.parse #2796

Closed
seanstory opened this issue Feb 21, 2023 · 2 comments

Comments

@seanstory

Please describe the bug

I'm attempting to parse HTML content from a site I do not control, specifically https://2e.aonprd.com.
I'm taking the raw HTML and attempting to extract the text content, excluding common header and footer text, for a search use case. My plan was to do something like:

html_content = get_html_content(url)
parsed_data = Nokogiri::HTML.parse(html_content)
text_i_care_about = parsed_data.at_css('[id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput"]').text # forgive the long selector

However, I've noticed that for some pages this returns an empty string ("").

When I go to browser dev tools and access text by the selector with $$("#ctl00_RadDrawer1_Content_MainContent_DetailedOutput").map(e=>e.textContent), I get the text I'm expecting, so it's not an issue with having gotten the selector wrong.

When I step through with irb, I can see that an extra </span> is being inserted right before the text content, so the document returned by Nokogiri::HTML.parse closes my identified element early. I'll attach the raw HTML for an example page, https://2e.aonprd.com/(X(1)S(jjv5qg45qaziuq55lopb3o45))/Classes.aspx?ID=1

html.zip

Help us reproduce what you're seeing

#! /usr/bin/env ruby

require 'nokogiri'

content = File.read('raw.html')
parsed_data = Nokogiri::HTML.parse(content)
body_content = parsed_data.at_css('[id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput"]')
puts body_content.text # prints the empty string
# Scroll up in the output below: right after `ctl00_RadDrawer1_Content_MainContent_DetailedOutput`
# there's loads of text, but a new, erroneous </span> has been added right before it.
puts parsed_data

Expected behavior

Nokogiri shouldn't add extra closing tags

Environment
OSX 13.2.1
Platform: arm64-darwin
reproduced in:

  • JRuby 9.3.3.0, Nokogiri 1.13.10
  • MRI Ruby 2.6.9, Nokogiri 1.13.4
  • MRI Ruby 2.7.7, Nokogiri 1.14.2
# Nokogiri (1.14.2)
    ---
    warnings: []
    nokogiri:
      version: 1.14.2
      cppflags:
      - "-I/opt/homebrew/Cellar/rbenv/1.2.0/versions/2.7.7/lib/ruby/gems/2.7.0/gems/nokogiri-1.14.2-arm64-darwin/ext/nokogiri"
      - "-I/opt/homebrew/Cellar/rbenv/1.2.0/versions/2.7.7/lib/ruby/gems/2.7.0/gems/nokogiri-1.14.2-arm64-darwin/ext/nokogiri/include"
      - "-I/opt/homebrew/Cellar/rbenv/1.2.0/versions/2.7.7/lib/ruby/gems/2.7.0/gems/nokogiri-1.14.2-arm64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.7.7
      platform: arm64-darwin22
      gem_platform: arm64-darwin-22
      description: ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [arm64-darwin22]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - '0009-allow-wildcard-namespaces.patch'
      libxml2_path: "/opt/homebrew/Cellar/rbenv/1.2.0/versions/2.7.7/lib/ruby/gems/2.7.0/gems/nokogiri-1.14.2-arm64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.10.3
      loaded: 2.10.3
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      datetime_enabled: true
      compiled: 1.1.37
      loaded: 1.1.37
    other_libraries:
      zlib: 1.2.13
      libiconv: '1.17'
      libgumbo: 1.0.0-nokogiri
@seanstory seanstory added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Feb 21, 2023
@flavorjones
Member

flavorjones commented Feb 22, 2023

@seanstory Sorry you're having a problem. I'll try to explain what's going on here. In summary, the HTML you're parsing is not well-formed, and so parsers will try to "fix it up".

Notably, HTML4 does not have a specification for how this "fixing up" should be done, so each parser may do something different. HTML5 does have a "fix up" spec, so if you want to match modern browser behavior you should use Nokogiri::HTML5 and not Nokogiri::HTML.

Here's the start of the markup from raw.html that you're trying to operate on:

<span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput"><h1 class="title"><a href ="PFS.aspx"><span style="float:left;"><img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png"></a></span>Alchemist</h1>...

Let me format that better so you can see the structure more clearly:

  <html>
    <body>
      <span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
        <h1 class="title">
          <a href="PFS.aspx">
            <span style="float:left;">
              <img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png">
            </a>
          </span>
          Alchemist
        </h1>
    </body>
  </html>

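You should be able to see pretty clearly that the opening and closing tags are mismatched: the </a> arrives while the inner span is still open. When the parser sees the closing </a> tag, it auto-closes any elements still open inside that a element, which includes the inner span. Later, when it sees the closing </span> tag, it matches it against the only span still open, the outer one with your id, and auto-closes the h1 along with it. That is why the text that follows ends up outside the element you selected. Finally, when it sees the </h1> tag, it can't find a matching opening tag and drops it.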
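The recovery steps just described can be modelled in a few lines of plain Ruby. This is only a toy sketch, not libxml2's actual algorithm: `Node`, `toy_parse`, and the token format are hypothetical names invented for illustration, and the single rule assumed is "on an end tag, pop the stack of open elements back to the nearest element with that name, or drop the tag if there is none".

```ruby
# Toy model of "pop back to the matching open tag" recovery.
# NOT libxml2's real algorithm; it only mimics the behavior described above.
Node = Struct.new(:name, :id, :children) do
  # Concatenate all text found underneath this node.
  def text
    children.map { |child| child.is_a?(String) ? child : child.text }.join
  end
end

VOID_TAGS = %w[img br hr].freeze # elements that never get an end tag

# tokens is an array of [:open, name, id], [:text, string] or [:close, name].
def toy_parse(tokens)
  root = Node.new("body", nil, [])
  stack = [root] # currently open elements, innermost last
  errors = []

  tokens.each do |kind, value, id|
    case kind
    when :open
      node = Node.new(value, id, [])
      stack.last.children << node
      stack.push(node) unless VOID_TAGS.include?(value)
    when :text
      stack.last.children << value
    when :close
      # Nearest open element with this name (the root is never closed).
      index = stack.rindex { |n| n.name == value && !n.equal?(root) }
      if index
        stack.pop(stack.length - index) # auto-closes everything nested in it
      else
        errors << "Unexpected end tag : #{value}"
      end
    end
  end

  [root, errors]
end

tokens = [
  [:open, "span", "ctl00_RadDrawer1_Content_MainContent_DetailedOutput"],
  [:open, "h1"],
  [:open, "a"],
  [:open, "span"],
  [:open, "img"],
  [:close, "a"],    # also closes the inner span
  [:close, "span"], # matches the OUTER span, closing the h1 on the way
  [:text, "Alchemist"],
  [:close, "h1"]    # nothing left to match, so it is dropped
]

root, errors = toy_parse(tokens)
outer_span = root.children.first

p outer_span.text # the selected element has no text
p root.text       # the text landed directly under body
p errors
```

In this model the outer span ends up with empty text and "Alchemist" becomes a sibling of it, which is the same shape as the empty-string result reported in the issue.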

Here is some working code that demonstrates this with Nokogiri itself:
#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
end

html = <<~HTML
  <html>
    <body>
      <span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
        <h1 class="title">
          <a href="PFS.aspx">
            <span style="float:left;">
              <img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png">
            </a>
          </span>
          Alchemist
        </h1>
    </body>
  </html>
HTML

doc = Nokogiri::HTML4::Document.parse(html)

doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
#    "<html>\n" +
#    "  <body>\n" +
#    "    <span id=\"ctl00_RadDrawer1_Content_MainContent_DetailedOutput\">\n" +
#    "      <h1 class=\"title\">\n" +
#    "        <a href=\"PFS.aspx\">\n" +
#    "          <span style=\"float:left;\">\n" +
#    "            <img alt=\"PFS Standard\" title=\"PFS Standard\" style=\"height:25px; padding:2px 10px 0px 2px\" src=\"ImagesIconsPFS_Standard.png\">\n" +
#    "          </span></a>\n" +
#    "        </h1></span>\n" +
#    "        Alchemist\n" +
#    "      \n" +
#    "  </body>\n" +
#    "</html>\n"

doc.errors
# => [#<Nokogiri::XML::SyntaxError: 11:12: ERROR: Unexpected end tag : h1>]

So the final, corrected markup will look like:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
    <span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
      <h1 class="title">
        <a href="PFS.aspx">
          <span style="float:left;">
            <img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="ImagesIconsPFS_Standard.png">
          </span></a>
        </h1></span>
        Alchemist
      
  </body>
</html>
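One detail in the output above is most likely unrelated to the parser: the src attribute comes out as "ImagesIconsPFS_Standard.png" with its backslashes missing. The demo builds its input with an unquoted heredoc, which Ruby processes like a double-quoted string, so `\I` and `\P` are reduced to `I` and `P` before Nokogiri ever sees the markup. A minimal sketch of the difference (the constant names are made up for illustration):

```ruby
# Unquoted heredoc: backslash escapes are processed, as in a double-quoted string.
INTERPOLATED = <<~HTML
  <img src="Images\Icons\PFS_Standard.png">
HTML

# Single-quoted heredoc delimiter: the body is taken literally.
LITERAL = <<~'HTML'
  <img src="Images\Icons\PFS_Standard.png">
HTML

puts INTERPOLATED # backslashes are gone
puts LITERAL      # backslashes are preserved
```

Reading the markup from a file, as in the original report, is not affected by this; it only matters when HTML is pasted into Ruby source.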

But note that libgumbo (Nokogiri::HTML5 on CRuby) corrects this differently, and possibly in the same way your browser fixes it up!

Here is more code demonstrating the HTML5 behavior:
#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
end

html = <<~HTML
  <html>
    <body>
      <span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
        <h1 class="title">
          <a href="PFS.aspx">
            <span style="float:left;">
              <img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png">
            </a>
          </span>
          Alchemist
        </h1>
    </body>
  </html>
HTML

doc = Nokogiri::HTML5::Document.parse(html, max_errors: 10)

doc.to_html
# => "<html><head></head><body>\n" +
#    "    <span id=\"ctl00_RadDrawer1_Content_MainContent_DetailedOutput\">\n" +
#    "      <h1 class=\"title\">\n" +
#    "        <a href=\"PFS.aspx\">\n" +
#    "          <span style=\"float:left;\">\n" +
#    "            <img alt=\"PFS Standard\" title=\"PFS Standard\" style=\"height:25px; padding:2px 10px 0px 2px\" src=\"ImagesIconsPFS_Standard.png\">\n" +
#    "          </span></a>\n" +
#    "        \n" +
#    "        Alchemist\n" +
#    "      </h1>\n" +
#    "  \n" +
#    "\n" +
#    "</span></body></html>"

doc.errors
# => [#<Nokogiri::XML::SyntaxError:"1:1: ERROR: Expected a doctype token\n<html>\n^">,
#     #<Nokogiri::XML::SyntaxError:"8:11: ERROR: That tag isn't allowed here  Currently open tags: html, body, span, h1, a, span.\n          </a>\n          ^">,
#     #<Nokogiri::XML::SyntaxError:"9:9: ERROR: That tag isn't allowed here  Currently open tags: html, body, span, h1.\n        </span>\n        ^">,
#     #<Nokogiri::XML::SyntaxError:"12:3: ERROR: That tag isn't allowed here  Currently open tags: html, body, span.\n  </body>\n  ^">]

And the parsed HTML5 DOM looks like:

<html><head></head><body>
    <span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
      <h1 class="title">
        <a href="PFS.aspx">
          <span style="float:left;">
            <img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="ImagesIconsPFS_Standard.png">
          </span></a>
        
        Alchemist
      </h1>
  

</span></body></html>
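Why the HTML5 result keeps "Alchemist" inside the h1 can be sketched with a small change to the end-tag rule. In the HTML5 parsing spec, a stray end tag such as </span> is ignored if the search for its open element runs into a "special" element (headings are in that category) before finding a match. The following is a simplified, hypothetical model of that one rule in plain Ruby, not Gumbo's implementation: `Element`, `toy_html5_parse`, and the shortened `SPECIAL` list are assumptions for illustration, and the real algorithm handles </a> and scoping with considerably more machinery.

```ruby
# Toy model of the HTML5 "ignore the end tag if a special element is in the way" rule.
# NOT Gumbo's real algorithm; formatting elements and scopes are left out.
Element = Struct.new(:name, :children) do
  # Concatenate all text found underneath this element.
  def text
    children.map { |child| child.is_a?(String) ? child : child.text }.join
  end
end

VOID = %w[img br hr].freeze
# A few members of the spec's "special" category; the real list is much longer.
SPECIAL = %w[h1 h2 h3 h4 h5 h6 p div li table].freeze

# Index of the open element an end tag should close, or nil to ignore the tag.
def matching_index(stack, name)
  (stack.length - 1).downto(1) do |i|
    return i if stack[i].name == name
    return nil if SPECIAL.include?(stack[i].name) # blocked: ignore the end tag
  end
  nil
end

# tokens is an array of [:open, name], [:text, string] or [:close, name].
def toy_html5_parse(tokens)
  root = Element.new("body", [])
  stack = [root]
  ignored = []

  tokens.each do |kind, value|
    case kind
    when :open
      element = Element.new(value, [])
      stack.last.children << element
      stack.push(element) unless VOID.include?(value)
    when :text
      stack.last.children << value
    when :close
      index = matching_index(stack, value)
      if index
        stack.pop(stack.length - index)
      else
        ignored << value
      end
    end
  end

  [root, ignored]
end

root, ignored = toy_html5_parse([
  [:open, "span"], # the outer span with the long id
  [:open, "h1"],
  [:open, "a"],
  [:open, "span"],
  [:open, "img"],
  [:close, "a"],    # closes the inner span and the a
  [:close, "span"], # blocked by the open h1, so it is ignored
  [:text, "Alchemist"],
  [:close, "h1"]
])

p root.children.first.text # the outer span still contains the heading text
p ignored
```

Because the stray </span> is ignored rather than matched against the outer span, the outer span stays open and keeps its text, which is consistent with the HTML5 DOM shown above.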

I hope all this makes sense! What questions do you have for me?

@flavorjones flavorjones added meta/user-help and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Feb 22, 2023
@seanstory
Author

@flavorjones thanks for responding so fast! This explanation makes sense, thank you so much for the help. I'm bummed that this solution isn't available for JRuby, but I see there's an open issue for that, so maybe one day. 🤞
