Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] After encountering 100 unknown XML entities, the SAX parser stops calling Nokogiri::XML::SAX::Document#error #3147

Open
searls opened this issue Mar 12, 2024 · 6 comments
Labels
state/needs-triage Inbox for non-installation-related bug reports or help requests

Comments

@searls
Copy link

searls commented Mar 12, 2024

Ever since filing #1926 in 2019, my JMDict-parsing gem eiwa has been relying on the SAX document error(msg) callback to identify entities, like this:

class Doc < Nokogiri::XML::SAX::Document
  # …
  def error(msg)
    if (matches = msg.match(/Entity '(\S+)' not defined/))
      # See: http://github.com/sparklemotion/nokogiri/issues/1926
      code = matches[1]
      @current.entity_code = code
    else
      raise Eiwa::Error.new("Parsing error: #{msg}")
    end
  end
end

Recently, after updating Nokogiri for the first time in several years, I started receiving a flurry of bug reports from users that the dictionary entries were wrong in confusing/complicated ways. I was able to track it down to the fact that after encountering 100 total (as opposed to distinct) unknown XML entities, the SAX parser will stop calling the error callback, effectively causing the rest to be nil, since I have no other way to detect them.

Help us reproduce what you're seeing

Here's an XML file named repro.xml:

<?xml version="1.0" encoding="UTF-8"?>
<elements>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <entity>&blah;</entity>
  <!-- ^ 100 entities ^ -->
  <entity>&blah;</entity>
  <!-- ^ 🫥 ^ -->
</elements>

And here's a Ruby script that parses it with SAX:

require "nokogiri"

class Tag
  attr_reader :tag_name, :parent
  attr_accessor :entity_code

  def start(tag_name, parent)
    @tag_name = tag_name
    @parent = parent
  end
end

class Doc < Nokogiri::XML::SAX::Document
  attr_reader :elements

  def initialize
    @elements = []
    @current = nil
  end

  def start_document
  end

  def end_document
  end

  def start_element(name, _attrs)
    parent = @current
    @current = Tag.new
    @current.start(name, parent)
  end

  def end_element(name)
    raise "Parsing error. Expected <#{@current.tag_name}> to close before <#{name}>" if @current.tag_name != name
    ending = @current
    if ending.tag_name == "entity"
      @elements << ending
    end

    @current = ending.parent
  end

  def error(msg)
    if (matches = msg.match(/Entity '(\S+)' not defined/))
      # See: http://github.com/sparklemotion/nokogiri/issues/1926
      code = matches[1]
      @current.entity_code = code
    else
      raise Eiwa::Error.new("Parsing error: #{msg}")
    end
  end
end

doc = Doc.new.tap do |doc|
  Nokogiri::XML::SAX::Parser.new(doc).parse_file("repro.xml") do |ctx|
    ctx.recovery = true
  end
end

entity_codes = doc.elements.map(&:entity_code)

if entity_codes.all? { |code| code == "blah" }
  puts "Working! All #{entity_codes.size} entity codes are #{entity_codes.first.inspect}."
elsif entity_codes.any?(&:nil?)
  warn <<~MSG
    Reproduced! Nil entity codes were found.

    Nil entities: #{entity_codes.count(&:nil?)}
    Non-nil entities: #{entity_codes.count { |e| !e.nil? }}
  MSG
  exit 1
end

When I run it, I get this output:

$ ruby repro.rb 
Reproduced! Nil entity codes were found.

Nil entities: 1
Non-nil entities: 100

Environment

@searls searls added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Mar 12, 2024
@searls searls changed the title [bug] After 100 unknown XML entities are encountered by the SAX parser, it stops calling Nokogiri::XML::SAX::Document#error [bug] After encountering 100 unknown XML entities, the SAX parser stops calling Nokogiri::XML::SAX::Document#error Mar 12, 2024
@flavorjones
Copy link
Member

Thanks for reporting. I'll bisect and see when libxml2 introduced the behavior change and whether we can change it.

@flavorjones
Copy link
Member

This behavior changed between Nokogiri v1.14.5 and v1.15.0, which means it's likely to have changed in libxml v2.11.x. Will now git bisect libxml2.

@flavorjones
Copy link
Member

Upstream commit is https://gitlab.gnome.org/GNOME/libxml2/-/commit/59b33661784359c6d3a8309ddbd2129fb2688548:

commit 59b33661784359c6d3a8309ddbd2129fb2688548
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   Tue Dec 27 14:15:51 2022 +0100

    error: Limit number of parser errors
    
    Reporting errors is expensive and some abusive test cases can generate
    an error for each invalid input byte. This causes the parser to spend
    most of the time with error handling. Limit the number of errors and
    warnings to 100.

@flavorjones
Copy link
Member

@searls is it fair to say that if I fixed #1926 a) that would be preferable, and b) we could close this issue?

@searls
Copy link
Author

searls commented Mar 12, 2024

@flavorjones yes, for sure. #1926 is the only reason a SAX parser would need to rely on error callbacks

(I do worry if the fix to #1926 doesn't intercept these errors, documents like these will still hit the error limit, which isn't helpful if there's a bona fide parse error later in the document)

Also, despite my mentioning "unknown" entities, IME the error callback still fires and is still the only way to track entities that are also declared in DTD frontmatter in a document, if that's any help

@flavorjones
Copy link
Member

OK we can pick this up on #1926 but I'm going to try to extend Aaron's fix there with Nick's advice and see if I can make it behave reasonably.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
state/needs-triage Inbox for non-installation-related bug reports or help requests
Projects
None yet
Development

No branches or pull requests

2 participants