[bug] After encountering 100 unknown XML entities, the SAX parser stops calling Nokogiri::XML::SAX::Document#error #3147
Comments
Thanks for reporting. I'll bisect and see when libxml2 introduced the behavior change and whether we can change it.
This behavior changed between Nokogiri v1.14.5 and v1.15.0, which means it's likely to have changed in libxml2 v2.11.x. Will now git bisect libxml2.
Upstream commit is https://gitlab.gnome.org/GNOME/libxml2/-/commit/59b33661784359c6d3a8309ddbd2129fb2688548:
@flavorjones yes, for sure. #1926 is the only reason a SAX parser would need to rely on error callbacks. (I do worry that if the fix to #1926 doesn't intercept these errors, documents like these will still hit the error limit, which isn't helpful if there's a bona fide parse error later in the document.) Also, although I mentioned "unknown" entities, IME the error callback still fires for entities that are declared in a document's DTD front matter, and it is still the only way to track them, if that's any help.
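For readers without the context of #1926, the error-callback approach referred to here can be sketched roughly as follows. This is a minimal illustration, not eiwa's actual code: the module name EntityErrorTracking and the stand-in DemoHandler class are hypothetical, and the "Entity '...' not defined" message pattern is an assumption about how libxml2 words the error rather than a documented contract.

```ruby
# Hypothetical mixin: collects entity names reported through a SAX-style
# error(msg) callback. The message format matched below is an assumption
# about libxml2's wording, not a documented contract.
module EntityErrorTracking
  UNDEFINED_ENTITY = /Entity '([^']+)' not defined/

  # Entity names seen so far, in the order they were reported.
  def reported_entities
    @reported_entities ||= []
  end

  # Same name and arity as Nokogiri::XML::SAX::Document#error.
  def error(msg)
    match = UNDEFINED_ENTITY.match(msg)
    reported_entities << match[1] if match
  end
end

# Stand-in handler so the sketch runs without Nokogiri installed.
class DemoHandler
  include EntityErrorTracking
end

handler = DemoHandler.new
handler.error("Entity 'adj-i' not defined\n")
handler.error("Opening and ending tag mismatch: a line 1 and b\n")
handler.error("Entity 'n' not defined\n")
p handler.reported_entities
```

With Nokogiri, the same module could be included in a subclass of Nokogiri::XML::SAX::Document and passed to Nokogiri::XML::SAX::Parser.new. Because the approach depends entirely on error being called, any cap on error callbacks silently truncates the list of reported entities.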
OK, we can pick this up on #1926, but I'm going to try to extend Aaron's fix there with Nick's advice and see if I can make it behave reasonably.
Ever since filing #1926 in 2019, my JMDict-parsing gem eiwa has been relying on the SAX document error(msg) callback to identify entities, like this:
Recently, after updating Nokogiri for the first time in several years, I started receiving a flurry of bug reports from users that the dictionary entries were wrong in confusing and complicated ways. I was able to track it down to the fact that after encountering 100 total (as opposed to distinct) unknown XML entities, the SAX parser stops calling the error callback, effectively causing the rest to be nil, since I have no other way to detect them.
Help us reproduce what you're seeing
Here's an XML file named repro.xml:
And here's a Ruby script that parses it with SAX:
When I run it, I get this output:
Environment