Add failing test case to keep entities in sax parsers #1500
Hi @tenderlove, I think I understand what this issue is all about. libxml2 provides the getEntity callback, but Nokogiri doesn't expose it. If it were exposed, I could override the default behavior to simply return the raw data, so that it would be passed to the characters callback.

Here's what Nokogiri currently supports: https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_sax_parser.c#L266-L278

Here's the missing callback function: https://github.com/GNOME/libxml2/blob/master/include/libxml/parser.h#L725

This article explains how getEntity works (see "The getEntity Callback" section): http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html#entities
The default implementation seems to check for either

Maybe it would be simpler if Nokogiri simply made the getEntity callback available to the SAX parser, so that we could override the default behavior and return the raw string. I've only searched through the sources and documentation, and I don't actually know how to implement this, so maybe what I'm saying is nonsense?
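To make the idea above concrete, here is a minimal, dependency-free sketch of an overridable entity lookup feeding a `characters` callback. Everything in it (`ToySaxTextScanner`, `Collector`, `PREDEFINED`, the block passed to `new`) is hypothetical and made up for illustration; it is not Nokogiri's or libxml2's API, and it only handles named entities in plain text.

```ruby
# Hypothetical sketch only: NOT Nokogiri's or libxml2's API.
# It mimics the idea of a getEntity-style hook that decides what text
# reaches the characters callback for each entity reference.

# The five entities predefined by XML.
PREDEFINED = {
  "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"', "apos" => "'"
}.freeze

class ToySaxTextScanner
  # handler must respond to #characters(string).
  # The optional block plays the role of an overridden entity lookup:
  # it receives the entity name and returns the text to emit.
  def initialize(handler, &get_entity)
    @handler = handler
    # Default lookup: substitute predefined entities, keep unknown ones raw.
    @get_entity = get_entity || ->(name) { PREDEFINED.fetch(name) { "&#{name};" } }
  end

  # Splits text into plain runs and entity references, and forwards
  # each piece to the handler's characters callback.
  def scan(text)
    text.split(/(&\w+;)/).each do |chunk|
      next if chunk.empty?
      if (m = chunk.match(/\A&(\w+);\z/))
        @handler.characters(@get_entity.call(m[1]))
      else
        @handler.characters(chunk)
      end
    end
  end
end

# Minimal handler that records everything passed to characters.
class Collector
  attr_reader :chunks

  def initialize
    @chunks = []
  end

  def characters(string)
    @chunks << string
  end
end

# Default behavior: entities are substituted.
substituted = Collector.new
ToySaxTextScanner.new(substituted).scan("a &amp; b")

# Overridden lookup: the raw reference is passed through untouched.
raw = Collector.new
ToySaxTextScanner.new(raw) { |name| "&#{name};" }.scan("a &amp; b")
```

With the default lookup, the joined chunks read `a & b`; with the overriding block they read `a &amp; b`, which is the "keep entities" behavior this test case asks for.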
see #1500 for background

libxml (MRI):
* the callback gets invoked properly,
* but cannot be controlled by `replace_entities`.
* need to look into why replace_entities isn't working right. libxml's docs are pretty thin.

xerces (JRuby):
* the callback gets invoked _sometimes_,
* and cannot be controlled by `replace_entities`.
* need to look into how to control it; and why the callback isn't invoked the same way that libxml2 invokes its handler.
@rosenfeld I've made an attempt to get this to work; unfortunately I can't quite get it to work correctly with either libxml2 or xerces. Feature branch is

Any thoughts? If you've got bandwidth, I'd love for you to look into finishing this work.
Hi @flavorjones. I do have bandwidth; that is not the problem. Time is the problem. I currently don't have time to work on this in my off hours, since I have two little girls who demand all my attention when I'm not working. I'd love to spend some time during work hours investigating this issue and trying to get it working in Nokogiri, but to be honest, there are currently so many items in our backlog that I suspect it will be a long time before I'm able to work on this.

All of this is new to me, which means it will take me several days at best to understand the pieces. I have never used libxml2 before, I have little knowledge of XML in general, and I have never worked on C-extension gems. After checking the documentation for libxml2, it seems there's not much there regarding entity substitution in SAX parsers either. That means I'll need some time to read through the libxml2 source to understand how it works, write a simple C-only program to try the replaceEntities option (which may force me to understand libxml2's internals), learn how Nokogiri uses libxml2, and then finally debug what is needed to get it to work as expected.

All of this sounds like very interesting work, to be honest, and I'd love to do it. But I don't think I'll have that much free time during work hours for a few months at least. I spent the past few hours glancing over the libxml2 docs and source, the Nokogiri sources, and some other XML SAX parsers for Ruby, and I now have a better idea of how much time I'll need to focus on this.

I just wanted you to know that I really appreciate your effort on this feature and that I do intend to help with it, but I also wanted to warn you that it can take a long time. It doesn't mean I've forgotten about this issue; that is all I wanted you to understand. Thank you very much for this branch. I hope I can find some time to finish this work some day.
I've rebased my branch and re-pushed it to
See related #1926
Could you please help me get this test to pass? cc/ @tenderlove