Add failing test case to keep entities in sax parsers #1500

Open
wants to merge 1 commit into
base: main
from

Conversation

rosenfeld
Contributor

Could you please help me get this test to pass? cc/ @tenderlove

@rosenfeld
Contributor Author

Hi @tenderlove, I think I understand what this issue is about. libxml2 provides the getEntity callback, but Nokogiri doesn't expose it. If it were exposed, I could override the default behavior to simply return the raw data so that it would be passed to the characters callback. Here's what Nokogiri currently supports:

https://github.com/sparklemotion/nokogiri/blob/master/ext/nokogiri/xml_sax_parser.c#L266-L278

Here's the missing callback function:

https://github.com/GNOME/libxml2/blob/master/include/libxml/parser.h#L725

This article explains how getEntity works (see "The getEntity Callback" section):

http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html#entities

You may have been wondering how entities (eg &lt;, etc) are handled by the SAX interface. This is done by the getEntity callback... After calling getEntity, the expansion of the entity is passed to the characters callback. This way, you do not need to worry about decoding entities anywhere else in your callback routines.

The default implementation seems to check for either ctxt->validate or ctxt->replaceEntities, so maybe setting replaceEntities to true is not enough if ctxt->validate is true.

Maybe it would be simpler if Nokogiri made the getEntity callback available to the SAX parser, so that we could override the default behavior to simply return the raw string. I've only searched through the sources and documentation and don't actually know how to implement this, so maybe what I'm saying is nonsense?
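
To make the idea above concrete, here is a minimal toy model in plain Ruby of the flow the article describes: the parser asks a getEntity-style hook for the expansion of each entity reference and passes the result to the characters callback. This is only a sketch under stated assumptions, not libxml2 or Nokogiri code. The names `PREDEFINED`, `ToyHandler`, `RawEntityHandler`, `get_entity` and `toy_parse` are all hypothetical, and the model simplifies the hook to return a string, whereas libxml2's real getEntity callback returns an entity structure.

```ruby
# Toy model only: this is NOT libxml2 or Nokogiri code.
# Every name below is hypothetical and exists just for illustration.
PREDEFINED = { "lt" => "<", "gt" => ">", "amp" => "&", "quot" => '"', "apos" => "'" }.freeze

class ToyHandler
  attr_reader :chunks

  def initialize
    @chunks = []
  end

  # Default hook: return the expansion of a known entity,
  # or the raw reference when the entity is unknown.
  def get_entity(name)
    PREDEFINED.fetch(name) { "&#{name};" }
  end

  # Receives plain text as well as whatever get_entity returned.
  def characters(text)
    @chunks << text
  end
end

# Overrides the hook so the raw reference reaches `characters` unchanged.
class RawEntityHandler < ToyHandler
  def get_entity(name)
    "&#{name};"
  end
end

def toy_parse(text, handler)
  # The capture group makes split keep entity references as separate tokens.
  text.split(/(&\w+;)/).each do |token|
    next if token.empty?

    if (m = token.match(/\A&(\w+);\z/))
      handler.characters(handler.get_entity(m[1]))
    else
      handler.characters(token)
    end
  end
  handler.chunks.join
end

expanded = toy_parse("a &lt; b &amp;&amp; c", ToyHandler.new)
raw      = toy_parse("a &lt; b &amp;&amp; c", RawEntityHandler.new)
```

With the default hook the handler sees the expanded text; with the overriding subclass it sees the entity references exactly as written, which is the behavior this pull request is asking for.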

flavorjones added a commit that referenced this pull request Jan 13, 2017
see #1500 for background

libxml (MRI):

* the callback gets invoked properly,
* but cannot be controlled by `replace_entities`.
* need to look into why replace_entities isn't working right. libxml's
  docs are pretty thin.

xerces (JRuby):

* the callback gets invoked _sometimes_,
* and cannot be controlled by `replace_entities`.
* need to look into how to control it; and why the callback isn't
  invoked the same way that libxml2 invokes its handler.
@flavorjones
Member

flavorjones commented Jan 13, 2017

@rosenfeld I've made an attempt to get this to work; unfortunately I can't quite get it to work correctly with either libxml2 or xerces.

The feature branch is `flavorjones-sax-parser-replace-entities`.

Any thoughts? If you've got bandwidth, I'd love for you to look into finishing this work.

@rosenfeld
Contributor Author

Hi @flavorjones. I do have bandwidth; that is not the problem. Time is the problem. I currently don't have time to work on this in my off hours, since I have two little girls who demand all my attention when I'm not working. I'd love to spend some time at work investigating this issue and trying to get it working in Nokogiri, but to be honest, there are currently so many items in our backlog that I suspect it will be a long time before I'm able to work on this.

All of this is new to me, which means it will take me several days at best to understand the pieces. I have never used libxml2 before, I have little knowledge of XML in general, I have never worked on C-extension gems, and after checking the libxml2 documentation it seems there's not much there about entity substitution in SAX parsers either. That means I'll need some time to read through the libxml2 source to understand how it works, write a simple C-only program to try the replaceEntities option (which may force me to understand libxml2's internals), learn how Nokogiri uses libxml2, and then finally debug what is needed to get it to work as expected. To be honest, all of this sounds like very interesting work and I'd love to do it. But I don't think I'll have that much free time at work for a few months at least... I spent the past few hours glancing over the libxml2 docs and source, the Nokogiri sources, and some other XML SAX parsers for Ruby, and I think I now have a better idea of how much time I'll need to focus on this.

I just wanted you to know that I really appreciate your effort on this feature and that I do intend to help with that, but I just wanted to warn you that it can take a long time. It doesn't mean I forgot about this issue, this is all I wanted you to understand.

Thank you very much for this branch, I hope I can find some time to finish this work some day.

Base automatically changed from master to main January 17, 2021 21:52
flavorjones added a commit that referenced this pull request Aug 20, 2021
@flavorjones
Member

I've rebased my branch and re-pushed it to `1500-sax-parser-replace-entities` for potential further investigation.

flavorjones added a commit that referenced this pull request Sep 24, 2021
@flavorjones
Member

See related #1926
