New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse internal entity declarations in internal DTD #367
Conversation
@niklasl thank you for the initiative, since this is still a draft PR I will not look into the details yet, but will approve the workflow runs. I agree that we need some tests for this stuff, if you need help to decide what to test or where to put them, just let me know. Are you aware of the issue regarding multiline doctype not being supported? If it does we might also be able to drop this workaround from our tests: xmldom/test/xmltest/valid.test.js Lines 17 to 19 in c3600ca
Don't be afraid if there are failing tests, this can also happen when we are improving our support of the XML spec. For a "final"/"reviewable" part of the PR I would prefer if there are no new TODO comments in the code and any new methods are properly documented (e.g. by copying method descriptions and links from the related specs). Please be also aware of #338 which separates XML from HTML parsing and might already have conflicting changes. But if your's is ready faster, I will be able to integrate your changes into that branch. And I'm not really actively working on that branch right now. So you chances are good. |
Another aspect that just occured to me: How relevant is what this produces when it is serialized?
Not sure how relevant those things are and I think we should again check how brosers are dealing with it. |
Co-authored-by: Christian Bewernitz <coder@karfau.de>
Thank you @karfau! I'll take a look at finalizing this. It's great to have your eyes on it as it stands though; so feel free to add more suggestions even in draft state. I'd like to make a small enough change for just the internal entity subset, but of course a consistent and thorough one at that. I'm not fully aware of the multiline doctype issue, but I suspect it's related to the TODO about adding a From my perspective, expanding the entities upon parsing and thus not keeping and serializing the entity declarations is fine. Good question about the doctype though, need to check how these changes affect that. (And yes, compare the behaviour with browser implementations.) |
It may have been altered by XML entity expansion.
So, throwing some XML at The serializer writes out the empty DTD though, which is valid but pointless I guess. We could add a check in that portion for empty publicID and systemID and just omit it, FWIW, either at parse or write time. Any opinions here? I'd suggest doing a more comprehensive DTD handler as a more major task. That might entail rewriting the parser a bit, as I believe e.g. extracting a (This fix only deals with simple string values for entities anyway, mostly because subparsing the string values as XML fragments is complex, not the least due to the XML Bomb risk if done carelessly. Writing a warning about that in the code makes sense too.) That'd make editing |
Expanding all entities and writing out an empty DTD is precisely what both Firefox and Chromium do too. Thus I'll keep this behaviour and adapt the tests to reflect this. |
This reflects that the internal XML entities are now parsed and expanded, and that an empty DOCTYPE is output. Although the xmltest specs (from James Clark) do not contain these empty DOCTYPE declarations, apparently Firefox and Chromium-based web browsers do.
These replace the TODO:s about adding a parseInternalDTD method (as that would require a more complex change).
This reflects the now added somewhat lax internal DOCTYPE processing, which now yields documents even with certain non-well-formed DOCTYPE:s.
Well this definitely sounds like an "attack vector"(?), there is nothing preventing somebody from putting an XML string into an entity (and I would guess that there is even an example for it in the xmltest folder/snapshots, so we should also have a test for that even if it just shows that everything is just properly escaped to not introduce new xml nodes this way. (If this is of course the specified behavior of what should happen, we need to document that this is currently not supported...) |
@@ -243,6 +248,11 @@ DOMHandler.prototype = { | |||
this.doc.doctype = dt; | |||
} | |||
}, | |||
|
|||
internalEntityDecl:function(name, value) { | |||
this.entityMap[name] = value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking out loud without checking any spec:
- Is it reasonable to allow overriding existing declarations?
The Name identifies the entity in an entity reference or, in the case of an unparsed entity, in the value of an ENTITY or ENTITIES attribute. If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor may issue a warning if entities are declared multiple times.
https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-entity-decl
- What about an entity with the name
prototype
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question and spec reference! Since the default entityMap
s defined in lib/entities.js
are frozen (using Object.freeze
from lib/conventions.js
), this kind of works right now; but only the predefined entities, not if repeating a declaration.
So I added an explicit check and redeclaration warning in 311ffc3.
The prototype is defined on the type, not in the prototype, so an undefined 'prototype'
key returns undefined
(and a &prototype;
entity can be defined in an entity declaration).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the prototypical inheritance is an old manoeuvre which ought to be efficient. But at least moving it out to a utility could be wise. I did consider using Object.assign
here, but AFAIK that is an ES6 feature. (If a future overhaul decides to use e.g. hasOwnProperty
tests, or replace the entityMap with an ES6 Map
here, this would of course have to be reworked.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the default entityMaps defined in lib/entities.js are frozen (using Object.freeze from lib/conventions.js)
Please be aware it will not be frozen in a runtime where the Object.freeze
method is not available. So thank you for adding that check.
I think it would be good to add a test for that, and also the prototype entity, just to be sure it doesn't break in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the prototypical inheritance is an old manoeuvre which ought to be efficient. But at least moving it out to a utility could be wise. I did consider using
Object.assign
here, but AFAIK that is an ES6 feature. (If a future overhaul decides to use e.g.hasOwnProperty
tests, or replace the entityMap with an ES6Map
here, this would of course have to be reworked.)
I just played with this approach a bit and after checking how the entityMap
is currently used to replace entities filed the bug #370.
So once that bug is fixed your current solution no longer works.
I assume we will not get around copying the entities. (I wouldn't want to touch the sax parser more then absolutely required right now, since we have plans to get rid of it: #55.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mentioned bug was now fixed and the fix was released as 0.8.1.
Please update your branch (I can also do that if you prefer), after which I assume some tests would fail. The solution should be to copy the entitymap.
Since we can not just rely on Object.assign
I will provide something similar as part of lib/conventions.js
in the next days (in a separate PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has been done as part of #379
var value = source.substring(declStart, delimEnd); | ||
// NOTE: This value is not further processed, but treated as PCDATA. | ||
// If recursive processing is implemented, ENSURE to avoid an XML Bomb | ||
// attack vector! | ||
domBuilder.internalEntityDecl(name, value); | ||
var nextTagStart = source.indexOf('<', end); | ||
var dtdEnd = source.indexOf(']>', end); | ||
if (dtdEnd < nextTagStart) { | ||
domBuilder.endDTD(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just found the following section of the spec:
https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-predefined-ent
All XML processors must recognize these entities whether they are declared or not. For interoperability, valid XML documents should declare these entities, like any others, before using them. If the entities lt or amp are declared, they must be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is required for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they must be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is optional but harmless). For example:
<!ENTITY lt "&#60;">
<!ENTITY gt ">">
<!ENTITY amp "&#38;">
<!ENTITY apos "'">
<!ENTITY quot """>
Does the current implementation support this? (We should add a test if it's not already there.)
I didn't look into the tests yet, but I wonder if it could make sense to at least do a single replacement run as part of domBuilder.internalEntityDecl
since it has access to the existing entityMap
. Or would that already open up the "can of worms"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, no, this specific recommended re-declaration isn't technically supported (and there is apparently no test for it). Of course these are predefined so they are already available (and I cannot recall seeing this re-declaration in practise, at least not internally in documents).
As I read this, it seems that while re-declarations are allowed to issue warnings, these specific characters (being the XML_ENTITIES
) are allowed to be re-declared by these exact values. That could be hard-coded, I guess, now or later on, to avoid the then improper warning if this is done.
And yes, I believe doing a replacement run on the entity values could open up that can. The Billion laughs attack doesn't rely upon recursion, but upon exponential growth, which this would enable. It should then be combined with e.g. monitoring of value growth, such as by carefully counting expansions, as libxml2 does with its entity reference loop detection (and corresponding --huge
option to xmllint
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for elaborating on this topic, very much appreciated.
After reading the Wikipedia article it became obvious.
I fully agree that there is no need for implementing this "exception" of the "first definition is binding" rule for XML entities, since the only allowed option is to redefine them to the same value, so in the worst case there is a warning about this not being supported, but the values still being correct.
I just started to think about (X)HTML and entities.
- XHTML both get's the html entities and is otherwise working like XML, so that shouldn't be a problem.
- Since the living HTML spec clearly defines what the doctype can contain:
https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
we should check what the browser does if it is provided a doctype with entities and the html mime type, and act accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my confusion, currently xmldom basically only knows about XHTML and treats HTML in the "same way".
So this will be part of #338
lib/sax.js
Outdated
var chunk = source.substring(start+8, end); | ||
var match = chunk.match(/\s+(\w+)\s+(["'])/); | ||
if (!match) { | ||
errorHandler.error("Malformed internal entity declaration"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Would that trigger for all the "unsupported" DTD cases?
Currently we are just ignoring those elements, if we now report an error, this might lead to "breaking behavior".
Which is a decision we can make, but we need to consider our options.
Or asked differently: Is there some other available entity syntax that we would consider "legit" but is actually not part of the supported cases and could cause any weird issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are absolutely right; there are other forms and only due to a sloppy regexp did this not trip up on those. I've corrected that in 0d665a5 and removed the comment in favour of just ignoring them for the time being.
"actual": "<!DOCTYPE doc><doc>A</doc>", | ||
"expected": "<doc>A</doc>", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: This is something I might consider a "failing test", since previously the lib was yielding the expected result (not added to the snapshot twice when actual === expected
), and now there is additional stuff.
Of course we completely ignored/removed the doctype from the input, and now it's considered, so we might decie to live with it.
But I agree that this needs some checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From all the information I could find from the perspective of specs, I couldn't find anything related to dropping the DOCTYPE
under certain conditions:
https://w3c.github.io/DOM-Parsing/#dfn-xml-serializing-a-documenttype-node
I'm assuming the the "expected" value for some reason never contains the doctype, even if it' present in the input.
So from that perspective I'm assuming this isn't an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering about that too; i.e. why the tests are so defined. It seems the behaviour now (to output the doctype declaration even if empty) is in line with what both libxml2 and browsers do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only clue I have is that the author of those test cases is also talking about "Canonical XML" and that this might be the reason why the doctype is not part of the expected output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following is part of the xmltest.zip
(file canonxml.html
) bundled as part of the xmltest
lib we use:
Canonical XML
This document defines a subset of XML called canonical XML. The intended use of canonical XML is in testing XML processors, as a representation of the result of parsing an XML document.
Every well-formed XML document has a unique structurally equivalent canonical XML document. Two structurally equivalent XML documents have a byte-for-byte identical canonical XML document. Canonicalizing an XML document requires only information that an XML processor is required to make available to an application.
A canonical XML document conforms to the following grammar:
CanonXML ::= Pi* element Pi*
element ::= Stag (Datachar | Pi | element)* Etag
Stag ::= '<' Name Atts '>'
Etag ::= '</' Name '>'
Pi ::= '<?' Name ' ' (((Char - S) Char*)? - (Char* '?>' Char*)) '?>'
Atts ::= (' ' Name '=' '"' Datachar* '"')*
Datachar ::= '&' | '<' | '>' | '"'
| '	'| ' '| ' '
| (Char - ('&' | '<' | '>' | '"' | #x9 | #xA | #xD))
Name ::= (see XML spec)
Char ::= (see XML spec)
S ::= (see XML spec)
Attributes are in lexicographical order (in Unicode bit order).
A canonical XML document is encoded in UTF-8.
Ignorable white space is considered significant and is treated equivalently to data.
No doctype present, so we should drop it from the test output after serialization (instead of the test input), to get closer to the expected values.
Make a more careful check for Parsed Entity (PEDef) forms to skip embedded markup.
var match = chunk.match(/^\s+(?:(%)\s+)?(\S+)\s+(["'])/); | ||
if (!match) { | ||
// NOTE: Ignoring unhandled forms of entity declarations | ||
return -1; | ||
} | ||
var declStart = start+8+match[0].length; | ||
var name = match[2]; | ||
var delim = match[3]; | ||
var delimEnd = source.indexOf(delim, declStart); | ||
if (delimEnd > end) { | ||
end = source.indexOf('>', delimEnd); | ||
} | ||
if (match[1] === '%') { | ||
// NOTE: Ignoring the PEDef form of entity declaration after forwarding | ||
// to the declaratino end. | ||
return end; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For landing this PR there are two things I would like to have added regarding this:
- add
errorHandler.warn
for the unsupported cases - unit tests for "all" the supported and unsupported "type of cases" we can think of.
I assume we should at least cover all the different cases that are present innode_modules/xmltest/xmltest.zip
, I can support you in collecting/unifying them if required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm short on time at the moment alas, so I'd appreciate some pointers for collecting those cases, and/or suggestions for where to add new tests (even if that is just a link to some contribution guideline I may have missed).
I also agree that warnings are prudent for the range of unsupported entity forms. I just want to know how to balance that with the current "lenient/silent on unsupported" approach in other parts of this code. "Leave things in a better state than when entered" is perfectly viable here, I mostly want to know where to set the bar. (There is the risk of needing to handle the nested DOCTYPE state just to emit warnings. I think we can dodge that, to avoid increasing the SAX parser complexity if it is slated for a bigger overhaul or replacement.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started looking into the (number of) cases we need to test vs support.
Even though I'm not done yet, I start to have a feeling of what we are talking about.
I will dig more into "expanding" the top level of intSubset
, it's so crazy what is allowed in there...
Note:
- Anything in backticks uses the notation use din the xml1.1 spec as linked in the gist
- space literal not wrapped with backticks is equivalent to mandatory
S
, which includes line breaks) - The source of how I derived those values is at https://gist.github.com/karfau/e224d99441b556007c4b01b4f5f2f981
Regarding the !DOCTYPE
tag:
/// already supported
<!DOCTYPE `Name``S?`>
<!DOCTYPE `Name` SYSTEM "`[^"]*`"`S?`>
<!DOCTYPE `Name` SYSTEM '`[^']*`'`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" "`[^"]*`"`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' "`[^"]*`"`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" '`[^']*`'`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' '`[^']*`'`S?`>
/// newly supported
<!DOCTYPE `Name``S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` SYSTEM "`[^"]*`"`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` SYSTEM '`[^']*`'`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" "`[^"]*`"`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' "`[^"]*`"`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" '`[^']*`'`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' '`[^']*`'`S?`[`intSubset`]`S?`>
regarding the !ENTITY
declarations:
/// supported
<!ENTITY `Name` "`([^%&"] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`>
<!ENTITY `Name` '`([^%&'] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`>
// with every option just once:
<!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`>
<!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`>
<!ENTITY `Name` "%`Name`;"`S?`> || <!ENTITY `Name` '%`Name`;'`S?`>
<!ENTITY `Name` "&`Name`;"`S?`> || <!ENTITY `Name` '&`Name`;'`S?`>
<!ENTITY `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY `Name` '&#`[0-9]+`;'`S?`>
<!ENTITY `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY `Name` '&#x`[0-9a-fA-F]+`;'`S?`>
/// unsupported
<!ENTITY `Name` SYSTEM "`[^"]*`"`S?`>
<!ENTITY `Name` SYSTEM ''`[^']*`'`S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`"`S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`"`S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`'`S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`'`S?`>
<!ENTITY `Name` SYSTEM "`[^"]*`" NDATA `Name``S?`>
<!ENTITY `Name` SYSTEM "`[^']*`' NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`" NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`" NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`' NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`' NDATA `Name``S?`>
/// aka Parameter entities
<!ENTITY % `Name` "`([^%&"] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`>
<!ENTITY % `Name` '`([^%&'] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`>
// with every option just once:
<!ENTITY % `Name` ""`S?`> || <!ENTITY % `Name` ''`S?`>
<!ENTITY % `Name` "`[^%&"]*`"`S?`> || <!ENTITY % `Name` '`[^%&"]*`'`S?`>
<!ENTITY % `Name` "%`Name`;"`S?`> || <!ENTITY % `Name` '%`Name`;'`S?`>
<!ENTITY % `Name` "&`Name`;"`S?`> || <!ENTITY % `Name` '&`Name`;'`S?`>
<!ENTITY % `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY % `Name` '&#`[0-9]+`;'`S?`>
<!ENTITY % `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY % `Name` '&#x`[0-9a-fA-F]+`;'`S?`>
<!ENTITY % `Name` SYSTEM "`[^"]*`"`S?`>
<!ENTITY % `Name` SYSTEM '`[^']*`'`S?`>
<!ENTITY % `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`"`S?`>
<!ENTITY % `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`"`S?`>
<!ENTITY % `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`'`S?`>
<!ENTITY % `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`'`S?`>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think that https://github.com/xmldom/xmldom/blob/master/test/parse/doctype.test.js is a perfect place for those additional tests.
I assume that I will soon be able to provide some samples of what I have in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just invested at least another hour into doing the same for intSubset
, this is sooo crazy I don't want to go any deeper. I'm now quite confident that we have to pick the following strategy for anything unsupported (like the whole [intSubset]
was before: xmldom should continue to fail hard when anything unsupported / unexpected appears.
I also think that the current implementation doesn't really support everything I marked as supported in the ENTITY
part above, but rather the following is supported:
<!ENTITY `Name` "`([^%&"] | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`>
<!ENTITY `Name` '`([^%&'] | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`>
// with every option just once:
<!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`>
<!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`>
<!ENTITY `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY `Name` '&#`[0-9]+`;'`S?`>
<!ENTITY `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY `Name` '&#x`[0-9a-fA-F]+`;'`S?`>
or maybe even just
<!ENTITY `Name` "`([^%&"])*`"`S?`>
<!ENTITY `Name` '`([^%&'])*`'`S?`>
// with every option just once:
<!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`>
<!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`>
and everything else fails as before.
And if we don't find a reliable way to know when the intSubset
/ DOCTYPE
ends when doing that, I would argue this could even be a fatalError
/ ParseError
which throws by default.
What do you think?
And I think it's important to say that: Thank you again and again @niklasl for investing that time. I'm still hesitant to actually contribute code/tests in this PR, since the topic is not a high prio for us. If at some point you are not able/willing to continue on this PR, just make it explicit that further changes need to come from somebody else. |
since we can not rely on it being present in all supported runtimes. Even though the interface is the same as `Object.assign`, it behaves slightly differently from the one provided by browsers. This was extracted from xmldom#338 to support development in xmldom#367
👋 I have been evaluating this wonderful library, and this PR in particular. Thanks so much to all of you for wonderful, reliable open-source library. I am interested in using this for a project to parse, serialize, and de-serialize data models in XML, but the production copies make use of external entities. Here is one such example. So, a few questions about this PR:
(I totally understand why if you don't!) |
Thank you for asking @aj-stein-nist.
I'm really torn here. I was hoping that we would be able to finish and merge this PR before I'm getting that close to land a different huge PR that changes a lot of things, which will make it difficult to resolve the upcoming conflicts. But it doesn't seem to be the case, as there has not been a reply from @niklasl for two month. And it's sad, but I don't even remember all the feedback that would need to be resolved, but I think there is quite a bunch. I'm not able to work on this topic, but if somebody wants to pick this up we might find a way to push the changes into this PR/branch. (If it helps I could also recreate this branch in the main repo.) The main "concern" with all this entity support is that we are planning to drop the current sax implementation and replace it a with a dependency that is maintained by somebody else. And as far as I remember, even the most promising candidates for that do not have any entity support.
I think #117 is the one to use for further discussion.
The second main concern with external entities is that we need to handle "How to resolve external references to files or definition?" in a way that it can be plugged in as a handler if needed, and we would still need to ignore those things when no such handler is provided. I hope this answers enough of your questions so you can decide what to do about it. PS: I think I will close this PR if there is no further information about possible contributors working on this topic within a month. |
You're welcome!
Oh ok. 😢
Ironically, I am experimenting with
This is the juicy bit, be it this library or another, that I am trying to work around in my mind to understand how to make it pluggable not completely gut an existing library or "just start over." That's why I came here to consult first.
Understandable. I will continue discussion in #117 as needed about future work, irrespective of what happens in this PR, but continue to follow progress here of course. 👋 |
Yes, that is what I was trying to link to |
I'm sorry about not being able to finish this. As I do not actively use this library and am quite swamped at work (as usual), I haven't had any dedicated time for further work here. I'm also torn since I was hoping the changes as they are were mostly better than not being able to parse internal entities at all. The added requirements to land this are solid in themselves, but this library doesn't seem to have the plumbing in place for those requirements, and as it stands today silently ignores internal DTD parts, so it is much more of an undertaking I suspect. That seems to require a level of dedication (in my spare time) that I've found hard to motivate given that it seems this might be scrapped when the rewrite takes place? |
@niklasl Thank you for taking all the time you did and also for taking the time to respond here. Just in case anybody want's to pick this up anyways, I decided to push your branch into the main repo as is: https://github.com/xmldom/xmldom/tree/internal-entity-declarations So you don't have to keep this branch/repo of yours, but the current state of the changes is preserved and can be "forked". |
Thank you @karfau I think this is reasonable. I may have another go at this, targeting the new sax parser once it is in place (either by contributing to that, or, depending on its level of extensibility, by configuring it from within xmldom). Is it possible to note in #117 when the progress of integrating the new parser is stable enough for such an endeavour to begin? (I suppose it might take time if #55 is still on the drawing board.) |
@niklasl @aj-stein-nist FYI we are currently working on a related PR: #498 |
This handles the basic (and fairly common) use of internal DTD entity declarations in RDF/XML. Example:
Note that this needs:
When ready, this should solve the basic case of #117.