Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse internal entity declarations in internal DTD #367

Closed
wants to merge 9 commits into from

Conversation

niklasl
Copy link

@niklasl niklasl commented Jan 22, 2022

This handles the basic (and fairly common) use of internal DTD entity declarations in RDF/XML. Example:

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
    <!ENTITY rdf 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
    <!ENTITY sdo 'http://schema.org/'>
]>
<rdf:RDF xmlns:rdf="&rdf;"
         xmlns="&sdo;"
         xml:base="http://example.org/">
    <CreativeWork rdf:about="/doc">
        <name xml:lang="en">A Document &amp; a &quot;name&quot;</name>
        <dateCreated rdf:datatype="&sdo;Date">2016-07-10T14:34:50+0200</dateCreated>
    </CreativeWork>
</rdf:RDF>

Note that this needs:

  • Some unit testing. (I just threw the case above at my local instance of LDTR, using this modified xmldom code.)
  • Overseeing the way I handled the entityMap.
  • Actually matching the root element (QName) to be conformant.
  • Ensurance that this doesn't lead to an XML Bomb attack vector. Tthis specific handling is non-recursive and safe as written. But if logic for external references or recursive entities would be added this quickly gets complex and dangerous.

When ready, this should solve the basic case of #117.

@karfau
Copy link
Member

karfau commented Jan 22, 2022

@niklasl thank you for the initiative, since this is still a draft PR I will not look into the details yet, but will approve the workflow runs.

I agree that we need some tests for this stuff, if you need help to decide what to test or where to put them, just let me know.

Are you aware of the issue regarding multiline doctype not being supported?
I'm currently assuming that your PR might also need to solve that (if it doesn't already).

If it does we might also be able to drop this workaround from our tests:

// TODO: The DOCTYPE totally confuses xmldom :sic:
// for now we remove it and any newlines after it so we have reasonable tests
.replace(/^<!DOCTYPE doc \[[^\]]+]>[\r\n]*/m, '')

Don't be afraid if there are failing tests, this can also happen when we are improving our support of the XML spec.

For a "final"/"reviewable" part of the PR I would prefer if there are no new TODO comments in the code and any new methods are properly documented (e.g. by copying method descriptions and links from the related specs).

Please be also aware of #338 which separates XML from HTML parsing and might already have conflicting changes. But if your's is ready faster, I will be able to integrate your changes into that branch. And I'm not really actively working on that branch right now. So you chances are good.

lib/sax.js Outdated Show resolved Hide resolved
@karfau
Copy link
Member

karfau commented Jan 22, 2022

Another aspect that just occured to me: How relevant is what this produces when it is serialized?

  • Are we fine withe the entities no longer being used?
  • Is the doctype serialized as is?

Not sure how relevant those things are and I think we should again check how brosers are dealing with it.

Co-authored-by: Christian Bewernitz <coder@karfau.de>
@niklasl
Copy link
Author

niklasl commented Jan 22, 2022

Thank you @karfau! I'll take a look at finalizing this. It's great to have your eyes on it as it stands though; so feel free to add more suggestions even in draft state. I'd like to make a small enough change for just the internal entity subset, but of course a consistent and thorough one at that.

I'm not fully aware of the multiline doctype issue, but I suspect it's related to the TODO about adding a parseInternalDTD? I agree that this part ought to be done before merging.

From my perspective, expanding the entities upon parsing and thus not keeping and serializing the entity declarations is fine. Good question about the doctype though, need to check how these changes affect that. (And yes, compare the behaviour with browser implementations.)

@niklasl
Copy link
Author

niklasl commented Jan 23, 2022

So, throwing some XML at xmllint --noent indicates that we might get away with adding the entities to the simple entityMap; thus not having to map the entities to a named doctype and picking that with the root element name. I think this simplifies the minimum fix here.

The serializer writes out the empty DTD though, which is valid but pointless I guess. We could add a check in that portion for empty publicID and systemID and just omit it, FWIW, either at parse or write time. Any opinions here?

I'd suggest doing a more comprehensive DTD handler as a more major task. That might entail rewriting the parser a bit, as I believe e.g. extracting a parseInternalDTD from parseDCC needs nested state handling. I'll probably remove the TODO for that instead to avoid making this PR more involved than called for. (The conditional (state-delegated) endDTD call is the one thing I would have liked to improve on, but it calls for this additional level of parser state to handle properly, I think.)

(This fix only deals with simple string values for entities anyway, mostly because subparsing the string values as XML fragments is complex, not the least due to the XML Bomb risk if done carelessly. Writing a warning about that in the code makes sense too.)

That'd make editing test/xmltest/valid.test.js and test/xmltest/__snapshots__/valid.test.js.snap to pass the one thing left before this is ready for review, from my point of view.

@niklasl
Copy link
Author

niklasl commented Jan 23, 2022

Expanding all entities and writing out an empty DTD is precisely what both Firefox and Chromium do too. Thus I'll keep this behaviour and adapt the tests to reflect this.

This reflects that the internal XML entities are now parsed and
expanded, and that an empty DOCTYPE is output.

Although the xmltest specs (from James Clark) do not contain these empty
DOCTYPE declarations, apparently Firefox and Chromium-based web browsers
do.
These replace the TODO:s about adding a parseInternalDTD method (as that
would require a more complex change).
This reflects the now added somewhat lax internal DOCTYPE processing,
which now yields documents even with certain non-well-formed DOCTYPE:s.
@niklasl niklasl marked this pull request as ready for review January 23, 2022 18:09
@karfau
Copy link
Member

karfau commented Jan 24, 2022

(This fix only deals with simple string values for entities anyway, mostly because subparsing the string values as XML fragments is complex, not the least due to the XML Bomb risk if done carelessly. Writing a warning about that in the code makes sense too.)

Well this definitely sounds like an "attack vector"(?), there is nothing preventing somebody from putting an XML string into an entity (and I would guess that there is even an example for it in the xmltest folder/snapshots, so we should also have a test for that even if it just shows that everything is just properly escaped to not introduce new xml nodes this way. (If this is of course the specified behavior of what should happen, we need to document that this is currently not supported...)

@@ -243,6 +248,11 @@ DOMHandler.prototype = {
this.doc.doctype = dt;
}
},

internalEntityDecl:function(name, value) {
this.entityMap[name] = value;
Copy link
Member

@karfau karfau Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking out loud without checking any spec:

  • Is it reasonable to allow overriding existing declarations?

The Name identifies the entity in an entity reference or, in the case of an unparsed entity, in the value of an ENTITY or ENTITIES attribute. If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor may issue a warning if entities are declared multiple times.

https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-entity-decl

  • What about an entity with the name prototype?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question and spec reference! Since the default entityMaps defined in lib/entities.js are frozen (using Object.freeze from lib/conventions.js), this kind of works right now; but only the predefined entities, not if repeating a declaration.

So I added an explicit check and redeclaration warning in 311ffc3.

The prototype is defined on the type, not in the prototype, so an undefined 'prototype' key returns undefined (and a &prototype; entity can be defined in an entity declaration).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the prototypical inheritance is an old manoeuvre which ought to be efficient. But at least moving it out to a utility could be wise. I did consider using Object.assign here, but AFAIK that is an ES6 feature. (If a future overhaul decides to use e.g. hasOwnProperty tests, or replace the entityMap with an ES6 Map here, this would of course have to be reworked.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the default entityMaps defined in lib/entities.js are frozen (using Object.freeze from lib/conventions.js)

Please be aware it will not be frozen in a runtime where the Object.freeze method is not available. So thank you for adding that check.

I think it would be good to add a test for that, and also the prototype entity, just to be sure it doesn't break in the future.

Copy link
Member

@karfau karfau Jan 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the prototypical inheritance is an old manoeuvre which ought to be efficient. But at least moving it out to a utility could be wise. I did consider using Object.assign here, but AFAIK that is an ES6 feature. (If a future overhaul decides to use e.g. hasOwnProperty tests, or replace the entityMap with an ES6 Map here, this would of course have to be reworked.)

I just played with this approach a bit and after checking how the entityMap is currently used to replace entities filed the bug #370.
So once that bug is fixed your current solution no longer works.

I assume we will not get around copying the entities. (I wouldn't want to touch the sax parser more then absolutely required right now, since we have plans to get rid of it: #55.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mentioned bug was now fixed and the fix was released as 0.8.1.

Please update your branch (I can also do that if you prefer), after which I assume some tests would fail. The solution should be to copy the entitymap.
Since we can not just rely on Object.assign I will provide something similar as part of lib/conventions.js in the next days (in a separate PR).

Copy link
Member

@karfau karfau Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has been done as part of #379

Comment on lines +572 to +581
var value = source.substring(declStart, delimEnd);
// NOTE: This value is not further processed, but treated as PCDATA.
// If recursive processing is implemented, ENSURE to avoid an XML Bomb
// attack vector!
domBuilder.internalEntityDecl(name, value);
var nextTagStart = source.indexOf('<', end);
var dtdEnd = source.indexOf(']>', end);
if (dtdEnd < nextTagStart) {
domBuilder.endDTD();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found the following section of the spec:
https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-predefined-ent

All XML processors must recognize these entities whether they are declared or not. For interoperability, valid XML documents should declare these entities, like any others, before using them. If the entities lt or amp are declared, they must be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is required for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they must be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is optional but harmless). For example:

<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">

Does the current implementation support this? (We should add a test if it's not already there.)
I didn't look into the tests yet, but I wonder if it could make sense to at least do a single replacement run as part of domBuilder.internalEntityDecl since it has access to the existing entityMap. Or would that already open up the "can of worms"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, no, this specific recommended re-declaration isn't technically supported (and there is apparently no test for it). Of course these are predefined so they are already available (and I cannot recall seeing this re-declaration in practise, at least not internally in documents).

As I read this, it seems that while re-declarations are allowed to issue warnings, these specific characters (being the XML_ENTITIES) are allowed to be re-declared by these exact values. That could be hard-coded, I guess, now or later on, to avoid the then improper warning if this is done.

And yes, I believe doing a replacement run on the entity values could open up that can. The Billion laughs attack doesn't rely upon recursion, but upon exponential growth, which this would enable. It should then be combined with e.g. monitoring of value growth, such as by carefully counting expansions, as libxml2 does with its entity reference loop detection (and corresponding --huge option to xmllint).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for elaborating on this topic, very much appreciated.
After reading the Wikipedia article it became obvious.

I fully agree that there is no need for implementing this "exception" of the "first definition is binding" rule for XML entities, since the only allowed option is to redefine them to the same value, so in the worst case there is a warning about this not being supported, but the values still being correct.

I just started to think about (X)HTML and entities.

  • XHTML both get's the html entities and is otherwise working like XML, so that shouldn't be a problem.
  • Since the living HTML spec clearly defines what the doctype can contain:
    https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
    we should check what the browser does if it is provided a doctype with entities and the html mime type, and act accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my confusion, currently xmldom basically only knows about XHTML and treats HTML in the "same way".
So this will be part of #338

lib/sax.js Outdated
var chunk = source.substring(start+8, end);
var match = chunk.match(/\s+(\w+)\s+(["'])/);
if (!match) {
errorHandler.error("Malformed internal entity declaration");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Would that trigger for all the "unsupported" DTD cases?
Currently we are just ignoring those elements, if we now report an error, this might lead to "breaking behavior".

Which is a decision we can make, but we need to consider our options.

Or asked differently: Is there some other available entity syntax that we would consider "legit" but is actually not part of the supported cases and could cause any weird issues?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are absolutely right; there are other forms and only due to a sloppy regexp did this not trip up on those. I've corrected that in 0d665a5 and removed the comment in favour of just ignoring them for the time being.

Comment on lines +306 to +307
"actual": "<!DOCTYPE doc><doc>A</doc>",
"expected": "<doc>A</doc>",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: This is something I might consider a "failing test", since previously the lib was yielding the expected result (not added to the snapshot twice when actual === expected), and now there is additional stuff.
Of course we completely ignored/removed the doctype from the input, and now it's considered, so we might decie to live with it.

But I agree that this needs some checking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From all the information I could find from the perspective of specs, I couldn't find anything related to dropping the DOCTYPE under certain conditions:
https://w3c.github.io/DOM-Parsing/#dfn-xml-serializing-a-documenttype-node

I'm assuming the the "expected" value for some reason never contains the doctype, even if it' present in the input.
So from that perspective I'm assuming this isn't an issue.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about that too; i.e. why the tests are so defined. It seems the behaviour now (to output the doctype declaration even if empty) is in line with what both libxml2 and browsers do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only clue I have is that the author of those test cases is also talking about "Canonical XML" and that this might be the reason why the doctype is not part of the expected output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following is part of the xmltest.zip (file canonxml.html) bundled as part of the xmltest lib we use:

Canonical XML

This document defines a subset of XML called canonical XML. The intended use of canonical XML is in testing XML processors, as a representation of the result of parsing an XML document.

Every well-formed XML document has a unique structurally equivalent canonical XML document. Two structurally equivalent XML documents have a byte-for-byte identical canonical XML document. Canonicalizing an XML document requires only information that an XML processor is required to make available to an application.

A canonical XML document conforms to the following grammar:

CanonXML    ::= Pi* element Pi*
element     ::= Stag (Datachar | Pi | element)* Etag
Stag        ::= '<'  Name Atts '>'
Etag        ::= '</' Name '>'
Pi          ::= '<?' Name ' ' (((Char - S) Char*)? - (Char* '?>' Char*)) '?>'
Atts        ::= (' ' Name '=' '"' Datachar* '"')*
Datachar    ::= '&amp;' | '&lt;' | '&gt;' | '&quot;'
                 | '&#9;'| '&#10;'| '&#13;'
                 | (Char - ('&' | '<' | '>' | '"' | #x9 | #xA | #xD))
Name        ::= (see XML spec)
Char        ::= (see XML spec)
S           ::= (see XML spec)

Attributes are in lexicographical order (in Unicode bit order).

A canonical XML document is encoded in UTF-8.

Ignorable white space is considered significant and is treated equivalently to data.

No doctype present, so we should drop it from the test output after serialization (instead of the test input), to get closer to the expected values.

Make a more careful check for Parsed Entity (PEDef) forms to skip
embedded markup.
Comment on lines +560 to +576
var match = chunk.match(/^\s+(?:(%)\s+)?(\S+)\s+(["'])/);
if (!match) {
// NOTE: Ignoring unhandled forms of entity declarations
return -1;
}
var declStart = start+8+match[0].length;
var name = match[2];
var delim = match[3];
var delimEnd = source.indexOf(delim, declStart);
if (delimEnd > end) {
end = source.indexOf('>', delimEnd);
}
if (match[1] === '%') {
// NOTE: Ignoring the PEDef form of entity declaration after forwarding
// to the declaratino end.
return end;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For landing this PR there are two things I would like to have added regarding this:

  • add errorHandler.warn for the unsupported cases
  • unit tests for "all" the supported and unsupported "type of cases" we can think of.
    I assume we should at least cover all the different cases that are present in node_modules/xmltest/xmltest.zip, I can support you in collecting/unifying them if required.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm short on time at the moment alas, so I'd appreciate some pointers for collecting those cases, and/or suggestions for where to add new tests (even if that is just a link to some contribution guideline I may have missed).

I also agree that warnings are prudent for the range of unsupported entity forms. I just want to know how to balance that with the current "lenient/silent on unsupported" approach in other parts of this code. "Leave things in a better state than when entered" is perfectly viable here, I mostly want to know where to set the bar. (There is the risk of needing to handle the nested DOCTYPE state just to emit warnings. I think we can dodge that, to avoid increasing the SAX parser complexity if it is slated for a bigger overhaul or replacement.)

Copy link
Member

@karfau karfau Feb 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started looking into the (number of) cases we need to test vs support.
Even though I'm not done yet, I start to have a feeling of what we are talking about.
I will dig more into "expanding" the top level of intSubset, it's so crazy what is allowed in there...

Note:

Regarding the !DOCTYPE tag:

/// already supported
<!DOCTYPE `Name``S?`>
<!DOCTYPE `Name` SYSTEM "`[^"]*`"`S?`>
<!DOCTYPE `Name` SYSTEM '`[^']*`'`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" "`[^"]*`"`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' "`[^"]*`"`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" '`[^']*`'`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' '`[^']*`'`S?`>
/// newly supported
<!DOCTYPE `Name``S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` SYSTEM "`[^"]*`"`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` SYSTEM '`[^']*`'`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" "`[^"]*`"`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' "`[^"]*`"`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC "`PubidChar*`" '`[^']*`'`S?`[`intSubset`]`S?`>
<!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' '`[^']*`'`S?`[`intSubset`]`S?`>

regarding the !ENTITY declarations:

/// supported 
<!ENTITY `Name` "`([^%&"] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`>
<!ENTITY `Name` '`([^%&'] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`>
  // with every option just once:
  <!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`>
  <!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`>
  <!ENTITY `Name` "%`Name`;"`S?`> || <!ENTITY `Name` '%`Name`;'`S?`>
  <!ENTITY `Name` "&`Name`;"`S?`> || <!ENTITY `Name` '&`Name`;'`S?`>
  <!ENTITY `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY `Name` '&#`[0-9]+`;'`S?`>
  <!ENTITY `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY `Name` '&#x`[0-9a-fA-F]+`;'`S?`>

/// unsupported
<!ENTITY `Name` SYSTEM "`[^"]*`"`S?`>
<!ENTITY `Name` SYSTEM ''`[^']*`'`S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`"`S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`"`S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`'`S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`'`S?`>
<!ENTITY `Name` SYSTEM "`[^"]*`" NDATA `Name``S?`>
<!ENTITY `Name` SYSTEM "`[^']*`' NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`" NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`" NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`' NDATA `Name``S?`>
<!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`' NDATA `Name``S?`>
/// aka Parameter entities
<!ENTITY % `Name` "`([^%&"] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`>
<!ENTITY % `Name` '`([^%&'] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`>
  // with every option just once:
  <!ENTITY % `Name` ""`S?`> || <!ENTITY % `Name` ''`S?`>
  <!ENTITY % `Name` "`[^%&"]*`"`S?`> || <!ENTITY % `Name` '`[^%&"]*`'`S?`>
  <!ENTITY % `Name` "%`Name`;"`S?`> || <!ENTITY % `Name` '%`Name`;'`S?`>
  <!ENTITY % `Name` "&`Name`;"`S?`> || <!ENTITY % `Name` '&`Name`;'`S?`>
  <!ENTITY % `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY % `Name` '&#`[0-9]+`;'`S?`>
  <!ENTITY % `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY % `Name` '&#x`[0-9a-fA-F]+`;'`S?`>
<!ENTITY % `Name` SYSTEM "`[^"]*`"`S?`>
<!ENTITY % `Name` SYSTEM '`[^']*`'`S?`>
<!ENTITY % `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`"`S?`>
<!ENTITY % `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`"`S?`>
<!ENTITY % `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`'`S?`>
<!ENTITY % `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`'`S?`>

Copy link
Member

@karfau karfau Feb 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think that https://github.com/xmldom/xmldom/blob/master/test/parse/doctype.test.js is a perfect place for those additional tests.

I assume that I will soon be able to provide some samples of what I have in mind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just invested at least another hour into doing the same for intSubset, this is sooo crazy I don't want to go any deeper. I'm now quite confident that we have to pick the following strategy for anything unsupported (like the whole [intSubset] was before: xmldom should continue to fail hard when anything unsupported / unexpected appears.

I also think that the current implementation doesn't really support everything I marked as supported in the ENTITY part above, but rather the following is supported:

<!ENTITY `Name` "`([^%&"] | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`>
<!ENTITY `Name` '`([^%&'] | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`>
  // with every option just once:
  <!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`>
  <!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`>
  <!ENTITY `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY `Name` '&#`[0-9]+`;'`S?`>
  <!ENTITY `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY `Name` '&#x`[0-9a-fA-F]+`;'`S?`>

or maybe even just

<!ENTITY `Name` "`([^%&"])*`"`S?`>
<!ENTITY `Name` '`([^%&'])*`'`S?`>
  // with every option just once:
  <!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`>
  <!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`>

and everything else fails as before.

And if we don't find a reliable way to know when the intSubset/ DOCTYPE ends when doing that, I would argue this could even be a fatalError / ParseError which throws by default.

What do you think?

@karfau
Copy link
Member

karfau commented Jan 27, 2022

And I think it's important to say that: Thank you again and again @niklasl for investing that time.

I'm still hesitant to actually contribute code/tests in this PR, since the topic is not a high prio for us.

If at some point you are not able/willing to continue on this PR, just make it explicit that further changes need to come from somebody else.

karfau added a commit to karfau/xmldom that referenced this pull request Feb 15, 2022
since we can not rely on it being present in all supported runtimes.
Even though the interface is the same as `Object.assign`,
it behaves slightly differently from the one provided by browsers.

This was extracted from xmldom#338 to support development in xmldom#367
karfau added a commit that referenced this pull request Feb 15, 2022
since we can not rely on it being present in all supported runtimes.
Even though the interface is the same as `Object.assign`,
it behaves slightly differently from the one provided by browsers (see tests).

This was extracted from #338 to support development in #367
@aj-stein-nist
Copy link

👋 I have been evaluating this wonderful library, and this PR in particular. Thanks so much to all of you for wonderful, reliable open-source library. I am interested in using this for a project to parse, serialize, and de-serialize data models in XML, but the production copies make use of external entities. Here is one such example.

So, a few questions about this PR:

  1. Is it on track to be merged pending resolving current feedback be addressed, or is significantly more work required?
  2. Re 1, I know this PR is about internal DTD entity resolution, but the core plumbing of the PR would be extremely helpful (or I feel starting from scratch means a lot of serious merge conflicts) for a separate issue and PR leveraging this PR's code. a) Would you want such an issue be open? b) Will you support the implementation of such features as requested in a GH issue from a if I were to request it or you explicitly do not want external entity resolution in this library? c) Would I work on a fork of this branch or wait until this PR is merged? I stepped through my WIP code using this library a few dozen times and found this

(I totally understand why if you don't!)

@karfau
Copy link
Member

karfau commented Mar 30, 2022

Thank you for asking @aj-stein-nist.

  • Is it on track to be merged pending resolving current feedback be addressed, or is significantly more work required?

I'm really torn here. I was hoping that we would be able to finish and merge this PR before I'm getting that close to land a different huge PR that changes a lot of things, which will make it difficult to resolve the upcoming conflicts. But it doesn't seem to be the case, as there has not been a reply from @niklasl for two month.

And it's sad, but I don't even remember all the feedback that would need to be resolved, but I think there is quite a bunch.

I'm not able to work on this topic, but if somebody wants to pick this up we might find a way to push the changes into this PR/branch. (If it helps I could also recreate this branch in the main repo.)

The main "concern" with all this entity support is that we are planning to drop the current sax implementation and replace it a with a dependency that is maintained by somebody else. And as far as I remember, even the most promising candidates for that do not have any entity support.
So this might be something we have for some time but might need to drop again or reimplement/contribute into the sax parser we pick.

a) Would you want such an issue be open?

I think #117 is the one to use for further discussion.

b) Will you support the implementation of such features as requested in a GH issue from a if I were to request it or you explicitly do not want external entity resolution in this library?

The second main concern with external entities is that we need to handle "How to resolve external references to files or definition?" in a way that it can be plugged in as a handler if needed, and we would still need to ignore those things when no such handler is provided.
The main point bein that this library is used in many different environments and we will not introduce the complexity of "loading resources in various runtimes" into this library.

I hope this answers enough of your questions so you can decide what to do about it.

PS: I think I will close this PR if there is no further information about possible contributors working on this topic within a month.

@karfau karfau added the awaiting response Maintainers are waiting for information label Mar 30, 2022
@aj-stein-nist
Copy link

Thank you for asking @aj-stein-nist.

You're welcome!

I'm really torn here. I was hoping that we would be able to finish and merge this PR before I'm getting that close to land a different huge PR that changes a lot of things, which will make it difficult to resolve the upcoming conflicts. But it doesn't seem to be the case, as there has not been a reply from @niklasl for two month.

Oh ok. 😢

So this might be something we have for some time but might need to drop again or reimplement/contribute into the sax parser we pick.

Ironically, I am experimenting with sax-js right now to see if that is an easier lift (kinda, but maybe not after the last few hours today, haha). I think that is what you are suggesting with #55, right? (It got linked to a post in the discussion thread of this PR, not the issue; so I am just double-checking.)

The second main concern with external entities is that we need to handle "How to resolve external references to files or definition?" in a way that it can be plugged in as a handler if needed, and we would still need to ignore those things when no such handler is provided. The main point bein that this library is used in many different environments and we will not introduce the complexity of "loading resources in various runtimes" into this library.

This is the juicy bit, be it this library or another, that I am trying to work around in my mind to understand how to make it pluggable not completely gut an existing library or "just start over." That's why I came here to consult first.

PS: I think I will close this PR if there is no further information about possible contributors working on this topic within a month.

Understandable. I will continue discussion in #117 as needed about future work, irrespective of what happens in this PR, but continue to follow progress here of course. 👋

@karfau
Copy link
Member

karfau commented Mar 31, 2022

I think that is what you are suggesting with #55, right? (It got linked to a post in the discussion thread of this PR, not the issue; so I am just double-checking.)

Yes, that is what I was trying to link to

@niklasl
Copy link
Author

niklasl commented Apr 23, 2022

I'm sorry about not being able to finish this. As I do not actively use this library and am quite swamped at work (as usual), I haven't had any dedicated time for further work here.

I'm also torn since I was hoping the changes as they are were mostly better than not being able to parse internal entities at all. The added requirements to land this are solid in themselves, but this library doesn't seem to have the plumbing in place for those requirements, and as it stands today silently ignores internal DTD parts, so it is much more of an undertaking I suspect. That seems to require a level of dedication (in my spare time) that I've found hard to motivate given that it seems this might be scrapped when the rewrite takes place?

@karfau
Copy link
Member

karfau commented Apr 23, 2022

@niklasl
Thank you for the open and honest remarks.
I agree that it's quite hard to pull this off in the current state. And since we will most likely need to either drop the feature again or contribute this back to another sax parser once that happens, I totally understand that the time investment is not really a sustainable one.

Thank you for taking all the time you did and also for taking the time to respond here.

Just in case anybody want's to pick this up anyways, I decided to push your branch into the main repo as is: https://github.com/xmldom/xmldom/tree/internal-entity-declarations

So you don't have to keep this branch/repo of yours, but the current state of the changes is preserved and can be "forked".

@karfau karfau closed this Apr 23, 2022
@niklasl
Copy link
Author

niklasl commented Apr 24, 2022

Thank you @karfau I think this is reasonable. I may have another go at this, targeting the new sax parser once it is in place (either by contributing to that, or, depending on its level of extensibility, by configuring it from within xmldom). Is it possible to note in #117 when the progress of integrating the new parser is stable enough for such an endeavour to begin? (I suppose it might take time if #55 is still on the drawing board.)

@karfau
Copy link
Member

karfau commented Jul 3, 2023

@niklasl @aj-stein-nist FYI we are currently working on a related PR: #498

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting response Maintainers are waiting for information spec:XML https://www.w3.org/TR/xml11/
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants