Parse internal entity declarations in internal DTD #367

niklasl · 2022-01-22T16:21:29Z

This handles the basic (and fairly common) use of internal DTD entity declarations in RDF/XML. Example:

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
    <!ENTITY rdf 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
    <!ENTITY sdo 'http://schema.org/'>
]>
<rdf:RDF xmlns:rdf="&rdf;"
         xmlns="&sdo;"
         xml:base="http://example.org/">
    <CreativeWork rdf:about="/doc">
        <name xml:lang="en">A Document &amp; a &quot;name&quot;</name>
        <dateCreated rdf:datatype="&sdo;Date">2016-07-10T14:34:50+0200</dateCreated>
    </CreativeWork>
</rdf:RDF>

Note that this needs:

Some unit testing. (I just threw the case above at my local instance of LDTR, using this modified xmldom code.)
Overseeing the way I handled the entityMap.
Actually matching the root element (QName) to be conformant.
Ensurance that this doesn't lead to an XML Bomb attack vector. Tthis specific handling is non-recursive and safe as written. But if logic for external references or recursive entities would be added this quickly gets complex and dangerous.

When ready, this should solve the basic case of #117.

karfau · 2022-01-22T16:58:24Z

@niklasl thank you for the initiative, since this is still a draft PR I will not look into the details yet, but will approve the workflow runs.

I agree that we need some tests for this stuff, if you need help to decide what to test or where to put them, just let me know.

Are you aware of the issue regarding multiline doctype not being supported?
I'm currently assuming that your PR might also need to solve that (if it doesn't already).

If it does we might also be able to drop this workaround from our tests:

xmldom/test/xmltest/valid.test.js

Lines 17 to 19 in c3600ca

    
           // TODO: The DOCTYPE totally confuses xmldom :sic: 
        
           // for now we remove it and any newlines after it so we have reasonable tests 
        
           .replace(/^<!DOCTYPE doc \[[^\]]+]>[\r\n]*/m, '')

Don't be afraid if there are failing tests, this can also happen when we are improving our support of the XML spec.

For a "final"/"reviewable" part of the PR I would prefer if there are no new TODO comments in the code and any new methods are properly documented (e.g. by copying method descriptions and links from the related specs).

Please be also aware of #338 which separates XML from HTML parsing and might already have conflicting changes. But if your's is ready faster, I will be able to integrate your changes into that branch. And I'm not really actively working on that branch right now. So you chances are good.

lib/sax.js

karfau · 2022-01-22T19:59:58Z

Another aspect that just occured to me: How relevant is what this produces when it is serialized?

Are we fine withe the entities no longer being used?
Is the doctype serialized as is?

Not sure how relevant those things are and I think we should again check how brosers are dealing with it.

Co-authored-by: Christian Bewernitz <coder@karfau.de>

niklasl · 2022-01-22T23:00:25Z

Thank you @karfau! I'll take a look at finalizing this. It's great to have your eyes on it as it stands though; so feel free to add more suggestions even in draft state. I'd like to make a small enough change for just the internal entity subset, but of course a consistent and thorough one at that.

I'm not fully aware of the multiline doctype issue, but I suspect it's related to the TODO about adding a parseInternalDTD? I agree that this part ought to be done before merging.

From my perspective, expanding the entities upon parsing and thus not keeping and serializing the entity declarations is fine. Good question about the doctype though, need to check how these changes affect that. (And yes, compare the behaviour with browser implementations.)

It may have been altered by XML entity expansion.

niklasl · 2022-01-23T14:33:16Z

So, throwing some XML at xmllint --noent indicates that we might get away with adding the entities to the simple entityMap; thus not having to map the entities to a named doctype and picking that with the root element name. I think this simplifies the minimum fix here.

The serializer writes out the empty DTD though, which is valid but pointless I guess. We could add a check in that portion for empty publicID and systemID and just omit it, FWIW, either at parse or write time. Any opinions here?

I'd suggest doing a more comprehensive DTD handler as a more major task. That might entail rewriting the parser a bit, as I believe e.g. extracting a parseInternalDTD from parseDCC needs nested state handling. I'll probably remove the TODO for that instead to avoid making this PR more involved than called for. (The conditional (state-delegated) endDTD call is the one thing I would have liked to improve on, but it calls for this additional level of parser state to handle properly, I think.)

(This fix only deals with simple string values for entities anyway, mostly because subparsing the string values as XML fragments is complex, not the least due to the XML Bomb risk if done carelessly. Writing a warning about that in the code makes sense too.)

That'd make editing test/xmltest/valid.test.js and test/xmltest/__snapshots__/valid.test.js.snap to pass the one thing left before this is ready for review, from my point of view.

niklasl · 2022-01-23T16:59:41Z

Expanding all entities and writing out an empty DTD is precisely what both Firefox and Chromium do too. Thus I'll keep this behaviour and adapt the tests to reflect this.

This reflects that the internal XML entities are now parsed and expanded, and that an empty DOCTYPE is output. Although the xmltest specs (from James Clark) do not contain these empty DOCTYPE declarations, apparently Firefox and Chromium-based web browsers do.

These replace the TODO:s about adding a parseInternalDTD method (as that would require a more complex change).

This reflects the now added somewhat lax internal DOCTYPE processing, which now yields documents even with certain non-well-formed DOCTYPE:s.

karfau · 2022-01-24T18:26:21Z

(This fix only deals with simple string values for entities anyway, mostly because subparsing the string values as XML fragments is complex, not the least due to the XML Bomb risk if done carelessly. Writing a warning about that in the code makes sense too.)

Well this definitely sounds like an "attack vector"(?), there is nothing preventing somebody from putting an XML string into an entity (and I would guess that there is even an example for it in the xmltest folder/snapshots, so we should also have a test for that even if it just shows that everything is just properly escaped to not introduce new xml nodes this way. (If this is of course the specified behavior of what should happen, we need to document that this is currently not supported...)

karfau · 2022-01-24T18:28:47Z

lib/dom-parser.js

@@ -243,6 +248,11 @@ DOMHandler.prototype = {
 					this.doc.doctype = dt;
 	    }
 	},
+
+	internalEntityDecl:function(name, value) {
+		this.entityMap[name] = value;


Just thinking out loud ~~without checking any spec~~:

Is it reasonable to allow overriding existing declarations?

The Name identifies the entity in an entity reference or, in the case of an unparsed entity, in the value of an ENTITY or ENTITIES attribute. If the same entity is declared more than once, the first declaration encountered is binding; at user option, an XML processor may issue a warning if entities are declared multiple times.

https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-entity-decl

What about an entity with the name prototype?

Good question and spec reference! Since the default entityMaps defined in lib/entities.js are frozen (using Object.freeze from lib/conventions.js), this kind of works right now; but only the predefined entities, not if repeating a declaration.

So I added an explicit check and redeclaration warning in 311ffc3.

The prototype is defined on the type, not in the prototype, so an undefined 'prototype' key returns undefined (and a &prototype; entity can be defined in an entity declaration).

Also, the prototypical inheritance is an old manoeuvre which ought to be efficient. But at least moving it out to a utility could be wise. I did consider using Object.assign here, but AFAIK that is an ES6 feature. (If a future overhaul decides to use e.g. hasOwnProperty tests, or replace the entityMap with an ES6 Map here, this would of course have to be reworked.)

Since the default entityMaps defined in lib/entities.js are frozen (using Object.freeze from lib/conventions.js)

Please be aware it will not be frozen in a runtime where the Object.freeze method is not available. So thank you for adding that check.

I think it would be good to add a test for that, and also the prototype entity, just to be sure it doesn't break in the future.

Also, the prototypical inheritance is an old manoeuvre which ought to be efficient. But at least moving it out to a utility could be wise. I did consider using Object.assign here, but AFAIK that is an ES6 feature. (If a future overhaul decides to use e.g. hasOwnProperty tests, or replace the entityMap with an ES6 Map here, this would of course have to be reworked.)

I just played with this approach a bit and after checking how the entityMap is currently used to replace entities filed the bug #370.
So once that bug is fixed your current solution no longer works.

I assume we will not get around copying the entities. (I wouldn't want to touch the sax parser more then absolutely required right now, since we have plans to get rid of it: #55.)

The mentioned bug was now fixed and the fix was released as 0.8.1.

Please update your branch (I can also do that if you prefer), after which I assume some tests would fail. The solution should be to copy the entitymap.
Since we can not just rely on Object.assign I will provide something similar as part of lib/conventions.js in the next days (in a separate PR).

Has been done as part of #379

karfau · 2022-01-24T18:59:16Z

lib/sax.js

+			var value = source.substring(declStart, delimEnd);
+			// NOTE: This value is not further processed, but treated as PCDATA.
+			// If recursive processing is implemented, ENSURE to avoid an XML Bomb
+			// attack vector!
+			domBuilder.internalEntityDecl(name, value);
+			var nextTagStart = source.indexOf('<', end);
+			var dtdEnd = source.indexOf(']>', end);
+			if (dtdEnd < nextTagStart) {
+				domBuilder.endDTD();
+			}


I just found the following section of the spec:
https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-predefined-ent

All XML processors must recognize these entities whether they are declared or not. For interoperability, valid XML documents should declare these entities, like any others, before using them. If the entities lt or amp are declared, they must be declared as internal entities whose replacement text is a character reference to the respective character (less-than sign or ampersand) being escaped; the double escaping is required for these entities so that references to them produce a well-formed result. If the entities gt, apos, or quot are declared, they must be declared as internal entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is optional but harmless). For example:

<!ENTITY lt "&#60;"> <!ENTITY gt ">"> <!ENTITY amp "&#38;"> <!ENTITY apos "'"> <!ENTITY quot """>

Does the current implementation support this? (We should add a test if it's not already there.)
I didn't look into the tests yet, but I wonder if it could make sense to at least do a single replacement run as part of domBuilder.internalEntityDecl since it has access to the existing entityMap. Or would that already open up the "can of worms"?

Ah, no, this specific recommended re-declaration isn't technically supported (and there is apparently no test for it). Of course these are predefined so they are already available (and I cannot recall seeing this re-declaration in practise, at least not internally in documents).

As I read this, it seems that while re-declarations are allowed to issue warnings, these specific characters (being the XML_ENTITIES) are allowed to be re-declared by these exact values. That could be hard-coded, I guess, now or later on, to avoid the then improper warning if this is done.

And yes, I believe doing a replacement run on the entity values could open up that can. The Billion laughs attack doesn't rely upon recursion, but upon exponential growth, which this would enable. It should then be combined with e.g. monitoring of value growth, such as by carefully counting expansions, as libxml2 does with its entity reference loop detection (and corresponding --huge option to xmllint).

Thank you for elaborating on this topic, very much appreciated.
After reading the Wikipedia article it became obvious.

I fully agree that there is no need for implementing this "exception" of the "first definition is binding" rule for XML entities, since the only allowed option is to redefine them to the same value, so in the worst case there is a warning about this not being supported, but the values still being correct.

I just started to think about (X)HTML and entities.

XHTML both get's the html entities and is otherwise working like XML, so that shouldn't be a problem.

Since the living HTML spec clearly defines what the doctype can contain:
https://html.spec.whatwg.org/multipage/syntax.html#the-doctype
we should check what the browser does if it is provided a doctype with entities and the html mime type, and act accordingly.

Sorry, my confusion, currently xmldom basically only knows about XHTML and treats HTML in the "same way".
So this will be part of #338

karfau · 2022-01-24T19:02:47Z

lib/sax.js

+			var chunk = source.substring(start+8, end);
+			var match = chunk.match(/\s+(\w+)\s+(["'])/);
+			if (!match) {
+				errorHandler.error("Malformed internal entity declaration");


Question: Would that trigger for all the "unsupported" DTD cases?
Currently we are just ignoring those elements, if we now report an error, this might lead to "breaking behavior".

Which is a decision we can make, but we need to consider our options.

Or asked differently: Is there some other available entity syntax that we would consider "legit" but is actually not part of the supported cases and could cause any weird issues?

Yes, you are absolutely right; there are other forms and only due to a sloppy regexp did this not trip up on those. I've corrected that in 0d665a5 and removed the comment in favour of just ignoring them for the time being.

karfau · 2022-01-24T19:12:29Z

test/xmltest/__snapshots__/valid.test.js.snap

+  "actual": "<!DOCTYPE doc><doc>A</doc>",
+  "expected": "<doc>A</doc>",


FYI: This is something I might consider a "failing test", since previously the lib was yielding the expected result (not added to the snapshot twice when actual === expected), and now there is additional stuff.
Of course we completely ignored/removed the doctype from the input, and now it's considered, so we might decie to live with it.

But I agree that this needs some checking.

From all the information I could find from the perspective of specs, I couldn't find anything related to dropping the DOCTYPE under certain conditions:
https://w3c.github.io/DOM-Parsing/#dfn-xml-serializing-a-documenttype-node

I'm assuming the the "expected" value for some reason never contains the doctype, even if it' present in the input.
So from that perspective I'm assuming this isn't an issue.

I was wondering about that too; i.e. why the tests are so defined. It seems the behaviour now (to output the doctype declaration even if empty) is in line with what both libxml2 and browsers do.

The only clue I have is that the author of those test cases is also talking about "Canonical XML" and that this might be the reason why the doctype is not part of the expected output.

The following is part of the xmltest.zip (file canonxml.html) bundled as part of the xmltest lib we use:

Canonical XML

This document defines a subset of XML called canonical XML. The intended use of canonical XML is in testing XML processors, as a representation of the result of parsing an XML document.

Every well-formed XML document has a unique structurally equivalent canonical XML document. Two structurally equivalent XML documents have a byte-for-byte identical canonical XML document. Canonicalizing an XML document requires only information that an XML processor is required to make available to an application.

A canonical XML document conforms to the following grammar:

CanonXML ::= Pi* element Pi* element ::= Stag (Datachar | Pi | element)* Etag Stag ::= '<' Name Atts '>' Etag ::= '</' Name '>' Pi ::= '<?' Name ' ' (((Char - S) Char*)? - (Char* '?>' Char*)) '?>' Atts ::= (' ' Name '=' '"' Datachar* '"')* Datachar ::= '&' | '<' | '>' | '"' | '	'| '
'| '' | (Char - ('&' | '<' | '>' | '"' | #x9 | #xA | #xD)) Name ::= (see XML spec) Char ::= (see XML spec) S ::= (see XML spec)

Attributes are in lexicographical order (in Unicode bit order).

A canonical XML document is encoded in UTF-8.

Ignorable white space is considered significant and is treated equivalently to data.

No doctype present, so we should drop it from the test output after serialization (instead of the test input), to get closer to the expected values.

Make a more careful check for Parsed Entity (PEDef) forms to skip embedded markup.

karfau · 2022-01-27T07:12:16Z

lib/sax.js

+			var match = chunk.match(/^\s+(?:(%)\s+)?(\S+)\s+(["'])/);
+			if (!match) {
+				// NOTE: Ignoring unhandled forms of entity declarations
+				return -1;
+			}
+			var declStart = start+8+match[0].length;
+			var name = match[2];
+			var delim = match[3];
+			var delimEnd = source.indexOf(delim, declStart);
+			if (delimEnd > end) {
+				end = source.indexOf('>', delimEnd);
+			}
+			if (match[1] === '%') {
+				// NOTE: Ignoring the PEDef form of entity declaration after forwarding
+				// to the declaratino end.
+				return end;
+			}


For landing this PR there are two things I would like to have added regarding this:

add errorHandler.warn for the unsupported cases

unit tests for "all" the supported and unsupported "type of cases" we can think of.
I assume we should at least cover all the different cases that are present in node_modules/xmltest/xmltest.zip, I can support you in collecting/unifying them if required.

I'm short on time at the moment alas, so I'd appreciate some pointers for collecting those cases, and/or suggestions for where to add new tests (even if that is just a link to some contribution guideline I may have missed).

I also agree that warnings are prudent for the range of unsupported entity forms. I just want to know how to balance that with the current "lenient/silent on unsupported" approach in other parts of this code. "Leave things in a better state than when entered" is perfectly viable here, I mostly want to know where to set the bar. (There is the risk of needing to handle the nested DOCTYPE state just to emit warnings. I think we can dodge that, to avoid increasing the SAX parser complexity if it is slated for a bigger overhaul or replacement.)

I started looking into the (number of) cases we need to test vs support.
Even though I'm not done yet, I start to have a feeling of what we are talking about.
I will dig more into "expanding" the top level of intSubset, it's so crazy what is allowed in there...

Note:

Anything in backticks uses the notation use din the xml1.1 spec as linked in the gist

space literal not wrapped with backticks is equivalent to mandatory S, which includes line breaks)

The source of how I derived those values is at https://gist.github.com/karfau/e224d99441b556007c4b01b4f5f2f981

Regarding the !DOCTYPE tag:

/// already supported <!DOCTYPE `Name``S?`> <!DOCTYPE `Name` SYSTEM "`[^"]*`"`S?`> <!DOCTYPE `Name` SYSTEM '`[^']*`'`S?`> <!DOCTYPE `Name` PUBLIC "`PubidChar*`" "`[^"]*`"`S?`> <!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' "`[^"]*`"`S?`> <!DOCTYPE `Name` PUBLIC "`PubidChar*`" '`[^']*`'`S?`> <!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' '`[^']*`'`S?`> /// newly supported <!DOCTYPE `Name``S?`[`intSubset`]`S?`> <!DOCTYPE `Name` SYSTEM "`[^"]*`"`S?`[`intSubset`]`S?`> <!DOCTYPE `Name` SYSTEM '`[^']*`'`S?`[`intSubset`]`S?`> <!DOCTYPE `Name` PUBLIC "`PubidChar*`" "`[^"]*`"`S?`[`intSubset`]`S?`> <!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' "`[^"]*`"`S?`[`intSubset`]`S?`> <!DOCTYPE `Name` PUBLIC "`PubidChar*`" '`[^']*`'`S?`[`intSubset`]`S?`> <!DOCTYPE `Name` PUBLIC '`(PubidChar - "'")*`' '`[^']*`'`S?`[`intSubset`]`S?`>

regarding the !ENTITY declarations:

/// supported <!ENTITY `Name` "`([^%&"] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`> <!ENTITY `Name` '`([^%&'] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`> // with every option just once: <!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`> <!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`> <!ENTITY `Name` "%`Name`;"`S?`> || <!ENTITY `Name` '%`Name`;'`S?`> <!ENTITY `Name` "&`Name`;"`S?`> || <!ENTITY `Name` '&`Name`;'`S?`> <!ENTITY `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY `Name` '&#`[0-9]+`;'`S?`> <!ENTITY `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY `Name` '&#x`[0-9a-fA-F]+`;'`S?`> /// unsupported <!ENTITY `Name` SYSTEM "`[^"]*`"`S?`> <!ENTITY `Name` SYSTEM ''`[^']*`'`S?`> <!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`"`S?`> <!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`"`S?`> <!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`'`S?`> <!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`'`S?`> <!ENTITY `Name` SYSTEM "`[^"]*`" NDATA `Name``S?`> <!ENTITY `Name` SYSTEM "`[^']*`' NDATA `Name``S?`> <!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`" NDATA `Name``S?`> <!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`" NDATA `Name``S?`> <!ENTITY `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`' NDATA `Name``S?`> <!ENTITY `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`' NDATA `Name``S?`> /// aka Parameter entities <!ENTITY % `Name` "`([^%&"] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`> <!ENTITY % `Name` '`([^%&'] | %`Name`; | &`Name`; | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`> // with every option just once: <!ENTITY % `Name` ""`S?`> || <!ENTITY % `Name` ''`S?`> <!ENTITY % `Name` "`[^%&"]*`"`S?`> || <!ENTITY % `Name` '`[^%&"]*`'`S?`> <!ENTITY % `Name` "%`Name`;"`S?`> || <!ENTITY % `Name` '%`Name`;'`S?`> <!ENTITY % `Name` "&`Name`;"`S?`> || <!ENTITY % `Name` '&`Name`;'`S?`> <!ENTITY % `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY % `Name` '&#`[0-9]+`;'`S?`> <!ENTITY % `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY % `Name` '&#x`[0-9a-fA-F]+`;'`S?`> <!ENTITY % `Name` SYSTEM "`[^"]*`"`S?`> <!ENTITY % `Name` SYSTEM '`[^']*`'`S?`> <!ENTITY % `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^"]*`"`S?`> <!ENTITY % `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^"]*`"`S?`> <!ENTITY % `Name` PUBLIC "`PubidChar*`" SYSTEM "`[^']*`'`S?`> <!ENTITY % `Name` PUBLIC '`(PubidChar - "'")*`' SYSTEM "`[^']*`'`S?`>

And I think that https://github.com/xmldom/xmldom/blob/master/test/parse/doctype.test.js is a perfect place for those additional tests.

I assume that I will soon be able to provide some samples of what I have in mind.

Just invested at least another hour into doing the same for intSubset, this is sooo crazy I don't want to go any deeper. I'm now quite confident that we have to pick the following strategy for anything unsupported (like the whole [intSubset] was before: xmldom should continue to fail hard when anything unsupported / unexpected appears.

I also think that the current implementation doesn't really support everything I marked as supported in the ENTITY part above, but rather the following is supported:

<!ENTITY `Name` "`([^%&"] | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`"`S?`> <!ENTITY `Name` '`([^%&'] | &#`[0-9]+`; | &#x`[0-9a-fA-F]+`;)*`'`S?`> // with every option just once: <!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`> <!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`> <!ENTITY `Name` "&#`[0-9]+`;"`S?`> || <!ENTITY `Name` '&#`[0-9]+`;'`S?`> <!ENTITY `Name` "&#x`[0-9a-fA-F]+`;"`S?`> || <!ENTITY `Name` '&#x`[0-9a-fA-F]+`;'`S?`>

or maybe even just

<!ENTITY `Name` "`([^%&"])*`"`S?`> <!ENTITY `Name` '`([^%&'])*`'`S?`> // with every option just once: <!ENTITY `Name` ""`S?`> || <!ENTITY `Name` ''`S?`> <!ENTITY `Name` "`[^%&"]*`"`S?`> || <!ENTITY `Name` '`[^%&"]*`'`S?`>

and everything else fails as before.

And if we don't find a reliable way to know when the intSubset/ DOCTYPE ends when doing that, I would argue this could even be a fatalError / ParseError which throws by default.

What do you think?

karfau · 2022-01-27T07:15:38Z

And I think it's important to say that: Thank you again and again @niklasl for investing that time.

I'm still hesitant to actually contribute code/tests in this PR, since the topic is not a high prio for us.

If at some point you are not able/willing to continue on this PR, just make it explicit that further changes need to come from somebody else.

since we can not rely on it being present in all supported runtimes. Even though the interface is the same as `Object.assign`, it behaves slightly differently from the one provided by browsers. This was extracted from xmldom#338 to support development in xmldom#367

since we can not rely on it being present in all supported runtimes. Even though the interface is the same as `Object.assign`, it behaves slightly differently from the one provided by browsers (see tests). This was extracted from #338 to support development in #367

aj-stein-nist · 2022-03-29T21:36:45Z

👋 I have been evaluating this wonderful library, and this PR in particular. Thanks so much to all of you for wonderful, reliable open-source library. I am interested in using this for a project to parse, serialize, and de-serialize data models in XML, but the production copies make use of external entities. Here is one such example.

So, a few questions about this PR:

Is it on track to be merged pending resolving current feedback be addressed, or is significantly more work required?
Re 1, I know this PR is about internal DTD entity resolution, but the core plumbing of the PR would be extremely helpful (or I feel starting from scratch means a lot of serious merge conflicts) for a separate issue and PR leveraging this PR's code. a) Would you want such an issue be open? b) Will you support the implementation of such features as requested in a GH issue from a if I were to request it or you explicitly do not want external entity resolution in this library? c) Would I work on a fork of this branch or wait until this PR is merged? I stepped through my WIP code using this library a few dozen times and found this

(I totally understand why if you don't!)

karfau · 2022-03-30T20:07:28Z

Thank you for asking @aj-stein-nist.

Is it on track to be merged pending resolving current feedback be addressed, or is significantly more work required?

I'm really torn here. I was hoping that we would be able to finish and merge this PR before I'm getting that close to land a different huge PR that changes a lot of things, which will make it difficult to resolve the upcoming conflicts. But it doesn't seem to be the case, as there has not been a reply from @niklasl for two month.

And it's sad, but I don't even remember all the feedback that would need to be resolved, but I think there is quite a bunch.

I'm not able to work on this topic, but if somebody wants to pick this up we might find a way to push the changes into this PR/branch. (If it helps I could also recreate this branch in the main repo.)

The main "concern" with all this entity support is that we are planning to drop the current sax implementation and replace it a with a dependency that is maintained by somebody else. And as far as I remember, even the most promising candidates for that do not have any entity support.
So this might be something we have for some time but might need to drop again or reimplement/contribute into the sax parser we pick.

a) Would you want such an issue be open?

I think #117 is the one to use for further discussion.

b) Will you support the implementation of such features as requested in a GH issue from a if I were to request it or you explicitly do not want external entity resolution in this library?

The second main concern with external entities is that we need to handle "How to resolve external references to files or definition?" in a way that it can be plugged in as a handler if needed, and we would still need to ignore those things when no such handler is provided.
The main point bein that this library is used in many different environments and we will not introduce the complexity of "loading resources in various runtimes" into this library.

I hope this answers enough of your questions so you can decide what to do about it.

PS: I think I will close this PR if there is no further information about possible contributors working on this topic within a month.

aj-stein-nist · 2022-03-30T21:54:58Z

Thank you for asking @aj-stein-nist.

You're welcome!

I'm really torn here. I was hoping that we would be able to finish and merge this PR before I'm getting that close to land a different huge PR that changes a lot of things, which will make it difficult to resolve the upcoming conflicts. But it doesn't seem to be the case, as there has not been a reply from @niklasl for two month.

Oh ok. 😢

So this might be something we have for some time but might need to drop again or reimplement/contribute into the sax parser we pick.

Ironically, I am experimenting with sax-js right now to see if that is an easier lift (kinda, but maybe not after the last few hours today, haha). I think that is what you are suggesting with #55, right? (It got linked to a post in the discussion thread of this PR, not the issue; so I am just double-checking.)

The second main concern with external entities is that we need to handle "How to resolve external references to files or definition?" in a way that it can be plugged in as a handler if needed, and we would still need to ignore those things when no such handler is provided. The main point bein that this library is used in many different environments and we will not introduce the complexity of "loading resources in various runtimes" into this library.

This is the juicy bit, be it this library or another, that I am trying to work around in my mind to understand how to make it pluggable not completely gut an existing library or "just start over." That's why I came here to consult first.

PS: I think I will close this PR if there is no further information about possible contributors working on this topic within a month.

Understandable. I will continue discussion in #117 as needed about future work, irrespective of what happens in this PR, but continue to follow progress here of course. 👋

karfau · 2022-03-31T03:47:22Z

I think that is what you are suggesting with #55, right? (It got linked to a post in the discussion thread of this PR, not the issue; so I am just double-checking.)

Yes, that is what I was trying to link to

niklasl · 2022-04-23T17:11:31Z

I'm sorry about not being able to finish this. As I do not actively use this library and am quite swamped at work (as usual), I haven't had any dedicated time for further work here.

I'm also torn since I was hoping the changes as they are were mostly better than not being able to parse internal entities at all. The added requirements to land this are solid in themselves, but this library doesn't seem to have the plumbing in place for those requirements, and as it stands today silently ignores internal DTD parts, so it is much more of an undertaking I suspect. That seems to require a level of dedication (in my spare time) that I've found hard to motivate given that it seems this might be scrapped when the rewrite takes place?

karfau · 2022-04-23T21:03:54Z

@niklasl
Thank you for the open and honest remarks.
I agree that it's quite hard to pull this off in the current state. And since we will most likely need to either drop the feature again or contribute this back to another sax parser once that happens, I totally understand that the time investment is not really a sustainable one.

Thank you for taking all the time you did and also for taking the time to respond here.

Just in case anybody want's to pick this up anyways, I decided to push your branch into the main repo as is: https://github.com/xmldom/xmldom/tree/internal-entity-declarations

So you don't have to keep this branch/repo of yours, but the current state of the changes is preserved and can be "forked".

niklasl · 2022-04-24T12:54:06Z

Thank you @karfau I think this is reasonable. I may have another go at this, targeting the new sax parser once it is in place (either by contributing to that, or, depending on its level of extensibility, by configuring it from within xmldom). Is it possible to note in #117 when the progress of integrating the new parser is stable enough for such an endeavour to begin? (I suppose it might take time if #55 is still on the drawing board.)

karfau · 2023-07-03T06:27:52Z

@niklasl @aj-stein-nist FYI we are currently working on a related PR: #498

Parse internal entity declarations in internal DTD

d949b51

niklasl mentioned this pull request Jan 22, 2022

To what degree should xmldom support DTD processing support for Entity/ EntityReference? #117

Closed

karfau added the spec:XML https://www.w3.org/TR/xml11/ label Jan 22, 2022

karfau reviewed Jan 22, 2022

View reviewed changes

lib/sax.js Outdated Show resolved Hide resolved

Stick to ES5 by consistently using var

57e7cf3

Co-authored-by: Christian Bewernitz <coder@karfau.de>

niklasl added 2 commits January 23, 2022 13:57

Use actual string length when sending characters

3ad7fba

It may have been altered by XML entity expansion.

Handle '>' in entity declaration values

d10f537

niklasl added 3 commits January 23, 2022 18:27

Add notes regarding processing the internal subset

1aa4055

These replace the TODO:s about adding a parseInternalDTD method (as that would require a more complex change).

Adjust not-wf tests to reflect DOCTYPE processing

49b358f

This reflects the now added somewhat lax internal DOCTYPE processing, which now yields documents even with certain non-well-formed DOCTYPE:s.

niklasl marked this pull request as ready for review January 23, 2022 18:09

karfau reviewed Jan 24, 2022

View reviewed changes

niklasl added 2 commits January 26, 2022 22:18

Keep existing internal entity and issue warning

311ffc3

Ignore unhandled forms of entity declarations

0d665a5

Make a more careful check for Parsed Entity (PEDef) forms to skip embedded markup.

karfau reviewed Jan 27, 2022

View reviewed changes

karfau mentioned this pull request Feb 15, 2022

feat: Add minimal Object.assign ponyfill #379

Merged

karfau added the awaiting response Maintainers are waiting for information label Mar 30, 2022

karfau closed this Apr 23, 2022

		"actual": "<!DOCTYPE doc><doc>A</doc>",
		"expected": "<doc>A</doc>",

Parse internal entity declarations in internal DTD #367

Parse internal entity declarations in internal DTD #367

Conversation

niklasl commented Jan 22, 2022

karfau commented Jan 22, 2022 • edited

karfau commented Jan 22, 2022

niklasl commented Jan 22, 2022

niklasl commented Jan 23, 2022

niklasl commented Jan 23, 2022

karfau commented Jan 24, 2022 • edited

karfau Jan 24, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

karfau Jan 27, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

karfau Feb 15, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Canonical XML

Choose a reason for hiding this comment

Choose a reason for hiding this comment

karfau Feb 13, 2022 • edited

Choose a reason for hiding this comment

karfau Feb 13, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

karfau commented Jan 27, 2022

aj-stein-nist commented Mar 29, 2022

karfau commented Mar 30, 2022 • edited

aj-stein-nist commented Mar 30, 2022

karfau commented Mar 31, 2022

niklasl commented Apr 23, 2022

karfau commented Apr 23, 2022

niklasl commented Apr 24, 2022

karfau commented Jul 3, 2023

karfau commented Jan 22, 2022 •

edited

karfau commented Jan 24, 2022 •

edited

karfau Jan 24, 2022 •

edited

karfau Jan 27, 2022 •

edited

karfau Feb 15, 2022 •

edited

karfau Feb 13, 2022 •

edited

karfau Feb 13, 2022 •

edited

karfau commented Mar 30, 2022 •

edited