fix: Only use HTML rules if mimeType matches #338

karfau · 2021-10-24T13:21:37Z

The living specs for parsing XML and HTML, which this library is trying to implement, distinguish between the types of documents being parsed:
quite a few rules differ between XML and HTML documents when it comes to parsing, constructing and serializing them.

So far xmldom was always "detecting" whether "the HTML rules should be applied" by looking at the current namespace. So from the first time the HTML default namespace (http://www.w3.org/1999/xhtml) was found, every node was treated as being part of an HTML document. This misconception is the root cause of quite a few reported bugs.

BREAKING CHANGE: HTML rules are no longer applied just because of the namespace, but require the mimeType argument passed to DOMParser.parseFromString(source, mimeType) to match 'text/html'. Doing so applies all rules for handling the casing of tag and attribute names when parsing, creating nodes and searching nodes.
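
A minimal sketch of the difference between the old and the new decision, assuming nothing about xmldom's internals: the helper names below are hypothetical and only illustrate that the namespace no longer switches on HTML rules, while the mimeType does. Tag-name casing is used as one example of a rule that depends on the decision.

```javascript
'use strict';

var XHTML_NAMESPACE = 'http://www.w3.org/1999/xhtml';

// Old behavior as described above (hypothetical helper):
// finding the XHTML namespace was enough to switch on HTML rules.
function usedHtmlRulesBefore(namespaceURI) {
  return namespaceURI === XHTML_NAMESPACE;
}

// New behavior (hypothetical helper): only the mimeType decides.
function usesHtmlRulesNow(mimeType) {
  return mimeType === 'text/html';
}

// One rule that depends on the decision: tag names are lowercased
// under HTML rules and kept as written under XML rules.
function normalizeTagName(name, mimeType) {
  return usesHtmlRulesNow(mimeType) ? name.toLowerCase() : name;
}
```

With this sketch, an XHTML document parsed as 'application/xhtml+xml' keeps its tag names untouched even though its default namespace is the XHTML one.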

BREAKING CHANGE: Correct the return type of DOMParser.parseFromString to Document | undefined. In case of parsing errors it was always possible that "the returned Document" had not been created. If you are using TypeScript, you now need to handle those cases.
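
One way to handle the `undefined` case is a small guard around the call. This is only a sketch: `requireDocument` is a hypothetical helper, not something xmldom provides.

```javascript
'use strict';

// Hypothetical guard: returns the document, or throws when none was created.
function requireDocument(doc, source) {
  if (doc === undefined) {
    throw new Error('Parsing failed, no document was created for: ' + source);
  }
  return doc;
}

// Usage, assuming `parser` is a DOMParser instance and `xml` a string:
// var doc = requireDocument(parser.parseFromString(xml, 'text/xml'), xml);
```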

BREAKING CHANGE: The instance property DOMParser.options is no longer available; instead, use the individual readonly property per option (assign, domHandler, errorHandler, normalizeLineEndings, locator, xmlns). These also provide the default value if the option was not passed. The 'locator' option is now just a boolean (default remains true).
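
The pattern described here can be sketched as follows. `ParserLike` is a hypothetical stand-in, and the use of `Object.defineProperty` and `Object.freeze` is an assumption made for the illustration, not a description of how xmldom implements it.

```javascript
'use strict';

// Hypothetical stand-in for DOMParser; it only illustrates the option handling.
function ParserLike(options) {
  options = options || {};

  // `locator` is a plain boolean that defaults to true;
  // truthy and falsy values are coerced.
  var locator = options.locator === undefined ? true : Boolean(options.locator);

  // Copy `xmlns` so that later changes to the passed object do not leak in.
  var xmlns = {};
  Object.keys(options.xmlns || {}).forEach(function (prefix) {
    xmlns[prefix] = options.xmlns[prefix];
  });

  // Properties defined this way are not writable by default.
  Object.defineProperty(this, 'locator', { value: locator, enumerable: true });
  Object.defineProperty(this, 'xmlns', { value: Object.freeze(xmlns), enumerable: true });
}

var passed = { locator: 0, xmlns: { '': 'urn:example' } };
var parser = new ParserLike(passed);
passed.xmlns[''] = 'urn:changed'; // does not affect parser.xmlns
```

Options have to be passed at construction time, since the instance properties cannot be reassigned afterwards.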

BREAKING CHANGE: The following methods no longer allow a (non spec compliant) boolean argument to toggle "HTML rules":

- XMLSerializer.serializeToString
- Node.toString
- Document.toString

The following interfaces have been implemented:
DOMImplementation now implements all methods defined in the DOM spec, but not all of the behavior is implemented (see docstring):

- createDocument creates an "XML Document" (prototype: Document, property type is 'xml')
- createHTMLDocument creates an "HTML Document" (type/prototype: Document, property type is 'html').
  - when no argument is passed or the first argument is a string, the basic nodes for an HTML structure are created, as specified
  - when the first argument is false, no child nodes are created

Document now has two new readonly properties as specified in the DOM spec:

- contentType, which is the mime-type that was used to create the document
- type, which is either the string literal 'xml' or 'html'
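
As a rough illustration of how the two properties relate to the mimeType, assuming that only 'text/html' produces an HTML document (as stated in the first breaking change): `documentProperties` is a hypothetical helper, not part of the library.

```javascript
'use strict';

// Hypothetical helper: derives the two read-only properties from a mimeType.
function documentProperties(mimeType) {
  return Object.freeze({
    // the mime-type is kept as passed
    contentType: mimeType,
    // only 'text/html' yields an HTML document in this sketch
    type: mimeType === 'text/html' ? 'html' : 'xml',
  });
}
```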

MIME_TYPE (/lib/conventions.js):

hasDefaultHTMLNamespace tests whether the provided string is one of the mime types that imply the default HTML namespace: text/html or application/xhtml+xml
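
A minimal sketch of such a check, based only on the description above. The key names of the `MIME_TYPE` object are assumptions made for the example.

```javascript
'use strict';

// Key names are assumed for this sketch.
var MIME_TYPE = {
  HTML: 'text/html',
  XML_XHTML_APPLICATION: 'application/xhtml+xml',
};

// True for the two mime types that imply the default HTML namespace.
function hasDefaultHTMLNamespace(mimeType) {
  return mimeType === MIME_TYPE.HTML || mimeType === MIME_TYPE.XML_XHTML_APPLICATION;
}
```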

since we can not rely on it being present in all supported runtimes. Even though the interface is the same as `Object.assign`, it behaves slightly differently from the one provided by browsers (see tests). This was extracted from #338 to support development in #367
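
For readers unfamiliar with the term, a ponyfill is a standalone function that offers the functionality without patching the global object. The following is a sketch of what a minimal one could look like, not the code from #379. Restricting it to a single source and throwing for non-object targets are assumptions, chosen as examples of how such a helper may differ from the native `Object.assign`.

```javascript
'use strict';

// Sketch of a minimal ponyfill: copies own enumerable properties
// from exactly one source onto the target and returns the target.
function assign(target, source) {
  if (target === null || typeof target !== 'object') {
    throw new TypeError('target is not an object');
  }
  for (var key in source) {
    if (Object.prototype.hasOwnProperty.call(source, key)) {
      target[key] = source[key];
    }
  }
  return target;
}

// Copying options without touching the object that was passed in:
var defaults = { locator: true };
var copy = assign(assign({}, defaults), { locator: false });
```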

SmartLayer · 2022-02-24T17:50:52Z

I'm very eager to test this when it is merged and a pre-release becomes available.

karfau · 2022-02-24T19:04:28Z

@weiwu-zhang Thank you for the feedback.
I still have some WIP locally that I need to cover with tests, before this can land.
Sadly I didn't have time for this subject in the last week, but I will be back on it soon.
Of course, no timeline promises.

- always set `Document.type` and `Document.contentType`
- `Document.createElement` properly handles HTML casing and (X)HTML namespacing

https://dom.spec.whatwg.org/#dom-domimplementation-createhtmldocument
https://dom.spec.whatwg.org/#dom-document-createelement

when `mimeType` is `text/html`. The `mimeType` can now optionally be passed to the `DOMHandler` constructor. Documented `DOMHandler` constructor and all properties.
- For XML documents the XHTML and SVG mime types are preserved as expected.
- `Document.documentURI` is no longer initialized with the undocumented `Locator.systemId` value.
- Deprecate `DOMParserOptions.domBuilder` since state would be preserved between calls to `DOMParser.parseFromString` which can have unexpected side effects, especially since we are now using the `DOMHandler` to manage the mimeType and defaultNamespace.

Instead of accessing `this.options` in `DOMParser.parseFromString`, the default values are now applied in the constructor. Since the locator passed to `options` is no longer being modified, the type of the option was changed to boolean. There is no change in behavior in this commit, since truthy and falsy values are accepted as well.

which points to a class instead of an instance and is only meant for internal testing.

BREAKING CHANGE: If you used to configure `DOMParserOptions.domBuilder`, you might be able to configure the `domHandler` instead, but this should be avoided, since it is only there for testing purposes.

All options are now taken care of by the constructor and are available as individual properties. Most are marked as `readonly`, some are `private`.

BREAKING CHANGE: If you used `DOMParser.options` after creating an instance, you can still read the individual properties from the instance, but there is no longer a way to mutate them, so you need to pass the required options when constructing the instance.

karfau added this to the next breaking/minor release milestone Dec 23, 2021

karfau mentioned this pull request Jan 22, 2022

Parse internal entity declarations in internal DTD #367

Closed

4 tasks

karfau mentioned this pull request Feb 15, 2022

feat: Add minimal Object.assign ponyfill #379

Merged

karfau changed the title ~~WIP: html document~~ feat: Only parse HTML if mime type matches Feb 16, 2022

karfau changed the title ~~feat: Only parse HTML if mime type matches~~ fix: Only use HTML rules if mimeType matches Feb 16, 2022

karfau force-pushed the 203-html-document branch from 65c8368 to 04c9f26 Compare February 16, 2022 19:44

karfau marked this pull request as ready for review February 16, 2022 19:46

karfau mentioned this pull request Feb 16, 2022

test: Add executable examples for node and typescript #317

Merged

karfau force-pushed the 203-html-document branch from 9072016 to 69ddd52 Compare February 16, 2022 21:01

karfau mentioned this pull request Feb 16, 2022

& parsing wrong when xmlns points to XHTML #203

Closed

karfau mentioned this pull request Feb 24, 2022

chore: Bump xmldom to 0.8.0 bbyars/mountebank#660

Merged

karfau force-pushed the 203-html-document branch 2 times, most recently from 973a6da to 52acd24 Compare February 28, 2022 04:16

karfau added 13 commits February 28, 2022 05:17

feat: Use minimal Object.assign ponyfill

a6b72cf

to be able to copy from options provided to `DOMParser`

fix: Prevent DOMParserOptions locator and xmlns from being mutated

fb752f0

refactor: Copy DOMParserOptions normalizeLineEndings to instance

175aa0a

Instead of accessing `this.options` in `DOMParser.parseFromString`, the default values are now applied in the constructor.

refactor: Copy DOMParserOptions errorHandler to instance

6579add

Instead of accessing `this.options` in `DOMParser.parseFromString`, use `this.errorHandler`.

feat: Correctly handle all case modifications

f305328

in HTML docs or namespaces

test: Remove redundant test case after rebase

980734f

style: Improve wording and drop some whitespace

380851d

fix(sax): Only apply HTML rules if mimeType is present

aae7654

karfau added 5 commits February 28, 2022 05:17

fix: Add 'use strict' to lib/entities.js

0b032bb

style: Format code

8626514

fix: Add 'use strict' to all files

fef1d79

docs: Tweak doc comments

52acd24

feat(conventions): List HTML boolean attributes and void elements

5fd4e1b

https://html.spec.whatwg.org/#boolean-attributes
https://html.spec.whatwg.org/#attributes-3

karfau force-pushed the 203-html-document branch from 86cab47 to ceff927 Compare March 6, 2022 15:56

karfau added 5 commits March 6, 2022 16:57

fix(sax): Handle raw text elements in HTML

ceff927

and drop warning for boolean attributes in HTML
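
As a rough illustration of the kind of conventions these commits deal with: the lists below are partial excerpts from the HTML spec, and apart from `isHTMLRawTextElement`, which is named in one of the commits, the function names are assumptions made for this sketch.

```javascript
'use strict';

// Partial list of boolean attributes, for illustration only.
var BOOLEAN_ATTRIBUTES = ['async', 'checked', 'defer', 'disabled', 'hidden', 'readonly', 'required', 'selected'];
// Void elements never have content or an end tag.
var VOID_ELEMENTS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'source', 'track', 'wbr'];
// Raw text elements; the escapable ones (textarea, title) are deliberately excluded.
var RAW_TEXT_ELEMENTS = ['script', 'style'];

// HTML names are matched case-insensitively.
function includesIgnoringCase(list, name) {
  return list.indexOf(String(name).toLowerCase()) !== -1;
}

function isHTMLBooleanAttribute(name) {
  return includesIgnoringCase(BOOLEAN_ATTRIBUTES, name);
}

function isHTMLVoidElement(tagName) {
  return includesIgnoringCase(VOID_ELEMENTS, tagName);
}

function isHTMLRawTextElement(tagName) {
  return includesIgnoringCase(RAW_TEXT_ELEMENTS, tagName);
}
```

With a check like `isHTMLBooleanAttribute`, a parser can accept `<script defer>` in an HTML document without warning about a missing attribute value.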

fix(conventions): Restore ES5 compatibility

dc62bf5

style: Format test code

bbe7790

refactor: Exclude escapable from isHTMLRawTextElement

ae2c7da

fix(dom): Serialize according to document type

9b46871

BREAKING CHANGE: The following methods no longer allow a (non spec compliant) boolean argument to toggle "HTML rules":
- `XMLSerializer.serializeToString`
- `Node.toString`
- `Document.toString`

karfau force-pushed the 203-html-document branch from b1babe9 to 48f49be Compare March 6, 2022 20:42

karfau added 2 commits March 6, 2022 21:44

test(examples): Check for undefined before using document

48f49be

Merge remote-tracking branch 'upstream/master' into 203-html-document

1b88b30

karfau modified the milestones: next breaking/minor release, 0.9.0 Apr 5, 2022

karfau linked an issue Apr 5, 2022 that may be closed by this pull request

& parsing wrong when xmlns points to XHTML #203

Closed

karfau mentioned this pull request Jul 22, 2022

entity en/decoding different between parser and serializer #421

Closed

karfau merged commit 0f41739 into xmldom:master Oct 8, 2022

karfau deleted the 203-html-document branch October 8, 2022 23:31

karfau mentioned this pull request Oct 1, 2023

Avoid logging warning for missing value for some html attributes such as defer #160

Closed

karfau linked an issue Oct 1, 2023 that may be closed by this pull request

Avoid logging warning for missing value for some html attributes such as defer #160

Closed
