Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Only use HTML rules if mimeType matches #338

Merged
merged 25 commits into from Oct 8, 2022

Conversation

karfau
Copy link
Member

@karfau karfau commented Oct 24, 2021

In the living specs for parsing XML and HTML, that this library is trying to implement,
there is a distinction between the different types of documents being parsed:
There are quite some rules that are different for parsing, constructing and serializing XML vs HTML documents.

So far xmldom was always "detecting" whether "the HTML rules should be applied" by looking at the current namespace. So from the first time an the HTML default namespace (http://www.w3.org/1999/xhtml) was found, every node was treated as being part of an HTML document. This misconception is the root cause for quite some reported bugs.

BREAKING CHANGE: HTML rules are no longer applied just because of the namespace, but require the mimeType argument passed to DOMParser.parseFromString(source, mimeType) to match 'text/html'. Doing so implies all rules for handling casing for tag and attribute names when parsing, creation of nodes and searching nodes.

BREAKING CHANGE: Correct the return type of DOMParser.parseFromString to Document | undefined. In case of parsing errors it was always possible that "the returned Document" has not been created. In case you are using Typescript you now need to handle those cases.

BREAKING CHANGE: The instance property DOMParser.options is no longer available, instead use the individual readonly property per option (assign, domHandler, errorHandler, normalizeLineEndings, locator, xmlns). Those also provides the default value if the option was not passed. The 'locator' option is now just a boolean (default remains true).

BREAKING CHANGE: The following methods no longer allow a (non spec compliant) boolean argument to toggle "HTML rules":

  • XMLSerializer.serializeToString
  • Node.toString
  • Document.toString

The following interfaces have been implemented:
DOMImplementation now implements all methods defined in the DOM spec, but not all of the behavior is implemented (see docstring):

  • createDocument creates an "XML Document" (prototype: Document, property type is 'xml')
  • createHTMLDocument creates an "HTML Document" (type/prototype: Document, property type is 'html').
    • when no argument is passed or the first argument is a string, the basic nodes for an HTML structure are created, as specified
    • when the first argument is false no child nodes are created

Document now has two new readonly properties as specified in the DOM spec:

  • contentType which is the mime-type that was used to create the document
  • type which is either the string literal 'xml' or 'html'

MIME_TYPE (/lib/conventions.js):

  • hasDefaultHTMLNamespace test if the provided string is one of the miem types that implies the default HTML namespace: text/html or application/xhtml+xml

@karfau karfau added this to the next breaking/minor release milestone Dec 23, 2021
karfau added a commit to karfau/xmldom that referenced this pull request Feb 15, 2022
since we can not rely on it being present in all supported runtimes.
Even though the interface is the same as `Object.assign`,
it behaves slightly differently from the one provided by browsers.

This was extracted from xmldom#338 to support development in xmldom#367
karfau added a commit that referenced this pull request Feb 15, 2022
since we can not rely on it being present in all supported runtimes.
Even though the interface is the same as `Object.assign`,
it behaves slightly differently from the one provided by browsers (see tests).

This was extracted from #338 to support development in #367
@karfau karfau changed the title WIP: html document feat: Only parse HTML if mime type matches Feb 16, 2022
@karfau karfau changed the title feat: Only parse HTML if mime type matches fix: Only use HTML rules if mimeType matches Feb 16, 2022
@karfau karfau marked this pull request as ready for review February 16, 2022 19:46
@SmartLayer
Copy link

i'm very eager to test this when merged and a pre-release becomes available

@karfau
Copy link
Member Author

karfau commented Feb 24, 2022

@weiwu-zhang Thank you for the feedback.
I still have some WIP locally that I need to cover with tests, before this can land.
Sadly I didn't have time for this subject in the last week, but I will be back on it soon.
Of course no time line promises.

- always set `Document.type` and `Document.contentType`
- `Document.createElement` properly HTML casing and (X)HTML namespacing

https://dom.spec.whatwg.org/#dom-domimplementation-createhtmldocument
https://dom.spec.whatwg.org/#dom-document-createelement
when `mimeType` is `text/html`.
The `mimeType` can now optionally be passed to the `DOMHandler` constructor.
Documented `DOMHandler` constructor and all properties.

- For XML documents the XHTML and SVG mime types are preserved as expected.
- `Document.documentURI` is no longer initialized with the undocumented `Locator.systemId` value.
- Deprecate `DOMParserOptions.domBuilder` since state would be preserved between calls to `DOMParser.parseFromString` which can have unexpected side effects, especially since we are now using the `DOMHandler` to manage the mimeType and defaultNamespace.
to be able to copy from options provided to `DOMParser`
Instead of accessing `this.options` in `DOMParser.parseToString`,
the default values are now applied in the constructor.
Since the locator passed to `options` is no longer being modified,
the type of the option was changed to boolean.
There is no change in behavior in this commit,
since truthy and falsy values are accepted as well.
Instead of accessing `this.options` in `DOMParser.parseToString`,
the default values are now applied in the constructor.
Instead of accessing `this.options` in `DOMParser.parseToString`,
use `this.errorHandler`.
which points to a class instead of an instance and is only meant for internal testing.

BREAKING CHANGE: If you used to configure `DOMParserOptions.domBuilder`.
You might be able to configure the `domHandler` instead, but should be avoided.
This is only there for testing purposes.
All options are now taken care of by the constructor and are available as individual properties.
Most are marked as `readonly`, some are `private`.

BREAKING CHANGE: If you used `DOMParser.options` after creating an instance.
You can still read the individual properties from the instance,
but there is no longer a way to mutate them, so you need to really pass the required options when constructing them.
and drop warning for boolean attributes in HTML
BREAKING CHANGE: The following methods no longer allow a (non spec compliant) boolean argument to toggle "HTML rules":
- `XMLSerializer.serializeToString`
- `Node.toString`
- `Document.toString`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants