Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove optional tags for mathml that immediately precede the closing </math> tag #128

Open
universemaster opened this issue Jun 10, 2022 · 9 comments

Comments

@universemaster
Copy link

In mathml a sequence of closing tags that immediately precede the closing </math> element can be omitted. See w3c Mathml spec discussion.

Since mathml is so verbose, even omitting these perhaps 2-3 closing tags per <math> equation will help, particularly when writing inline such as Let $g$ be an element of $G$.

I suspect the existing (html) parser may allow some further tags to be omitted, but I haven't investigated further.

Chrome will soon support mathml - with an intent-to-ship to come in the next month or so.

Resources that may be useful are the mathml core spec and to create some mathml try temml.

@DanielRuf
Copy link
Contributor

DanielRuf commented Jun 10, 2022

Hi @universemaster,

Thanks for creating the issue. This sounds like a request to support MathML. Is this what you mean?

Please keep in mind, that this library is mainly for HTML and can be extended via the configuration to support additional tags in some optimizations.

@universemaster
Copy link
Author

Hi,

From here:
MathML uses (xml or) html parsing.

So, I (perhaps erroneously) thought html-minifier-terser could already parse all the tags necessary to remove closing tags that immediately precede the closing </math> tag?

Is that not correct?

@DanielRuf
Copy link
Contributor

As the name suggests html-minifier is only for html tag. There are some hardcoded lists:

https://github.com/terser/html-minifier-terser/blob/master/src/htmlparser.js#L70-L81

https://github.com/terser/html-minifier-terser/blob/master/src/htmlminifier.js#L397-L418

math appears at https://github.com/terser/html-minifier-terser/blob/master/src/htmlminifier.js#L65

MathML Specification contains different tags, which are probably not part of the normal HTML Specification.

@DanielRuf
Copy link
Contributor

DanielRuf commented Jun 10, 2022

I have read w3c/mathml#375 and the code example does not make much sense to me. XML and HTML normally only allow these variants (and so do parsers in most cases, with a few very specific exceptions):

<empty-element/>
<br>
<br/>
<element>value</element>
<element attribute="value/>

There are only very specific tags like <br> which allow the removal of /. But I don't see how this should work or be valid in a html and xml world: <elem>value. That makes not much sense from the logical perspective of a parser.

@DanielRuf
Copy link
Contributor

DanielRuf commented Jun 10, 2022

Regarding this:

Some criticize mathml for being too verbose and heavy increasing the network payload.

brotli can drastically reduce the size of xml and probably MathML (I did not test or verify if this would work with MathML).

I mainly agree with the opinions in the linked issue. This would break many parsers.

@DanielRuf
Copy link
Contributor

DanielRuf commented Jun 10, 2022

Here is the hardcoded list of html tags with optional end tags: https://github.com/terser/html-minifier-terser/blob/master/src/htmlminifier.js#L403

const optionalEndTags = new Set(['html', 'head', 'body', 'li', 'dt', 'dd', 'p', 'rb', 'rt', 'rtc', 'rp', 'optgroup', 'option', 'colgroup', 'caption', 'thead', 'tbody', 'tfoot', 'tr', 'td', 'th']);

@alexander-akait
Copy link
Collaborator

There are not optional tags for MathML and SVG namespaces, spec cleary say:

Any other end tag
Run these steps:

Initialize node to be the current node (the bottommost node of the stack).

If node's tag name, converted to ASCII lowercase, is not the same as the tag name of the token, then this is a parse error.

Loop: If node is the topmost element in the stack of open elements, then return. (fragment case)

If node's tag name, converted to ASCII lowercase, is the same as the tag name of the token, pop elements from the stack of open elements until node has been popped from the stack, and then return.

Set node to the previous entry in the stack of open elements.

If node is not an element in the HTML namespace, return to the step labeled loop.

Otherwise, process the token according to the rules given in the section corresponding to the current insertion mode in HTML content.

Ref: https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inforeign

Look at:

If node's tag name, converted to ASCII lowercase, is not the same as the tag name of the token, then this is a parse error.

The parser expects the end tag to follow the opening tag and they should have the same name, otherwise it is a parser error. Yes, HTML parser is working in recovery mode and will restore DOM in most of cases, but itself it is unsafe compression (and yes, there are some edge cases when DOM was broken, if you need I can provide).

I agree we should compress MathML in this package (for example - we can collapse whitespaces), because it is embeded MathML in HTML (we are html minfier so embed content should be compressed too, like we do with JS/CSS) and the same with SVG element in HTML (as minimum whitespaces too).

But SVG and MathML don't have omiting tag concept, it is XML in HTML with some limitation/abilities (depending on how you look at the problems 😄 ), so I don't think it is a good idea to omit end tag for SVG and MathML namespaces

@universemaster
Copy link
Author

universemaster commented Jun 13, 2022

Hi @alexander-akait

Thanks for your references for this.

(and yes, there are some edge cases when DOM was broken, if you need I can provide)

With regard to the specific case from the title of this issue - closing tags that immediately precede the closing math </math> tag - I'd love to see examples where the DOM was not recovered in the expected way.

I've been experimenting with the regex find and replace approach below on a massive 9Mb page (before compression) with 27,000 <math> elements - and haven't observed a broken DOM - I guess I was just lucky!?

It eliminates 0.5Mb (before compression).

However, I haven't yet done performance tests for parse time against network download times.

const RegexToSelectClosingTagsForDeletion =
  /(<\/[^>]*>)+(?=<\/(?:math)>)/g;

const target = src
  .replaceAll(RegexToSelectClosingTagsForDeletion, '')

But, it seems from your references that this approach will cause parse errors.

I do appreciate your opinion not to add unsafe minifications to this library that causes parse errors (even if the DOM is identical).

We are html minifier so embed content should be compressed too

Agree, thanks for that.

(Edited post to remove mtd, mtr from code snippet. This part needs more work - this bit does silently hide parts of math tables).

@alexander-akait
Copy link
Collaborator

@universemaster

With regard to the specific case from the title of this issue - closing tags that immediately precede the closing math tag - I'd love to see examples where the DOM was not recovered in the expected way.

Can you provide your source code before compression?

There are two unsafe cases I see (but they can have more):

  1. Recovery with HTML namespace inside MathML namespace, i.e. <math><mi><p><mi></math> (if you add closing tags you will see another DOM)
  2. Don't forget about display: inline | block, when you remove end tag, characters after end tag collpase to text inside node (it can be some visible problems)

Shortly - SVG and MathML is XML syntax inside HTML with recoverable logic from HTML (because HTML parser is recovarable 😄). This parser error is not fatal, so maybe it is safe for you, but note - many robots/bots/etc which parse your page and allow to show your work/site/application/information in search systems can have more strict behaviour and stop parsing context after this error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants