Remove optional tags for mathml that immediately precede the closing `</math>` tag #128

universemaster · 2022-06-10T16:40:33Z

In mathml a sequence of closing tags that immediately precede the closing </math> element can be omitted. See w3c Mathml spec discussion.

Since mathml is so verbose, even omitting these perhaps 2-3 closing tags per <math> equation will help, particularly when writing inline such as Let $g$ be an element of $G$.

I suspect the existing (html) parser may allow some further tags to be omitted, but I haven't investigated further.

Chrome will soon support mathml - with an intent-to-ship to come in the next month or so.

Resources that may be useful are the mathml core spec and to create some mathml try temml.

The text was updated successfully, but these errors were encountered:

DanielRuf · 2022-06-10T16:57:02Z

Hi @universemaster,

Thanks for creating the issue. This sounds like a request to support MathML. Is this what you mean?

Please keep in mind, that this library is mainly for HTML and can be extended via the configuration to support additional tags in some optimizations.

universemaster · 2022-06-10T17:18:18Z

Hi,

From here:
MathML uses (xml or) html parsing.

So, I (perhaps erroneously) thought html-minifier-terser could already parse all the tags necessary to remove closing tags that immediately precede the closing </math> tag?

Is that not correct?

DanielRuf · 2022-06-10T17:51:19Z

As the name suggests html-minifier is only for html tag. There are some hardcoded lists:

https://github.com/terser/html-minifier-terser/blob/master/src/htmlparser.js#L70-L81

https://github.com/terser/html-minifier-terser/blob/master/src/htmlminifier.js#L397-L418

math appears at https://github.com/terser/html-minifier-terser/blob/master/src/htmlminifier.js#L65

MathML Specification contains different tags, which are probably not part of the normal HTML Specification.

DanielRuf · 2022-06-10T17:57:24Z

I have read w3c/mathml#375 and the code example does not make much sense to me. XML and HTML normally only allow these variants (and so do parsers in most cases, with a few very specific exceptions):

<empty-element/>
<br>
<br/>
<element>value</element>
<element attribute="value/>

There are only very specific tags like <br> which allow the removal of /. But I don't see how this should work or be valid in a html and xml world: <elem>value. That makes not much sense from the logical perspective of a parser.

DanielRuf · 2022-06-10T18:00:12Z

Regarding this:

Some criticize mathml for being too verbose and heavy increasing the network payload.

brotli can drastically reduce the size of xml and probably MathML (I did not test or verify if this would work with MathML).

I mainly agree with the opinions in the linked issue. This would break many parsers.

DanielRuf · 2022-06-10T18:06:28Z

Here is the hardcoded list of html tags with optional end tags: https://github.com/terser/html-minifier-terser/blob/master/src/htmlminifier.js#L403

const optionalEndTags = new Set(['html', 'head', 'body', 'li', 'dt', 'dd', 'p', 'rb', 'rt', 'rtc', 'rp', 'optgroup', 'option', 'colgroup', 'caption', 'thead', 'tbody', 'tfoot', 'tr', 'td', 'th']);

alexander-akait · 2022-06-13T00:16:27Z

There are not optional tags for MathML and SVG namespaces, spec cleary say:

Any other end tag
Run these steps:

Initialize node to be the current node (the bottommost node of the stack).

If node's tag name, converted to ASCII lowercase, is not the same as the tag name of the token, then this is a parse error.

Loop: If node is the topmost element in the stack of open elements, then return. (fragment case)

If node's tag name, converted to ASCII lowercase, is the same as the tag name of the token, pop elements from the stack of open elements until node has been popped from the stack, and then return.

Set node to the previous entry in the stack of open elements.

If node is not an element in the HTML namespace, return to the step labeled loop.

Otherwise, process the token according to the rules given in the section corresponding to the current insertion mode in HTML content.

Ref: https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inforeign

Look at:

If node's tag name, converted to ASCII lowercase, is not the same as the tag name of the token, then this is a parse error.

The parser expects the end tag to follow the opening tag and they should have the same name, otherwise it is a parser error. Yes, HTML parser is working in recovery mode and will restore DOM in most of cases, but itself it is unsafe compression (and yes, there are some edge cases when DOM was broken, if you need I can provide).

I agree we should compress MathML in this package (for example - we can collapse whitespaces), because it is embeded MathML in HTML (we are html minfier so embed content should be compressed too, like we do with JS/CSS) and the same with SVG element in HTML (as minimum whitespaces too).

But SVG and MathML don't have omiting tag concept, it is XML in HTML with some limitation/abilities (depending on how you look at the problems 😄 ), so I don't think it is a good idea to omit end tag for SVG and MathML namespaces

universemaster · 2022-06-13T16:33:33Z

Hi @alexander-akait

Thanks for your references for this.

(and yes, there are some edge cases when DOM was broken, if you need I can provide)

With regard to the specific case from the title of this issue - closing tags that immediately precede the closing math </math> tag - I'd love to see examples where the DOM was not recovered in the expected way.

I've been experimenting with the regex find and replace approach below on a massive 9Mb page (before compression) with 27,000 <math> elements - and haven't observed a broken DOM - I guess I was just lucky!?

It eliminates 0.5Mb (before compression).

However, I haven't yet done performance tests for parse time against network download times.

const RegexToSelectClosingTagsForDeletion =
  /(<\/[^>]*>)+(?=<\/(?:math)>)/g;

const target = src
  .replaceAll(RegexToSelectClosingTagsForDeletion, '')

But, it seems from your references that this approach will cause parse errors.

I do appreciate your opinion not to add unsafe minifications to this library that causes parse errors (even if the DOM is identical).

We are html minifier so embed content should be compressed too

Agree, thanks for that.

(Edited post to remove mtd, mtr from code snippet. This part needs more work - this bit does silently hide parts of math tables).

alexander-akait · 2022-06-13T16:53:54Z

@universemaster

With regard to the specific case from the title of this issue - closing tags that immediately precede the closing math tag - I'd love to see examples where the DOM was not recovered in the expected way.

Can you provide your source code before compression?

There are two unsafe cases I see (but they can have more):

Recovery with HTML namespace inside MathML namespace, i.e. <math><mi><p><mi></math> (if you add closing tags you will see another DOM)
Don't forget about display: inline | block, when you remove end tag, characters after end tag collpase to text inside node (it can be some visible problems)

Shortly - SVG and MathML is XML syntax inside HTML with recoverable logic from HTML (because HTML parser is recovarable 😄). This parser error is not fatal, so maybe it is safe for you, but note - many robots/bots/etc which parse your page and allow to show your work/site/application/information in search systems can have more strict behaviour and stop parsing context after this error.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove optional tags for mathml that immediately precede the closing `</math>` tag #128

Remove optional tags for mathml that immediately precede the closing `</math>` tag #128

universemaster commented Jun 10, 2022

DanielRuf commented Jun 10, 2022 •

edited

universemaster commented Jun 10, 2022

DanielRuf commented Jun 10, 2022

DanielRuf commented Jun 10, 2022 •

edited

DanielRuf commented Jun 10, 2022 •

edited

DanielRuf commented Jun 10, 2022 •

edited

alexander-akait commented Jun 13, 2022

universemaster commented Jun 13, 2022 •

edited

alexander-akait commented Jun 13, 2022

Remove optional tags for mathml that immediately precede the closing </math> tag #128

Remove optional tags for mathml that immediately precede the closing </math> tag #128

Comments

universemaster commented Jun 10, 2022

DanielRuf commented Jun 10, 2022 • edited

universemaster commented Jun 10, 2022

DanielRuf commented Jun 10, 2022

DanielRuf commented Jun 10, 2022 • edited

DanielRuf commented Jun 10, 2022 • edited

DanielRuf commented Jun 10, 2022 • edited

alexander-akait commented Jun 13, 2022

universemaster commented Jun 13, 2022 • edited

alexander-akait commented Jun 13, 2022

Remove optional tags for mathml that immediately precede the closing `</math>` tag #128

Remove optional tags for mathml that immediately precede the closing `</math>` tag #128

DanielRuf commented Jun 10, 2022 •

edited

DanielRuf commented Jun 10, 2022 •

edited

DanielRuf commented Jun 10, 2022 •

edited

DanielRuf commented Jun 10, 2022 •

edited

universemaster commented Jun 13, 2022 •

edited