wbr element shouldn't be balanced #488

Open
jcushman opened this issue Nov 27, 2019 · 3 comments

@jcushman

The <wbr> element is balanced by bleach.clean even though it is an empty element.

Using the list of empty tags from MDN:

In [6]: empty_elements = {
   ...:     'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr'
   ...: }

In [7]: html = "".join("<%s>" % s for s in empty_elements)

In [8]: import bleach

In [9]: bleach.clean(html, tags=empty_elements)
Out[9]: '<param><source><hr><base><track><area><wbr></wbr><br><img><keygen></keygen><link><input><meta><embed>'

The output includes <wbr></wbr> when it should just be <wbr> like the others. keygen has the same problem, but that's deprecated so I'm not sure if it's worth including.
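
To make the check above repeatable without eyeballing the string, here is a minimal sketch that lists which empty elements show up with an explicit end tag in a piece of serialized HTML. It only uses the standard library; the helper name `balanced_void_elements` is made up for illustration and is not part of bleach or html5lib, and the regex is a rough scan of end tags, not a real HTML parser.

```python
import re

# Empty (void) element names copied from the list in the report above.
EMPTY_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
}


def balanced_void_elements(html, names=EMPTY_ELEMENTS):
    """Return the sorted names from `names` that appear with an end tag in `html`.

    Hypothetical helper for illustration only; it scans for `</name>` with a
    regex instead of parsing the markup.
    """
    end_tags = re.findall(r"</([a-zA-Z][a-zA-Z0-9]*)\s*>", html)
    return sorted({tag.lower() for tag in end_tags} & set(names))


# Output string copied verbatim from Out[9] above.
cleaned = (
    "<param><source><hr><base><track><area><wbr></wbr><br><img>"
    "<keygen></keygen><link><input><meta><embed>"
)
print(balanced_void_elements(cleaned))  # ['keygen', 'wbr']
```

Passing the result of `bleach.clean(html, tags=empty_elements)` instead of the copied string should give the same kind of report for whatever version is installed.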

@g-k
Collaborator

g-k commented Dec 4, 2019

Hmm, yeah, I can reproduce this. wbr is listed as a self-closing tag here:

(("area", "br", "embed", "img", "keygen", "wbr"),
self.startTagVoidFormatting),

and should have:

token["selfClosingAcknowledged"] = True

but I get

{'type': 'StartTag', 'name': 'wbr', 'namespace': None, 'data': OrderedDict()}
{'type': 'EndTag', 'name': 'wbr', 'namespace': None}

at https://github.com/mozilla/bleach/blob/master/bleach/sanitizer.py#L271 so I'm thinking one of these things might be going on:

  • html5lib doesn't correctly parse it as a self-closing tag (but I didn't see an upstream issue)
  • tagOpenState or another method in html5lib_shim.py leaves the parser in a bad state, so the tag isn't recognized as self-closing
  • the tags arg doesn't pass the tag through as a self-closing tag

but I'll need to find more time to look into it further.

@g-k g-k added the clean label Sep 16, 2020
@g-k
Collaborator

g-k commented Sep 16, 2020

OK this is a bug in html5lib (v1.1 at least):

» python
Python 3.8.2 (default, Mar 26 2020, 12:39:19)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bleach._vendor.html5lib as html5lib
>>> html5lib.__version__
'1.1'
>>> html5lib.serialize(html5lib.parseFragment('<area>')) # this is correct
'<area>'
>>> html5lib.serialize(html5lib.parseFragment('<wbr>')) # should be <wbr>
'<wbr></wbr>'
>>> html5lib.serialize(html5lib.parseFragment('<keygen>')) # HTML 5.2 deprecates the tag
'<keygen></keygen>'
>>> html5lib.serialize(html5lib.parseFragment('<menuitem>')) # https://github.com/html5lib/html5lib-python/issues/203 mentions this but https://developer.mozilla.org/en-US/docs/Web/HTML/Element/menuitem shows non-void examples and says HTML 5.2 deprecates it
'<menuitem></menuitem>'

The upstream issue is html5lib/html5lib-python#203.
The upstream PR for wbr is html5lib/html5lib-python#395.

Not sure what html5lib's position on deprecated elements is.
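
For readers following along, here is a simplified model of why a missing entry in a void-element list would produce the `<wbr></wbr>` output in the session above. This is an assumption about the shape of the problem, not html5lib's actual code: the names `VOID_ELEMENTS`, `walk`, and `serialize` are invented for this sketch, and the set is a small illustrative subset with "wbr" left out on purpose. The StartTag/EndTag pair it emits for wbr matches the tokens quoted in the earlier comment.

```python
# Toy model only: NOT html5lib's implementation.
# Illustrative subset of void elements; "wbr" is deliberately absent
# to mimic the behavior reported for html5lib 1.1.
VOID_ELEMENTS = frozenset({"area", "br", "hr", "img"})


def walk(names, void_elements=VOID_ELEMENTS):
    """Yield tokens for a flat list of childless elements."""
    for name in names:
        if name in void_elements:
            # Known void element: a single token, no end tag.
            yield {"type": "EmptyTag", "name": name}
        else:
            # Anything else is treated as a normal element and gets balanced.
            yield {"type": "StartTag", "name": name}
            yield {"type": "EndTag", "name": name}


def serialize(tokens):
    """Turn the token stream back into markup."""
    out = []
    for token in tokens:
        if token["type"] == "EndTag":
            out.append("</%s>" % token["name"])
        else:
            out.append("<%s>" % token["name"])
    return "".join(out)


print(serialize(walk(["area", "wbr"])))                           # <area><wbr></wbr>
print(serialize(walk(["area", "wbr"], VOID_ELEMENTS | {"wbr"})))  # <area><wbr>
```

In this model, adding the name to the set is enough to stop the element from being balanced.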

g-k pushed a commit that referenced this issue Sep 16, 2020
@g-k g-k added the html5lib label Jan 25, 2021
@ambv

ambv commented Mar 2, 2023

This is now addressed in html5lib:
html5lib/html5lib-python#395
