Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: allowed_elements in BleachSanitizerFilter is different from html5lib_shim.SanitizerFilter #649

Closed
facundochaud-eb opened this issue Mar 10, 2022 · 7 comments · Fixed by #693
Labels
untriaged Bug reports that haven't been triaged
Milestone

Comments

@facundochaud-eb
Copy link

facundochaud-eb commented Mar 10, 2022

Describe the bug

I was trying to replace html5lib.filters.sanitizer.Filter with bleach.sanitizer.BleachSanitizerFilter (which inherits from the earlier one). We used that filter by just doing Filter(stream). And we find that we're getting all the tags alphabetized, which we don't really want.

I found that to know if something needs to be alphabetized, there's this check:

if token["name"] in self.allowed_elements:

That is only set in the html5lib.filters.sanitizer.Filter __init__ function by sending an allowed_elements. It also has some default ones. But the default value for that is a set of tuples (namespace, token name), not some iterable of token names:
https://github.com/html5lib/html5lib-python/blob/f87487a4ada2d6cf223bdd182774a01ba3c84618/html5lib/filters/sanitizer.py#L31-L32

And that is used like this:
https://github.com/html5lib/html5lib-python/blob/f87487a4ada2d6cf223bdd182774a01ba3c84618/html5lib/filters/sanitizer.py#L808

            if ((namespace, name) in self.allowed_elements or
                (namespace is None and
                 (namespaces["html"], name) in self.allowed_elements)):

I see that allowed_elements is overriden in the Cleaner.clean(text) method, and a list of tags is provided.

allowed_elements=self.tags,

So that is consistent with the usage of the BleachSanitizerFilter sanitize_token method, but that is not a consistent format with the parent.

python and bleach versions (please complete the following information):

  • Python Version: 3.7
  • Bleach Version: 2.3.1

Expected behavior

I propose one of these 3 options:

  • BleachSanitizerFilter should use allowed_elements that are compatible with html5lib ones, which would require a lot of refactoring and it's likely to not be backwards compatible
  • BleachSanitizerFilter should force clients to send a new compatible set of allowed_elements by adding it as a required argument, as it's not really optional to not send it because it's not compatible with the usage
    class BleachSanitizerFilter(html5lib_shim.SanitizerFilter):
        """html5lib Filter that sanitizes text
    
        This filter can be used anywhere html5lib filters can be used.
    
        """
        def __init__(
            self,
            source,
            allowed_elements,
            attributes=ALLOWED_ATTRIBUTES,
            strip_disallowed_elements=False,
            strip_html_comments=True,
            **kwargs,
        ):
  • BleachSanitizerFilter should accept both formats of allowed_elements by doing something like this:
    def _is_token_allowed(token):
        name = token["name"]
        namespace = token.get("namespace")
        first_allowed_element = next(iter(self.allowed_elements))
    
        if isinstance(first_allowed_element, tuple):
            return (
                (namespace, name) in self.allowed_elements
                or (
                    namespace is None and
                    and (namespaces["html"], name) in self.allowed_elements
                )
            )
        return name in self.allowed_elements
@facundochaud-eb facundochaud-eb added the untriaged Bug reports that haven't been triaged label Mar 10, 2022
@willkg
Copy link
Member

willkg commented Mar 10, 2022

Is this still a problem with the latest version of Bleach (4.1.0)? If so, can you provide a test case that illustrates what the issue is?

@facundochaud-eb
Copy link
Author

Yes, it's still a problem with the latest version. The pieces of code I shared here have not changed in those versions.

You could reproduce it like this:

>>> from bleach.sanitizer import BleachSanitizerFilter
>>> from html5lib.filters.sanitizer import Filter as HtmlSanitizerFilter
>>> stream = [{'type': 'Characters', 'data': 'title'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'type': 'StartTag', 'name': 'p', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': OrderedDict([((None, 'onclick'), "window.location.href='http://www.evil.com'")])}, {'type': 'Characters', 'data': 'Evil'}, {'type': 'EndTag', 'name': 'p', 'namespace': 'http://www.w3.org/1999/xhtml'}]
>>> [foo for foo in BleachSanitizerFilter(stream)]
[{'data': 'title', 'type': 'Characters'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'data': '<p onclick="window.location.href=\'http://www.evil.com\'">Evil</p>', 'type': 'Characters'}]
>>> [foo for foo in HtmlSanitizerFilter(stream)]
[{'type': 'Characters', 'data': 'title'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'type': 'Characters', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': '<p onclick="window.location.href=\'http://www.evil.com\'">'}, {'type': 'Characters', 'data': 'Evil'}, {'type': 'Characters', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': '</p>'}]

Our previous usage was:

from html5lib import serializer
from html5lib.filters import sanitizer


class HTMLSerializer(serializer.HTMLSerializer):
    # Without this, the html5lib serializer will drop certain ending tags if
    # they are deemed optional: <p>hi</p> -> <p>hi
    omit_optional_tags = False


def _render_stream(stream, sanitize=False):
    # Render a parsed and possibly filtered stream into a string
    s = HTMLSerializer()
    if sanitize:
        stream = sanitizer.Filter(stream)
    return s.render(stream)

That ended up returning title <p>Evil</p>, but if I replace the sanitizer filter by bleach's one I get title &lt;p onclick="window.location.href='http://www.evil.com'"&gt;Evil&lt;/p&gt;

@willkg
Copy link
Member

willkg commented Mar 10, 2022

The BleachSanitizerFilter needs to work on a stream built by BleachHTMLTokenizer. What happens when you use that to parse whatever it is you're parsing?

@facundochaud-eb
Copy link
Author

The stream I was using was created from an html5lib.HTMLParser. I replaced it with a BleachHTMLParser which uses BleachHTMLTokenizer. Got the same results

@willkg
Copy link
Member

willkg commented Mar 10, 2022

You mention alphabetized attributes. We landed a fix for #566. I think that fixes that issue for you, too.

Beyond that, this is a pretty detailed issue report and it's hard for me to wrap my head around what you're trying to report. I skimmed it, but that's clearly not sufficient. I won't have time to really pour over this any time soon.

@facundochaud-eb
Copy link
Author

Hmm main TLDR thing of this issue would be that self.allowed_elements is incompatible between bleach and html5lib, and I propose 3 solutions for that. I can open a PR for the last proposal if you like, but I don't know if that would be the go-to option for you

@facundochaud-eb
Copy link
Author

There's no hurry in fixing this, we can just keep using old versions, I just wanted to get rid of the deprecation warnings 😅

@willkg willkg added this to the 5.0.2 (tentative) milestone Oct 27, 2022
willkg added a commit that referenced this issue Jan 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
untriaged Bug reports that haven't been triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants