New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: allowed_elements in BleachSanitizerFilter is different from html5lib_shim.SanitizerFilter #649
Comments
Is this still a problem with the latest version of Bleach (4.1.0)? If so, can you provide a test case that illustrates what the issue is? |
Yes, it's still a problem with the latest version. The pieces of code I shared here have not changed in those versions. You could reproduce it like this: >>> from bleach.sanitizer import BleachSanitizerFilter
>>> from html5lib.filters.sanitizer import Filter as HtmlSanitizerFilter
>>> stream = [{'type': 'Characters', 'data': 'title'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'type': 'StartTag', 'name': 'p', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': OrderedDict([((None, 'onclick'), "window.location.href='http://www.evil.com'")])}, {'type': 'Characters', 'data': 'Evil'}, {'type': 'EndTag', 'name': 'p', 'namespace': 'http://www.w3.org/1999/xhtml'}]
>>> [foo for foo in BleachSanitizerFilter(stream)]
[{'data': 'title', 'type': 'Characters'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'data': '<p onclick="window.location.href=\'http://www.evil.com\'">Evil</p>', 'type': 'Characters'}]
>>> [foo for foo in HtmlSanitizerFilter(stream)]
[{'type': 'Characters', 'data': 'title'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'type': 'Characters', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': '<p onclick="window.location.href=\'http://www.evil.com\'">'}, {'type': 'Characters', 'data': 'Evil'}, {'type': 'Characters', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': '</p>'}] Our previous usage was: from html5lib import serializer
from html5lib.filters import sanitizer
class HTMLSerializer(serializer.HTMLSerializer):
# Without this, the html5lib serializer will drop certain ending tags if
# they are deemed optional: <p>hi</p> -> <p>hi
omit_optional_tags = False
def _render_stream(stream, sanitize=False):
# Render a parsed and possibly filtered stream into a string
s = HTMLSerializer()
if sanitize:
stream = sanitizer.Filter(stream)
return s.render(stream) That ended up returning |
The BleachSanitizerFilter needs to work on a stream built by BleachHTMLTokenizer. What happens when you use that to parse whatever it is you're parsing? |
The stream I was using was created from an |
You mention alphabetized attributes. We landed a fix for #566. I think that fixes that issue for you, too. Beyond that, this is a pretty detailed issue report and it's hard for me to wrap my head around what you're trying to report. I skimmed it, but that's clearly not sufficient. I won't have time to really pour over this any time soon. |
Hmm main TLDR thing of this issue would be that |
There's no hurry in fixing this, we can just keep using old versions, I just wanted to get rid of the deprecation warnings 😅 |
Describe the bug
I was trying to replace
html5lib.filters.sanitizer.Filter
withbleach.sanitizer.BleachSanitizerFilter
(which inherits from the earlier one). We used that filter by just doingFilter(stream)
. And we find that we're getting all the tags alphabetized, which we don't really want.I found that to know if something needs to be alphabetized, there's this check:
bleach/bleach/sanitizer.py
Line 354 in e4718bd
That is only set in the
html5lib.filters.sanitizer.Filter
__init__
function by sending anallowed_elements
. It also has some default ones. But the default value for that is a set of tuples(namespace, token name)
, not some iterable of token names:https://github.com/html5lib/html5lib-python/blob/f87487a4ada2d6cf223bdd182774a01ba3c84618/html5lib/filters/sanitizer.py#L31-L32
And that is used like this:
https://github.com/html5lib/html5lib-python/blob/f87487a4ada2d6cf223bdd182774a01ba3c84618/html5lib/filters/sanitizer.py#L808
I see that
allowed_elements
is overriden in theCleaner.clean(text)
method, and a list of tags is provided.bleach/bleach/sanitizer.py
Line 179 in e4718bd
So that is consistent with the usage of the
BleachSanitizerFilter
sanitize_token
method, but that is not a consistent format with the parent.python and bleach versions (please complete the following information):
Expected behavior
I propose one of these 3 options:
BleachSanitizerFilter
should useallowed_elements
that are compatible withhtml5lib
ones, which would require a lot of refactoring and it's likely to not be backwards compatibleBleachSanitizerFilter
should force clients to send a new compatible set ofallowed_elements
by adding it as a required argument, as it's not really optional to not send it because it's not compatible with the usageBleachSanitizerFilter
should accept both formats ofallowed_elements
by doing something like this:The text was updated successfully, but these errors were encountered: