Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Relative url is removed when the allowed protocol is https #662

Closed
LocNguyen-KMS opened this issue May 9, 2022 · 5 comments · Fixed by #669
Closed

bug: Relative url is removed when the allowed protocol is https #662

LocNguyen-KMS opened this issue May 9, 2022 · 5 comments · Fixed by #669
Labels
Milestone

Comments

@LocNguyen-KMS
Copy link

LocNguyen-KMS commented May 9, 2022

Describe the bug

bleach.clean does not keep the relative url when the allowed protocol is https

Python and bleach versions

  • Python Version: 2.7.18
  • Bleach Version: 3.3.1

Steps to reproduce the behavior

from bleach.sanitizer import Cleaner
BleachCleaner = Cleaner(
    tags=['a', 'br'],
    attributes={'a': 'href'},
    styles=[],
    protocols=['https'],
    strip=True,
    strip_comments=True
)

description = 'create new study <a href="/path/to/study">Mental study</a>'
BleachCleaner.clean(description)

Expected behavior
In bleach version 2.0.0, the relative url is not removed from the result:
'create new study <a href="/path/to/study">Mental study</a>'

Actual behavior
In bleach version 3.3.1, the relative url is removed from the result:
'create new study <a>Mental study</a>'

Additional context
Actual result in bleach 3.3.1
Screen Shot 2022-05-09 at 11 14 27

Expected result in bleach 2.0.0
Screen Shot 2022-05-09 at 11 21 25

@LocNguyen-KMS LocNguyen-KMS added the untriaged Bug reports that haven't been triaged label May 9, 2022
@willkg
Copy link
Member

willkg commented May 10, 2022

I verified this is still true of Bleach 5.0.0:

> python
Python 3.8.10 (default, Jun 24 2021, 10:31:00) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bleach
>>> bleach.clean("<a href=\"/path/to/thing\">thing</a>", tags=["a"], attributes={"a": "href"})
'<a href="/path/to/thing">thing</a>'
>>> bleach.clean("<a href=\"/path/to/thing\">thing</a>", tags=["a"], attributes={"a": "href"}, protocols=["https"])
'<a>thing</a>'

@willkg willkg added this to the v5.0.1 milestone May 10, 2022
@willkg willkg added clean and removed untriaged Bug reports that haven't been triaged labels May 31, 2022
@willkg
Copy link
Member

willkg commented May 31, 2022

I think we should fix this. I can think of a couple of options:

  1. Always allow links with no protocol or maybe always allow links with no protocol if http or https is allowed. I can't think of cases where this would be bad, but Bleach has been around for ages and this is the first time I've seen this issue come up, so maybe I'm not imaginative.
  2. Add support for a protocol value that covers protocol-less links. Maybe call it "self" like in CSP? Specifying "self" as an allowed protocol would match protocol-less links. This would allow developers to specify whether or not to allow links that don't specify a protocol. We would add "self" to the list of ALLOWED_PROTOCOLS. I'm not thrilled about having a non-protocol in the protocol list, but I think it'd be ok for now.

I think I like option 2 more.

@g-k What do you think? Can you think of a scenario where option 2 does the wrong thing?

@willkg
Copy link
Member

willkg commented May 31, 2022

I'm skimming old issues and this is related to PR #565 and issue #611.

@g-k
Copy link
Collaborator

g-k commented Jun 2, 2022

It's reasonable for users to expect bleach to match browser and 2.x behavior, but it'd be good to provide users an option to opt out too.

So Option 2 seems preferable. We can update the default schemes later if we want to allow scheme-less URIs by default too.

Can you think of a scenario where option 2 does the wrong thing?

Maybe if there's another protocol registered to self? Nothing else is registered to self:// though, so that might be an option.

We should clarify self vs schema-less links like //thirdparty.com/foo.js too (empty str seems better there but still kind of wrong to use a falsy value).

There is this caveat in the URL spec that I think we should follow:

A URL cannot have a username/password/port if its host is null or the empty string, or its scheme is "file".

Checking bleach and python urlparse behavior against the wpt url tests would probably be good too.

Also, there's an older bug around rewriting the base scheme to javascript:. Bleach doesn't have access to the base URL, but maybe it should take it as an optional arg?

CSP's base-uri mitigates the risk of base overwriting too, so we should probably mention that stuff in the docs e.g. probably don't allow the base element with schema-less or relative links.

@willkg
Copy link
Member

willkg commented Jun 2, 2022

I should have read more of the code. We have this currently:

bleach/bleach/sanitizer.py

Lines 491 to 494 in ed06d4e

# If there's no protocol/scheme specified, then assume it's "http"
# and see if that's allowed
if "http" in allowed_protocols:
return value

I think we should just change that to:

if "http" in allowed_protocols or "https" in allowed_protocols:
    return value

Your comment has a bunch of other tasks:

The caveat in the url spec is interesting, but I don't think we need to do anything here unless we find out browsers are doing the wrong thing.

I'll write up a new issue to check the Bleach vendored urlparse against wpt url tests.

I'm not sure I understand the thought about base url as an optional arg. That sounds like it should be a new issue.

I'm not sure I understand the thought about CSP's base-uri, either, but that also sounds like it should be a new issue.

willkg added a commit that referenced this issue Jun 3, 2022
Previously, we allowed scheme-less urls if "http" was allowed. This
expands that to also support "https".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants