To encode or not encode - best practices for "uncommon" uri characters, including whitespaces (%20)? #104

Open
busche opened this issue May 2, 2023 · 1 comment

busche commented May 2, 2023

Dear all,

I am currently struggling with a whitespace problem that I guess should not be that complicated, so I am probably missing something here.

MWE:

import rfc3986.builder
rfc3986.builder.URIBuilder.from_uri("scheme:").extend_path("path 1").extend_path("path2").geturl()
# output: 'scheme:/path 1/path2'
rfc3986.builder.URIBuilder.from_uri("scheme:").extend_path("path 1/path2").geturl()
# output: 'scheme:/path 1/path2'
rfc3986.builder.URIBuilder.from_uri("scheme:path 1").extend_path("path2").geturl()
# output: 'scheme:/path%201/path2'

So: if the whitespace is in the from_uri part, it gets escaped as %20, whereas if the whitespace is part of the argument to extend_path, it is used as is.

For context, I am storing URIs in a database. One component constructs them "from scratch" (containing whitespace ...), whereas another component passes them along in URL-encoded, conformant form.
I already figured out that from_uri treats maybe-URL-encoded strings as equivalent:

from_uri_a=rfc3986.builder.URIBuilder.from_uri("scheme:/path 1/path2").finalize()
from_uri_b=rfc3986.builder.URIBuilder.from_uri("scheme:/path%201/path2").finalize()
from_uri_a == from_uri_b
# is True

My main goal is to store the URIs in my database in a future-proof way. For my requirements it does not really matter whether I store the URLs encoded or not, but more generally I am unsure whether the current behavior is intended (i.e. a bug or a feature).
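For the storage question, one option is to bring every path into percent-encoded form before writing it, regardless of which component produced it. The following is only a sketch using the standard library, not part of rfc3986: `encode_path_once` and `_TOKEN` are hypothetical names, and it assumes that any well-formed `%XX` triplet in the input is already an escape. A raw string in which a literal `%41` is meant as data would be misread, so this only fits if that ambiguity is acceptable.

```python
import re
from urllib.parse import quote

# Matches either a well-formed percent-encoded triplet, or any single character
# outside the set this sketch leaves unencoded in a path
# (unreserved, sub-delims, ":", "@", and "/" as the segment separator).
_TOKEN = re.compile(r"%[0-9A-Fa-f]{2}|[^A-Za-z0-9\-._~!$&'()*+,;=:@/]")


def encode_path_once(path: str) -> str:
    """Percent-encode a path without double-encoding existing escapes."""

    def fix(match: re.Match) -> str:
        token = match.group(0)
        if len(token) == 3 and token[0] == "%":
            # Assumed to be an existing escape; only normalize the hex case.
            return token.upper()
        # A character that must not appear raw (space, stray "%", non-ASCII, ...).
        return quote(token, safe="")

    return _TOKEN.sub(fix, path)


print(encode_path_once("/path 1/path2"))    # /path%201/path2
print(encode_path_once("/path%201/path2"))  # /path%201/path2
```

Because the function is idempotent under the stated assumption, both components could run it before writing to the database and end up with the same stored string.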

From RFC 3986, sec. 2.4, I guess that encoding should take place in the extend_path method:

Under normal circumstances, the only time when octets within a URI
are percent-encoded is during the process of producing the URI from
its component parts. This is when an implementation determines which
of the reserved characters are to be used as subcomponent delimiters
and which can be safely used as data. Once produced, a URI is always
in its percent-encoded form.
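To make the quoted passage concrete, here is a rough sketch of encoding at the moment a path is assembled from its parts. It uses only `urllib.parse.quote` from the standard library and does not show how rfc3986 works internally or how a fix there would look; `SEGMENT_SAFE`, `encode_segment`, and `join_path` are hypothetical names, and each segment is assumed to be raw, not yet encoded text.

```python
from urllib.parse import quote

# Characters allowed unencoded in a path segment in addition to the unreserved
# set (letters, digits, "-", ".", "_", "~"): the sub-delims plus ":" and "@".
SEGMENT_SAFE = "!$&'()*+,;=:@"


def encode_segment(segment: str) -> str:
    """Percent-encode one raw path segment; "/" inside it is treated as data."""
    return quote(segment, safe=SEGMENT_SAFE)


def join_path(*segments: str) -> str:
    """Build an absolute path from raw segments, encoding each one."""
    return "/" + "/".join(encode_segment(s) for s in segments)


print(join_path("path 1", "path2"))  # /path%201/path2
```

Encoding per segment is the point where the caller decides that "/" is a delimiter between segments but data inside one, which is the decision the RFC text describes. As a workaround, already-encoded segments could be handed to extend_path, which by the behavior reported above appears to use its argument as is; that combination is untested here.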

Any thoughts on this?

sigmavirus24 (Collaborator) commented

As the person who wrote this, I feel like it's a bug. Thanks for the report. I'd be happy to review a fix if you'd like to send one.
