normalisation of urls containing non-ascii domains is broken and loses data #23

wbolster · 2016-01-15T13:09:37Z

Initial parsing works:

>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment')
URIReference(scheme='http', authority='æåëý.com', path='/path', query='query', fragment='fragment')

Subsequent normalisation silently loses data:

>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment').normalize()
URIReference(scheme='http', authority=None, path='/path', query='query', fragment='fragment')

sigmavirus24 · 2016-01-16T02:33:51Z

Correct. We do not yet handle IRIs (RFC 3987).

wbolster · 2016-01-19T17:13:25Z

FWIW, a work-around that sort of "works": use the URL parsing routines from the urllib3 package to replace the host name with its IDNA-encoded (xn--…) equivalent before passing the URL to uri_reference().
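
A minimal sketch of that kind of preprocessing, for illustration only. It swaps in the standard library's urllib.parse and the built-in "idna" codec (IDNA 2003) instead of urllib3's parsing routines, and the helper name idna_encode_host is hypothetical, not part of rfc3986 or urllib3:

```python
from urllib.parse import urlsplit, urlunsplit


def idna_encode_host(url):
    """Return url with its host name replaced by the IDNA (xn--) form.

    Hypothetical helper: uses the stdlib "idna" codec, not urllib3.
    """
    parts = urlsplit(url)
    host = parts.hostname  # lower-cased host without userinfo or port
    if host is None:
        return url  # no authority component, nothing to encode

    # Non-ASCII labels become xn-- labels; ASCII labels pass through unchanged.
    ascii_host = host.encode("idna").decode("ascii")
    if ":" in ascii_host:
        # IPv6 literal: urlsplit strips the brackets, so restore them.
        ascii_host = f"[{ascii_host}]"

    # Rebuild the authority, preserving userinfo and port if present.
    netloc = ascii_host
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo = f"{userinfo}:{parts.password}"
        netloc = f"{userinfo}@{netloc}"

    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


# The result is pure ASCII and can then be handed to rfc3986.uri_reference().
ascii_url = idna_encode_host("http://æåëý.com/path?query#fragment")
```

Because the authority is ASCII after this step, normalisation should no longer have a non-ASCII host to drop. This only side-steps the missing IRI support; it does not address non-ASCII characters in the path, query, or fragment.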

sigmavirus24 modified the milestone: IRI Support May 16, 2017

sigmavirus24 mentioned this issue Feb 4, 2019

Use rfc3986.validator.Validator for parse_url urllib3/urllib3#1531

Merged

sethmlarson self-assigned this Feb 4, 2019
