Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DecodedURL.to_uri is inconsistent with DecodedURL.normalize, .child, etc #144

Open
glyph opened this issue Dec 28, 2020 · 18 comments
Open

Comments

@glyph
Copy link
Collaborator

glyph commented Dec 28, 2020

I was pretty surprised to discover just now that DecodedURL.to_uri returns a URL object, rather than a DecodedURL. Given that almost all the other methods wrap the passed-through object, why does this one not?

I'm not sure how we get out of the compatibility jam if we decide this is wrong, but it seems wrong to me.

@mahmoud
Copy link
Member

mahmoud commented Jan 7, 2021

Yeah the wrapped surface area is not the clearest. to_uri, to_iri, and to_text are all passthroughs. to_text makes immediate sense to keep as a passthrough, and it's the one I use most. I'm less sure about the other two.

I'm curious, what were you planning on doing with the URL once you got it? The docstring of to_uri says

This is useful to do in preparation for sending a :class:URL over a network protocol.

But why one wouldn't go straight to to_text is less clear to me.

@wsanchez
Copy link
Contributor

wsanchez commented Jan 9, 2021

But why one wouldn't go straight to to_text is less clear to me.

Perhaps that's an argument to deprecate to_uri (and to_iri) altogether?

It would obliquely dodge the compatibility jam by removing the API instead of changing it in a way that's hard to reconcile.

@glyph
Copy link
Collaborator Author

glyph commented Jan 12, 2021

@mahmoud to_text converts your URI to text, for reading by a human (or embedding in a web page or a tweet or whatever), in its exact current form.

to_uri converts your URL to an ascii-only representation, which can be used to synthesize the bytes required for an HTTP request, or for storage in a system which is going to make HTTP requests with a client that may not be as sophisticated as a browser, or as Hyperlink itself, or for anywhere you need something non-unicode-aware.

It converts to a URL object rather than text or bytes directly because you still might want to manipulate it in that state, or to inspect its differences.

I like what @wsanchez is saying here though, in the sense that we should probably have better names for these. At the time, neck-deep in protocol specifications, asURI() was the intuitively obvious choice here, but in the harsh light of the following day's second system syndrome, it's clear that this nomenclature is obscure.

I am struggling to come up with something even marginally succinct, and I hate these names, but just as a starting point for discussion: as_ascii_only_for_machines and as_readable_text_for_humans ?

@glyph
Copy link
Collaborator Author

glyph commented Jan 12, 2021

I'm curious, what were you planning on doing with the URL once you got it? The docstring of to_uri says

This is useful to do in preparation for sending a :class:URL over a network protocol.

But why one wouldn't go straight to to_text is less clear to me.

It's true that I think this is a low-urgency issue precisely because 99.9% of uses of to_uri will be followed by an immediate to_text. (But that also implies that maybe we can get away with a compatibility exception.)

@mahmoud
Copy link
Member

mahmoud commented Jan 12, 2021

Interesting. Well, my first instinct is to add an as_ascii or ascii flag to to_text and then talk about the use cases in the docs a bit. 2c, etc.

@wsanchez
Copy link
Contributor

I was going to suggest as_text and as_ascii for @glyph's long names.

@wsanchez
Copy link
Contributor

And still deprecate as_uri. You can get the EncodedURL back out if you need that, but it seems we think you probably don't.

@mahmoud
Copy link
Member

mahmoud commented Jan 13, 2021

I like that, too.

@glyph
Copy link
Collaborator Author

glyph commented Jan 14, 2021

Just wanted to point out that we should not have both "as_text" and "to_text" :). And we need more than a text-based manipulation here, the thing the methods do (give you a URL) is useful.

This is an interesting bikeshed, though. Another possible color: .asciify() / .readable() ?

@glyph
Copy link
Collaborator Author

glyph commented Jan 14, 2021

Or ... .internationalize() ?

@glyph
Copy link
Collaborator Author

glyph commented Jan 14, 2021

.for_display() / .for_wire()

@glyph
Copy link
Collaborator Author

glyph commented Jan 14, 2021

Apparently the RFC calls them .ToASCII() and .ToUnicode() and maybe we should just follow suit: https://tools.ietf.org/html/rfc3490#section-4.1

@glyph
Copy link
Collaborator Author

glyph commented Jan 14, 2021

(okay okay I'm actually done for now)

@twm
Copy link
Collaborator

twm commented Jan 16, 2021

The RFC names make more sense than the status quo to me

@twm twm closed this as completed Jan 16, 2021
@twm
Copy link
Collaborator

twm commented Jan 16, 2021

Ooops, sorry about that. Tab order strikes again.

@twm twm reopened this Jan 16, 2021
@wsanchez
Copy link
Contributor

@glyph confused the hell out of me above. Oh, my bad. No, I don't was to add as_text; I was mis-remembering the name of to_text.

I mean to propose that we keep to_text as-is, which is the equivalent of the RFC's toUnicode, and add to_ascii for the equivalent of toASCII, and call it a day. Twisted-flavored names could be toText and toASCII, if we are still bothering with such things, which I would stop doing because it feels like busy-work.

(Which is to say that I think "text" is a fine synonym for "Unicode".)

@glyph
Copy link
Collaborator Author

glyph commented Jan 20, 2021

@glyph confused the hell out of me above. Oh, my bad. No, I don't was to add as_text; I was mis-remembering the name of to_text.

Ah. Whew.

I mean to propose that we keep to_text as-is, which is the equivalent of the RFC's toUnicode, and add to_ascii for the equivalent of toASCII, and call it a day.

I'm fairly certain that the equivalent of the RFC's ToUnicode is .as_iri(), and ToAscii is .as_uri(). Converting to/from actual python strings doesn't ever do character conversion.

Twisted-flavored names could be toText and toASCII, if we are still bothering with such things, which I would stop doing because it feels like busy-work.

I think the idea of the twisted-style names was for compatibility only. I would not suggest adding both on any new functionality.

(Which is to say that I think "text" is a fine synonym for "Unicode".)

Normally I'd agree but there are a number of confounding nuances here. One is that we aren't talking about literal Python types, but rather encoding notations, and also that "unicode" is specifically in contrast to "ascii" in this case... which is also text. Just different text.

@glyph
Copy link
Collaborator Author

glyph commented Jan 21, 2021

Re-reading this I realized I didn't quite answer @mahmoud's question above:

I'm curious, what were you planning on doing with the URL once you got it? The docstring of to_uri says

This is useful to do in preparation for sending a :class:URL over a network protocol.

But why one wouldn't go straight to to_text is less clear to me.

to_text doesn't actually work for what you would need in a protocol library. Consider the following.

When you issue a request for https://ايران.example.com/foo⇧bar/, your (HTTP/1.1, gosh I don't even know what these fancy new protocols look like) bytes for "Please retrieve this" look like:

GET /foo%E2%87%A7bar/
Host: xn--mgba3a4fra.example.com

This requires that the HTTP protocol implementation disassemble the URL into its constituent parts, then decode various different bits of it as ASCII; the hostname, the path, the query, and not the fragment. And of course, to issue a matching DNS query and TLS certificate validation. But only the .to_iri() version.

On the flip side, if the HTTP implementation wants to take those bytes and give you the answer to "what URL was requested", it has to do this same process in reverse. Now, arguably, the server should hand you a totally pristine URL object, as unmodified as possible from the wire; but you may want to read some decoded unicode stuff out of its constituent parts as well, so the .to_iri() (or ToUnicode) normalization needs to exist in the other direction.

There's a case to be made, I guess, that DecodedURL should always operate as if it's an IRI as you access its data... it will do this at least in part.

P.S.: While I was attempting to construct an example URI to comment on this issue, I happened across this web page, which causes a lot of problems with Hyperlink due to this bug. Hyperlink's behavior is arguably even worse than IDNA's, since we let you construct the "invalid" (not really invalid) URL, but then we don't let you convert it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants