Semantics of registered_domain property for private domains #138

Open
tuler opened this issue Sep 19, 2017 · 5 comments
Labels
icebox: needs clarification OP, please clarify (or someone else who desires the change)

Comments

@tuler

tuler commented Sep 19, 2017

Consider the following URL: tuler.github.io
github.io is a private domain in the PSL.

When parsed with include_psl_private_domains=True we get subdomain='', domain=tuler, suffix=github.io.

The registered_domain property just joins domain and suffix, giving me tuler.github.io, but IMHO it still should be github.io, as this is the domain registered with the registrar, and can be found in a whois query.
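To make the two readings concrete, here is a minimal sketch with a toy suffix list. `toy_extract`, `ToyResult`, and the two suffix sets are hypothetical stand-ins, not tldextract's API or the real PSL (wildcard and exception rules are not modeled). Only the join of domain and suffix mirrors the behavior described above.

```python
from typing import NamedTuple

# Toy stand-ins for the two sections of the Public Suffix List (assumption).
ICANN_SUFFIXES = {"io", "com"}
PRIVATE_SUFFIXES = {"github.io"}


class ToyResult(NamedTuple):
    subdomain: str
    domain: str
    suffix: str

    @property
    def registered_domain(self) -> str:
        # Join domain and suffix, as the issue describes.
        if self.domain and self.suffix:
            return f"{self.domain}.{self.suffix}"
        return ""


def toy_extract(host: str, include_private: bool = False) -> ToyResult:
    suffixes = set(ICANN_SUFFIXES)
    if include_private:
        suffixes |= PRIVATE_SUFFIXES
    labels = host.lower().split(".")
    # Scan from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return ToyResult(subdomain, domain, candidate)
    return ToyResult("", host, "")


with_private = toy_extract("tuler.github.io", include_private=True)
without_private = toy_extract("tuler.github.io", include_private=False)
print(with_private.registered_domain)     # tuler.github.io
print(without_private.registered_domain)  # github.io
```

The same property yields two different answers for the same host depending on the flag, which is the ambiguity raised here.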

One obstacle to implementing this is that when a URL is parsed, we can't tell whether the matched suffix is a private domain or an ICANN domain, because that distinction is not kept internally when the PSL is read.
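For illustration, the published list separates the two groups with `===BEGIN ... DOMAINS===` comment markers, so the distinction could be retained while reading it. The sketch below is an assumption about how that might look, not tldextract's actual parser; `read_psl_sections` and `PSL_SAMPLE` are made-up names, and the sample is a tiny hand-written excerpt.

```python
from typing import Dict

# Hand-written excerpt in the layout of the Public Suffix List file.
PSL_SAMPLE = """\
// ===BEGIN ICANN DOMAINS===
io
com
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
// GitHub, Inc.
github.io
// ===END PRIVATE DOMAINS===
"""


def read_psl_sections(text: str) -> Dict[str, bool]:
    """Map each suffix rule to True if it sits in the private section."""
    is_private_by_rule: Dict[str, bool] = {}
    in_private = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("// ===BEGIN PRIVATE DOMAINS"):
            in_private = True
        elif line.startswith("// ===END PRIVATE DOMAINS"):
            in_private = False
        if not line or line.startswith("//"):
            continue  # skip blanks and comments, including the markers
        rule = line.split()[0]  # a rule ends at the first whitespace
        is_private_by_rule[rule] = in_private
    return is_private_by_rule


rules = read_psl_sections(PSL_SAMPLE)
print(rules["github.io"])  # True
print(rules["io"])         # False
```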

Any thoughts?

@john-kurkowski
Owner

(Note to self, if we need to track public vs. private at runtime, #66 is a requirement.)

@john-kurkowski
Owner

Yeah, I bet most will associate it with registrar registration, as you have.

In my mind, tldextract has been consistent, working as designed, via a more abstract interpretation of "registered." Excluding private domains, GitHub registered github.io with a registrar, who controlled the domain. Including private domains, GitHub user tuler "registered" tuler.github.io with GitHub, who controlled the domain.

I have no strong evidence whether my interpretation is broadly useful. It served a very specific case when I originally wrote this lib. Or maybe both interpretations are useful.

@tuler
Author

tuler commented Sep 25, 2017

I see your point.

Nonetheless, keeping runtime information about each PSL entry would let the application handle this appropriately. Something like an is_private method, or an is_private flag added to the ExtractResult.

@john-kurkowski
Owner

Yes, at the very least we should do #66 and expose is_private.

I'd then consider a new registered_domain-like property that stays constant whether or not private domains are included. It just needs a new name.
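A rough sketch of what such a pair of properties might look like, under stated assumptions: `FlaggedResult`, `extract_flagged`, and the name `registrar_domain` are all hypothetical, the suffix sets are toy data, and `is_private` is modeled as a plain fourth field purely for brevity.

```python
from typing import List, NamedTuple, Set

# Toy suffix data (assumption); the real PSL is much larger.
ICANN: Set[str] = {"io", "com"}
PRIVATE: Set[str] = {"github.io"}


def suffix_start(labels: List[str], suffixes: Set[str]) -> int:
    """Index where the longest known suffix starts, or len(labels) if none."""
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return i
    return len(labels)


class FlaggedResult(NamedTuple):
    subdomain: str
    domain: str
    suffix: str
    is_private: bool

    @property
    def registered_domain(self) -> str:
        # Today's meaning: depends on whether a private suffix matched.
        if self.domain and self.suffix:
            return f"{self.domain}.{self.suffix}"
        return ""

    @property
    def registrar_domain(self) -> str:
        # Hypothetical name: always resolved against ICANN suffixes only.
        labels = [p for part in self[:3] for p in part.split(".") if p]
        i = suffix_start(labels, ICANN)
        if i == 0 or i == len(labels):
            return ""
        return ".".join(labels[i - 1:])


def extract_flagged(host: str) -> FlaggedResult:
    labels = host.lower().split(".")
    i = suffix_start(labels, ICANN | PRIVATE)
    suffix = ".".join(labels[i:])
    return FlaggedResult(
        subdomain=".".join(labels[: max(i - 1, 0)]),
        domain=labels[i - 1] if i > 0 else "",
        suffix=suffix,
        is_private=suffix in PRIVATE,
    )


r = extract_flagged("tuler.github.io")
print(r.is_private)         # True
print(r.registered_domain)  # tuler.github.io
print(r.registrar_domain)   # github.io
```

For a host under a plain ICANN suffix, such as www.example.com, both properties in this sketch return example.com.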

Renaming today's registered_domain is also a possibility, but then we're burdened with backwards compat and legacy association with today's wording.

@john-kurkowski
Owner

The PR for #66 currently tracks the source of an extraction, whether the official public suffix list, the private domains in the public suffix list, or user-provided extra suffixes. We haven't figured out how to expose that yet. It's tricky, since ExtractResult is a namedtuple.
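One plausible reading of the namedtuple difficulty is that a new field changes the tuple's length, so existing three-way unpacking would stop working. The sketch below shows that effect and one possible workaround, a subclass that carries the flag as an ordinary attribute. `Three`, `Four`, `WithSource`, and `unpacks_as_three` are hypothetical names, and nothing here is taken from the PR.

```python
from collections import namedtuple

Three = namedtuple("Three", ["subdomain", "domain", "suffix"])
Four = namedtuple("Four", ["subdomain", "domain", "suffix", "is_private"])


class WithSource(Three):
    # Deliberately no __slots__, so instances get a __dict__ for extras.
    is_private = False  # class-level fallback if the attribute is never set

    def __new__(cls, subdomain, domain, suffix, is_private=False):
        self = super().__new__(cls, subdomain, domain, suffix)
        self.is_private = is_private  # stored outside the tuple itself
        return self


def unpacks_as_three(result) -> bool:
    """True if code written against a 3-tuple still works with `result`."""
    try:
        subdomain, domain, suffix = result
    except ValueError:
        return False
    return True


print(unpacks_as_three(Four("", "tuler", "github.io", True)))        # False
print(unpacks_as_three(WithSource("", "tuler", "github.io", True)))  # True
```

The workaround has costs of its own: the flag takes no part in tuple equality or hashing, so two results that differ only in `is_private` compare equal.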


3 participants