fqdn() to always include suffix if private suffix enabled and private suffix exists #300

elliotwutingfeng · 2023-07-12T04:43:01Z

Closes #178

Changes

If private suffixes are enabled and there is a private suffix, fqdn() will always return the suffix + domain and/or subdomain if they exist as well.
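A minimal sketch of the rule described above, using a stand-in class rather than the library's own code. `SketchResult` and its field layout are assumptions made for illustration; only the names `fqdn` and `is_private` come from this thread.

```python
from typing import NamedTuple


class SketchResult(NamedTuple):
    """Hypothetical stand-in for ExtractResult; not the library's class."""

    subdomain: str
    domain: str
    suffix: str
    is_private: bool = False

    @property
    def fqdn(self) -> str:
        # A suffix is required. Beyond that, either a domain is present,
        # or the suffix itself is a private suffix.
        if self.suffix and (self.domain or self.is_private):
            parts = (self.subdomain, self.domain, self.suffix)
            return ".".join(part for part in parts if part)
        return ""


private_only = SketchResult("", "", "blogspot.com", is_private=True)
public_only = SketchResult("", "", "gov.au")
full_name = SketchResult("www", "example", "com")

print(repr(private_only.fqdn))
print(repr(public_only.fqdn))
print(repr(full_name.fqdn))
```

Under this sketch, a bare private suffix yields a non-empty FQDN, while a bare public suffix still yields an empty one.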

Misc

This can be a step towards fixing Semantics of registered_domain property for private domains #138. Would need a consensus on whether registered_domain should strictly refer to domain + public suffix regardless of whether private suffixes are enabled.
Is there a better way to expose is_private than putting it in the existing NamedTuple?
The CLI parser behaviour remains unchanged; it only prints out the str attributes of the NamedTuple.

john-kurkowski · 2023-07-15T22:18:27Z

Is the point of this PR mainly to fix the .fqdn property in #178? Is there a way to fix the property without a breaking change? I don't like that extracting a public suffix now gives a different result than extracting a private suffix. I like treating public and private (when enabled) as similarly as possible.

# entire URL is a public suffix
>>> tldextract.extract("gov.au", include_psl_private_domains=True)
ExtractResult(subdomain='', domain='', suffix='gov.au') # not domain='gov', suffix='au'?

# entire URL is a private suffix
>>> tldextract.extract("blogspot.com", include_psl_private_domains=True)
ExtractResult(subdomain='', domain='blogspot', suffix='com')

Also, I don't know how useful it is to check if an entire URL was a suffix, but you could check that before this PR. After this PR, I don't think you can.

>>> result = tldextract.extract("blogspot.com", include_psl_private_domains=True)
>>> is_suffix = result.suffix and not (result.subdomain or result.domain)

elliotwutingfeng · 2023-07-16T00:33:07Z

> Is there a way to fix the property without a breaking change?

This is a breaking change, so perhaps the new behaviour should sit behind an opt-in flag (not yet implemented).

tldextract.extract("blogspot.com", include_psl_private_domains=False, opt_in_flag_name=False)

If the caller doesn't set this flag to True, the old behaviour applies. If you find this approach feasible, what would be an appropriate name to replace opt_in_flag_name?

john-kurkowski · 2023-07-16T01:28:56Z

What desirable new behavior would the flag let users opt into? Is the flag just a bugfix for #178? If so, breaking changes, new configuration options, and something the caller has to do all seem like a chainsaw for what calls for a knife. I'm wary of maintaining 2 different string-splitting behaviors, or of adding more options to the already too many needed to construct a TLDExtract.

elliotwutingfeng · 2023-07-16T01:43:50Z

Yes, this flag is just a bugfix for #178. If an opt-in flag is used, none of the existing code out there should be affected, and hence it would not be a breaking change.

However, I do agree that this complicates the string splitting algorithm, which would have to handle 2 different cases. ~~If it is too much, I think #178 (which is currently still open) should be closed as "not planned".~~

After re-reading the issue, I realised your solution of using the is_private property is simpler and there won't be a need to opt in, but it would change the output of FQDN for private domains.

john-kurkowski · 2023-07-17T21:11:30Z

> After re-reading the issue, I realised your solution of using the is_private property is simpler and there won't be a need to opt in, but it would change the output of FQDN for private domains.

You got it. That approach is what I'm interested in. #178 is a legit bug, so I'll keep it open for now. The FQDN of private domains was never intended to be blank, so changing that output is desirable.

john-kurkowski · 2023-08-24T23:57:37Z

This PR changed a lot since my last comment! I think it's much closer to the solution I was seeking in my last comment. Nice!

My only remaining concern is the semi-breaking change of adding another field to this project's class ExtractResult(NamedTuple). This project's docs previously advocated iterating through the fields. With this PR, anybody still iterating through the namedtuple is going to build some weird strings or, more likely, raise a TypeError, unless they add a guard such as the isinstance check against str in this PR's updated docs. I noted a similar compatibility issue on the abandoned #273.
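To make the compatibility concern concrete, here is a small self-contained sketch. `SketchResult` is a hypothetical stand-in with a fourth, non-string field; it only mimics the shape discussed here and is not the library's class.

```python
from typing import NamedTuple


class SketchResult(NamedTuple):
    """Hypothetical stand-in: three string fields plus one bool field."""

    subdomain: str
    domain: str
    suffix: str
    is_private: bool


result = SketchResult("forums", "bbc", "co.uk", False)

# Joining every field worked while all fields were strings.
# With a bool in the tuple, str.join rejects the non-string item.
try:
    ".".join(result)
    naive_join_ok = True
except TypeError:
    naive_join_ok = False

# Guarded iteration: keep only non-empty string fields.
guarded = ".".join(part for part in result if isinstance(part, str) and part)

print(naive_join_ok)
print(guarded)
```

The guard keeps old iteration-style code working, at the cost of every caller having to know about it.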

I wonder if there's a better way? Do we add an attribute to the namedtuple, instead of extending its fields? Do we bump the breaking version number of the project? Do we move away from namedtuple?

I don't want to make this PR's fix go on much longer. Just noodling what compatibility I'm comfortable with. 🤔

john-kurkowski · 2023-09-07T01:22:40Z

> Do we add an attribute to the namedtuple, instead of extending its fields?

I've played around with this, and it doesn't feel viable. __slots__ and __new__ cannot be overridden on a namedtuple.

> Make … an attribute of ExtractResult but not a member of the tuple, like urllib.parse.urlsplit does

I still haven't investigated what the stdlib is doing differently in e.g. urllib.parse.urlsplit's example. I'm not sure it's worth the hoop jumping for this project.
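For reference, a sketch of the idea being quoted. In the standard library, `urlsplit` returns a 5-item tuple whose `hostname` and `port` are properties computed from the tuple's fields, not tuple members. The second half is one possible way to carry extra state on a subclass of a `collections.namedtuple` class; `SketchResult` and `_Fields` are hypothetical names, and this is not what the project ended up doing.

```python
from collections import namedtuple
from urllib.parse import urlsplit

# Standard library: 5 tuple members, plus derived attributes.
parts = urlsplit("https://example.com:8080/x")
print(len(parts), parts.hostname, parts.port)

# Hypothetical sketch: a subclass without __slots__ gets an instance
# __dict__, so it can hold an attribute that is not a tuple member.
_Fields = namedtuple("_Fields", ["subdomain", "domain", "suffix"])


class SketchResult(_Fields):
    def __new__(cls, subdomain, domain, suffix, is_private=False):
        self = super().__new__(cls, subdomain, domain, suffix)
        self.is_private = is_private  # stored on the instance, not in the tuple
        return self


result = SketchResult("", "", "blogspot.com", is_private=True)
print(len(result), result.is_private)
print(".".join(part for part in result if part))
```

One caveat with this sketch: helpers that build new instances without going through `__new__`, such as `_replace`, would not carry the extra attribute over, so it is a trade-off rather than a drop-in fix.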

> Do we bump the breaking version number of the project?

Doing this for every field added to the namedtuple seems like a lot. I'd just steer people away from iterating over all fields, unless they really know what they're doing.

All this to say, I'm leaning toward merging this as is, releasing, and seeing how niche the compatibility issues are.

banagale · 2023-10-06T22:06:12Z

> My only remaining concern is the semi-breaking change of adding another field to this project's class ExtractResult(NamedTuple). This project's docs previously advocated iterating through the fields. With this PR, anybody still iterating through the namedtuple is going to build some weird strings or more likely raise a TypeError. That is, without guards e.g. isinstance(str) in this PR's updated docs. I noted a similar compatibility issue on the abandoned #273.

I had some code using tldextract.extract(url)[1:] and hit this TypeError! Came across it despite not having test coverage on this portion of the product.

Glad this package is getting improved, and I am okay with the breaking change.

I don't know if I would have caught it, but if this had kicked the project up to 4.0.0 (to help highlight the breaking change), I might have checked the release notes more carefully. I realize not everyone uses that convention, and that this change may not be deeply breaking enough to warrant it. But throwing it out there nonetheless.

As it was, I just updated to site_str: str = '.'.join(tldextract.extract(url)[1:3]) and got back to it.

Thanks for any and all efforts to maintain and further extend this great package.

john-kurkowski · 2023-10-11T08:51:33Z

@banagale thank you for weighing in! Per #305, 3.6.0 is yanked and republished as 4.0.0. Your tuple slicing code will work in 3.x and 4.0.0.

However, I've also published 5.0.0, which moves away from a tuple return type entirely. In 5.0.0, you will need to directly reference the fields you're interested in.

ext = tldextract.extract(url)
'.'.join((ext.domain, ext.suffix))
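A short sketch of why slicing stops working once the result is a dataclass rather than a tuple. `SketchResult` is a hypothetical stand-in defined here for illustration, not the library's class, and `frozen=True` is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchResult:
    """Hypothetical stand-in for a dataclass-based result."""

    subdomain: str
    domain: str
    suffix: str
    is_private: bool = False


ext = SketchResult("www", "example", "co.uk")

# Named access works regardless of how many fields exist.
site = ".".join((ext.domain, ext.suffix))

# A plain dataclass defines no __getitem__, so tuple-style slicing fails.
try:
    ext[1:3]
    sliceable = True
except TypeError:
    sliceable = False

print(site, sliceable)
```

Named access also stays stable if more fields are added later, which is the compatibility problem the tuple form ran into.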

banagale · 2023-10-11T20:30:55Z

> @banagale thank you for weighing in! Per #305, 3.6.0 is yanked and republished as 4.0.0. Your tuple slicing code will work in 3.x and 4.0.0.
>
> However, I've also published 5.0.0, which moves away from a tuple return type entirely. In 5.0.0, you will need to directly reference the fields you're interested in.
>
> ext = tldextract.extract(url)
> '.'.join((ext.domain, ext.suffix))

Cool. Makes good sense.

https://build.opensuse.org/request/show/1119465 by user mia + anag+factory

- Update to 5.0.1:
  Bugfixes:
  * Indicate MD5 not used in a security context (FIPS compliance) #gh/john-kurkowski/tldextract#309
  Misc.:
  * Increase typecheck aggression
- Changes in 5.0.0:
  Breaking Changes:
  * Migrate `ExtractResult` from `namedtuple` to `dataclass` #gh/john-kurkowski/tldextract#306
  Bugfixes:
  * Drop support for EOL Python 3.7
- Changes in 4.0.0:
  Breaking Bugfixes:
  * Always include suffix if private suffix enabled and private suffix exists #gh/john-kurkowski/tldextract#300
- Changes in 3.5.0:
  Features:
  * Support IPv6 addresses #gh/john-kurkowski/tldextract#298
  Bugfixes:
  * Accept only 4 decimal octet IPv4 addresses #gh/john-kurkowski/tldextract#292
  * Support IPv4 addresses with unicode dots
  * Reject IPv4 addresses with trailing whitespace

elliotwutingfeng added 3 commits July 12, 2023 12:05

Parse previous known suffix if entire url is private suffix

4810264

Add test case for nested private suffix

9eac143

Optimize edge case where there is only one suffix within private suffix.

cd3d2cb

elliotwutingfeng added 2 commits August 1, 2023 14:54

Fix wrong fqdn for private suffixes.

1f02945

Merge branch 'master' into private

c577983

elliotwutingfeng changed the title ~~Accept next largest suffix if entire URL is private suffix~~ fqdn() to always include suffix if private suffix enabled and private suffix exists Aug 1, 2023

john-kurkowski added 3 commits August 24, 2023 16:59

Merge remote-tracking branch 'origin/master' into private

a0fd1ad

fixup! Fix wrong fqdn for private suffixes.

6b96753

Futureproof ExtractResult iteration

a059397

john-kurkowski merged commit 789f6ef into john-kurkowski:master Sep 13, 2023
1 check passed

elliotwutingfeng deleted the private branch September 14, 2023 16:22

This was referenced Sep 20, 2023

Version 3.6.0 introduced a breaking change without bumping to 4.0.0 #305

Closed

Migrate ExtractResult from namedtuple to dataclass #306

Merged
