Get the port from the provided URL to extract function #272

Closed
cgr71ii opened this issue Sep 7, 2022 · 3 comments

Comments

@cgr71ii

cgr71ii commented Sep 7, 2022

Hi!

I was wondering whether it is possible to get the port from a URL when the extract function (or some other function) is invoked. I guess it is not, since I didn't see it in the documentation, and after digging a little in the code I didn't find anything related. I'm using this library to obtain URLs from a large list and then crawl them, so I need the port whenever it is defined. If it is not currently possible to obtain the port, is it intended to implement this functionality? Something like this would be ideal:

>>> tldextract.extract('http://127.0.0.1:8080/deployed/')
ExtractResult(subdomain='', domain='127.0.0.1', suffix='', port='8080')

Thank you!

@john-kurkowski
Owner

I took a stab at this in #273. I'm not sold on the solution as is. Feel free to chime in there. In the meantime, I suggest parsing the port with the standard library. Example:

import urllib.parse
import tldextract

split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
split_suffix = tldextract.extract(split_url.netloc)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"
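
For illustration, and assuming tldextract strips the :8080 port from the netloc before matching the suffix, the snippet above should give something like:

>>> url_to_crawl
'https://bar.com:8080'

Note that registered_domain is only the domain plus suffix, so the foo subdomain is dropped; split_url.hostname already holds the full host if you need it instead.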

@john-kurkowski
Owner

As of #274, the above workaround can be tweaked slightly to avoid parsing the string twice:

split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
- split_suffix = tldextract.extract(split_url.netloc)
+ split_suffix = tldextract.extract_urllib(split_url)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"
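
Put together, the tweaked workaround reads roughly as follows (a sketch assuming the extract_urllib helper from #274 is exposed at the module level, as used above):

import urllib.parse
import tldextract

split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
# Reuse the already-parsed SplitResult instead of parsing the netloc string a second time
split_suffix = tldextract.extract_urllib(split_url)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"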

@john-kurkowski
Owner

After thinking about it, I'd prefer this library to stay focused on domain names rather than every component of a URL, and to defer URL parsing to Python's standard library. I hope the workaround in the previous comment helps!
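
As a rough sketch of combining the two libraries along those lines (a hypothetical helper, not part of tldextract, which also omits the port when the URL doesn't specify one):

import urllib.parse
import tldextract


def registered_origin(url: str) -> str:
    """Rebuild a crawlable origin from the registered domain, keeping any explicit port."""
    split_url = urllib.parse.urlsplit(url)
    domain = tldextract.extract(split_url.netloc).registered_domain
    # split_url.port is None when the URL carries no explicit port
    port = f":{split_url.port}" if split_url.port else ""
    return f"{split_url.scheme}://{domain}{port}"

For example, registered_origin("https://foo.bar.com:8080") should return "https://bar.com:8080", while a URL without a port yields no trailing colon.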
