Get the port from the provided URL to extract function #272

Closed
cgr71ii opened this issue Sep 7, 2022 · 3 comments

Comments

@cgr71ii

cgr71ii commented Sep 7, 2022

Hi!

I was wondering whether it is possible to get the port from a URL when the extract function (or some other function) is invoked. I guess it is not, since I didn't see it in the documentation, and after digging a little in the code I didn't find anything related. I'm using this library to obtain URLs from a large list and then crawl them, so I need the port whenever it is defined. If it is not currently possible to obtain the port, is it intended to implement this functionality? Something like this would be ideal:

>>> tldextract.extract('http://127.0.0.1:8080/deployed/')
ExtractResult(subdomain='', domain='127.0.0.1', suffix='', port='8080')

Thank you!

@john-kurkowski
Owner

I took a stab at this in #273. I'm not sold on the solution as is. Feel free to chime in there. In the meantime, I suggest parsing the port with the standard library. Example:

import urllib.parse
import tldextract

split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
split_suffix = tldextract.extract(split_url.netloc)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"
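
For illustration, and assuming tldextract strips the :8080 port from the netloc before matching the suffix, the snippet above should give something like:

>>> url_to_crawl
'https://bar.com:8080'

Note that registered_domain is only the domain plus suffix, so the foo subdomain is dropped; split_url.hostname already holds the full host if you need it instead.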

@john-kurkowski
Owner

As of #274, the above workaround can be tweaked slightly to avoid parsing the string twice:

split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
- split_suffix = tldextract.extract(split_url.netloc)
+ split_suffix = tldextract.extract_urllib(split_url)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"
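
Put together, the tweaked workaround reads roughly as follows (a sketch assuming the extract_urllib helper from #274 is exposed at the module level, as used above):

import urllib.parse
import tldextract

split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
# Reuse the already-parsed SplitResult instead of parsing the netloc string a second time
split_suffix = tldextract.extract_urllib(split_url)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"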

@john-kurkowski
Owner

After thinking about it, I'd prefer this library to stay focused on domain names rather than every component of a URL, and to defer URL parsing to Python's standard library. I hope the workaround in the previous comment helps!
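
As a rough sketch of combining the two libraries along those lines (a hypothetical helper, not part of tldextract, which also omits the port when the URL doesn't specify one):

import urllib.parse
import tldextract


def registered_origin(url: str) -> str:
    """Rebuild a crawlable origin from the registered domain, keeping any explicit port."""
    split_url = urllib.parse.urlsplit(url)
    domain = tldextract.extract(split_url.netloc).registered_domain
    # split_url.port is None when the URL carries no explicit port
    port = f":{split_url.port}" if split_url.port else ""
    return f"{split_url.scheme}://{domain}{port}"

For example, registered_origin("https://foo.bar.com:8080") should return "https://bar.com:8080", while a URL without a port yields no trailing colon.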
