Get the port from the provided URL to extract function #272
I took a stab at this in #273. I'm not sold on the solution as is. Feel free to chime in there. In the meantime, I suggest parsing the port with the standard library. Example:

```python
split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
split_suffix = tldextract.extract(split_url.netloc)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"
```
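For the port alone, the standard library already covers it without tldextract: `urlsplit` exposes a `.port` attribute that is an `int` when the URL names a port explicitly and `None` otherwise. A minimal sketch (the helper name `port_of` is my own, not part of any library):

```python
from typing import Optional
from urllib.parse import urlsplit

def port_of(url: str) -> Optional[int]:
    """Return the explicit port from a URL, or None if none is given.

    Note that `.port` does NOT fill in scheme defaults: an https URL
    without an explicit port yields None, not 443. It raises ValueError
    for a malformed port.
    """
    return urlsplit(url).port

# Explicit port is returned as an int.
assert port_of("https://foo.bar.com:8080/path") == 8080
# No port in the URL -> None (443 is not inferred for https).
assert port_of("https://foo.bar.com/path") is None
```

This pairs naturally with the workaround above: let tldextract answer the domain-name question and `urlsplit` answer the port question.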
As of #274, the above workaround can be tweaked slightly to avoid parsing the string twice:

```diff
  split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
- split_suffix = tldextract.extract(split_url.netloc)
+ split_suffix = tldextract.extract_urllib(split_url)
  url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"
```
After thinking about it, this library is focused on domain names, not every component of a URL. I defer URL parsing to Python's standard library. I hope the workaround in the previous comment helps!
Hi!

I was wondering if it is possible to get the port from a URL when the `extract` function (or some other function) is invoked. I guess it is not, since I didn't see it in the documentation, and after digging a little in the code I didn't find anything related. I'm using this library to obtain URLs from a large list and then crawl them, so I need the port whenever it is defined. If it is not currently possible to obtain the port, is it intended to implement this functionality?

Thank you!