
Worry that the checker is actually downloading the content at the checked URL #106

Open
markcmiller86 opened this issue Feb 1, 2024 · 9 comments

Comments

@markcmiller86

For some reason, this goes pretty slowly. I am working from this document and it takes quite a while to complete a check. I also notice that on .pdf files, it stalls for longer, especially on the one at the ftp link.

So, this has me worried that it is actually fetching the full content to check each link. I've seen similar behavior in the Sphinx URL checking feature too. It should really just request the headers at each URL, not the full content.

Is this something you've looked into?
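For what it's worth, a rough way to confirm is to time a HEAD request against a full GET on one of the slow links (just a sketch; the URL below is a stand-in for one of the .pdf links):

import time
import requests

url = "https://example.com/big-file.pdf"  # stand-in for one of the slow .pdf links

start = time.perf_counter()
requests.head(url, timeout=30, allow_redirects=True)
print(f"HEAD took {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
requests.get(url, timeout=30)  # downloads the full body
print(f"GET  took {time.perf_counter() - start:.2f}s")

If the GET time dwarfs the HEAD time on large files, the checker is pulling down the whole payload.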

@vsoch
Collaborator

vsoch commented Feb 1, 2024

We should do HEAD instead of GET, I agree. We haven't looked into it, but we can.

@vsoch
Collaborator

vsoch commented Feb 1, 2024

Would you like me to update the branch we are working on to try it out?

@markcmiller86
Author

ChatGPT suggests something along the lines of:

import requests

def is_url_working(url):
    try:
        response = requests.head(url, timeout=30)
        # You might also want to check for redirects (response.status_code == 302)
        if response.status_code == 200:
            return True
        elif response.status_code == 405:  # HEAD requests disallowed
            retval = False
            response = requests.get(url, stream=True, timeout=30)
            if response.status_code == 200:
                retval = True
            response.close()  # Make sure to close the response
            return retval
        return False  # any other status code (e.g., 404) means the URL is not working
    except requests.RequestException as e:
        print(f"An error occurred: {e}")
        return False

# Example usage
url = "http://example.com"
if is_url_working(url):
    print(f"The URL {url} is working.")
else:
    print(f"The URL {url} is not working.")

@vsoch
Collaborator

vsoch commented Feb 1, 2024

Oh geez, ChatGPT? 🙃 In the first lines I already see issues:

The retval and response.close() don't make sense.

I appreciate the suggestion, but I don't think the quality of code from AI tools is very good. It's mostly copy-pasting some poor soul's code from somewhere else on GitHub. I'm happy to write this with my own knowledge and careful inspection of core docs and library code to get the functionality I want.

@vsoch
Collaborator

vsoch commented Feb 1, 2024

But I have to take it back - it does look like response.close() is useful for requests.get()! Geez, I've been writing Python for a long time and I just don't see it very often. So I learned something from ChatGPT! I appreciate the post, and I'll try to be more open minded about it (even if I don't use it)!
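For the record, Response objects in requests also work as context managers, which close the connection automatically; a minimal sketch (the URL is just a placeholder):

import requests

# Equivalent to calling response.close() by hand once we're done:
with requests.get("http://example.com", stream=True, timeout=30) as response:
    ok = response.status_code == 200
# The underlying connection is released when the block exits
print(ok)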

@markcmiller86
Author

> I'll try to be more open minded about it (even if I don't use it)!

So, I just happened to see this Q&A with Linus Torvalds about AI tools in coding...

@vsoch
Collaborator

vsoch commented Feb 1, 2024

haha I totally watched that! I'll watch again tonight with new context.

@markcmiller86
Author

So, the more I think about this, the more I think HEAD requests are most likely to be useful for links to non-.html content (e.g., .pdf or other binary content a link references). For most .html content, the cost to download is likely minimal. Something along the lines of the sketch below, maybe.
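Just to sketch what I mean (is_binary_link and the extension list are hypothetical, keyed off the URL's file extension):

import os
import requests

# Hypothetical: extensions we treat as binary / likely-large content
BINARY_EXTENSIONS = {".pdf", ".zip", ".gz", ".tar", ".png", ".jpg"}

def is_binary_link(url):
    return os.path.splitext(url)[1].lower() in BINARY_EXTENSIONS

def check_url(url, timeout=30):
    # HEAD for likely-large binary content, plain GET otherwise
    if is_binary_link(url):
        response = requests.head(url, timeout=timeout, allow_redirects=True)
    else:
        response = requests.get(url, timeout=timeout)
    return response.status_code == 200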

@vsoch
Collaborator

vsoch commented Feb 3, 2024

Agree! Let me work hard today (just presented at FOSDEM) and maybe I can do some work on this later if I'm productive!
