Set user agent to avoid errors in link_checker and load_data #950

Closed
lukehsiao opened this issue Feb 17, 2020 · 2 comments · Fixed by #952

@lukehsiao
Contributor

lukehsiao commented Feb 17, 2020

Bug Report

zola check currently reports a link as broken when the server responds with an error (e.g., 400 or 403) because the request carries no User-Agent header. This follows from the current implementation: link_checker doesn't set one. Can we allow the link checker to set a user agent, and/or provide a default zola user agent?

Environment

Ubuntu 18.04.4

Zola version:
v0.10.0

Expected Behavior

Some servers return errors when the user agent header is missing. For example, when running the link_checker on a URL such as https://arxiv.org/abs/1906.01113, the link_checker will report a 403 and declare this as a dead link. This can be seen using an example test case:

components/link_checker/src/lib.rs

    #[test]
    fn user_agent_test() {
        let url = "https://arxiv.org/abs/1906.01113";

        let res = check_url(url, &LinkChecker::default());
        assert!(res.is_valid());
        assert!(res.code.is_some());
        assert!(res.error.is_none());
    }

This same test case will pass if a user agent is included, e.g.:

pub fn check_url(url: &str, config: &LinkChecker) -> LinkResult {
    {
        let guard = LINKS.read().unwrap();
        if let Some(res) = guard.get(url) {
            return res.clone();
        }
    }

    let mut headers = HeaderMap::new();
    headers.insert(ACCEPT, "text/html".parse().unwrap());
    headers.append(ACCEPT, "*/*".parse().unwrap());
    headers.append(USER_AGENT, "zola/0.10.0 link_checker".parse().unwrap());
...

Without a USER_AGENT, the test will fail.

We could mitigate this issue by:

  • Setting a default user agent (such as the zola-specific user agent shown above)
  • Allowing users to specify a user agent via configuration

For a default user agent, we probably do not want a hard-coded string. Instead, we could follow the reqwest example and build it from the crate name and version at compile time:

https://docs.rs/reqwest/0.10.1/reqwest/struct.ClientBuilder.html#method.user_agent

// Name your user agent after your app?
static APP_USER_AGENT: &str = concat!(
    env!("CARGO_PKG_NAME"),
    "/",
    env!("CARGO_PKG_VERSION"),
);

let client = reqwest::Client::builder()
    .user_agent(APP_USER_AGENT)
    .build()?;
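
To illustrate how the two mitigations could fit together, here is a minimal std-only sketch: a user-supplied value wins, and otherwise a name/version default is used. `CheckerSettings`, its `user_agent` field, and both helper functions are hypothetical names for illustration, not Zola's actual config or API. The "zola" and "unknown" fallbacks are also assumptions, there only so the sketch compiles outside Cargo.

```rust
/// Hypothetical settings struct for illustration; not Zola's real config type.
#[derive(Default)]
struct CheckerSettings {
    /// Optional user-supplied override, e.g. read from a config file.
    user_agent: Option<String>,
}

/// Default user agent in the `name/version` form used by the reqwest example.
/// `option_env!` keeps the sketch compilable without Cargo; the fallback
/// strings "zola" and "unknown" are assumptions.
fn default_user_agent() -> String {
    let name = option_env!("CARGO_PKG_NAME").unwrap_or("zola");
    let version = option_env!("CARGO_PKG_VERSION").unwrap_or("unknown");
    format!("{}/{}", name, version)
}

/// Prefer a non-blank configured value, otherwise fall back to the default.
fn effective_user_agent(settings: &CheckerSettings) -> String {
    match settings.user_agent.as_deref().map(str::trim) {
        Some(ua) if !ua.is_empty() => ua.to_string(),
        _ => default_user_agent(),
    }
}

fn main() {
    // No override configured: the name/version default is used.
    let defaults = CheckerSettings::default();
    println!("default:  {}", effective_user_agent(&defaults));

    // Override configured: the user's value is used as-is.
    let custom = CheckerSettings {
        user_agent: Some("my-site-checker/1.0".to_string()),
    };
    println!("override: {}", effective_user_agent(&custom));
}
```

The resulting string is what would then be handed to `.user_agent(...)` on the client builder shown above. Treating a blank override as "not set" is a design choice in this sketch, on the assumption that sending an empty header would just reproduce the original failures.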

Some other example URLs which return 400/403s without a user agent:

@lukehsiao changed the title from "Mitigating link_checker 403s by providing a user agent" to "Mitigating link_checker errors by providing a user agent" Feb 17, 2020
@17cupsofcoffee
Contributor

17cupsofcoffee commented Feb 17, 2020

This doesn't just affect the link checker - load_data also doesn't seem to send a user agent header any more. APIs like GitHub and Crates.io require that header to be set for you to be able to get a successful response, meaning if a Zola site is trying to pull data from one of these APIs, the build will fail.

To give a concrete example, I tried to build https://arewegameyet.rs on the new version of Zola and the build fails due to Crates.io returning a 403.

EDIT: To add a little context, the reason this has broken now is that reqwest 0.10 changed its defaults.

@lukehsiao
Contributor Author

Good point. I will rename this issue to capture the more general problem.

@lukehsiao changed the title from "Mitigating link_checker errors by providing a user agent" to "Set user agent to avoid errors in link_checker and load_data" Feb 18, 2020
@Keats added the "done in pr" (Already done in a PR) label Feb 19, 2020
@Keats closed this as completed in 661bd9c Mar 12, 2020