Add exponential backoff to linkcheck #6629

Closed
tbekolay opened this issue Aug 2, 2019 · 5 comments
Labels: extensions, type:enhancement (enhance or introduce a new feature)


tbekolay commented Aug 2, 2019

Is your feature request related to a problem? Please describe.
We link to GitHub PRs/issues for each entry in our changelog, which is included in our documentation. That means we fire off a hundred or so checks to various pages on GitHub. Whenever GitHub is not running smoothly, we end up with timeouts, which slows down our development until GitHub is reliable again.

Describe the solution you'd like
Allow the linkcheck builder to retry requests a few times (maybe 3, or a configurable number) with exponential backoff. https://www.peterbe.com/plog/best-practice-with-retries-with-requests gives a quick and easy implementation of how to do this with requests.get, which is what linkcheck currently uses to fetch pages.
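
For reference, a minimal sketch of the pattern from that article: mount a retrying HTTPAdapter on a requests.Session so that session.get() transparently retries with exponential backoff. The function name and the defaults below are illustrative, not something linkcheck ships today.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def retrying_session(retries=3, backoff_factor=0.5,
                     status_forcelist=(429, 500, 502, 503, 504)):
    """Build a requests.Session whose requests retry with exponential backoff."""
    session = requests.Session()
    retry = Retry(
        total=retries,
        connect=retries,
        read=retries,
        backoff_factor=backoff_factor,       # exponential backoff between retry attempts
        status_forcelist=status_forcelist,   # also retry on these HTTP status codes
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

# Usage: response = retrying_session().get("https://github.com/sphinx-doc/sphinx/issues/6629")
```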

Describe alternatives you've considered
Another thing that might help would be to collect all the links to check in one step, then sort them so that requests to the same domain are spread out rather than bunched together. That is, you would make one request to each unique domain before making a second request to any domain you have already hit.
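
A rough sketch of that ordering, assuming links are plain URL strings (interleave_by_domain is a hypothetical helper, not part of linkcheck):

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse

def interleave_by_domain(links):
    """Yield links so that each round touches every domain once before repeating any."""
    by_domain = defaultdict(list)
    for link in links:
        by_domain[urlparse(link).netloc].append(link)
    for round_ in zip_longest(*by_domain.values()):
        yield from (link for link in round_ if link is not None)
```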

@tbekolay tbekolay added the type:enhancement enhance or introduce a new feature label Aug 2, 2019
@tk0miya tk0miya added this to the 2.3.0 milestone Aug 4, 2019
Member

tk0miya commented Aug 4, 2019

Reasonable. Could you make a PR please?

@tk0miya tk0miya modified the milestones: 2.3.0, 2.4.0 Dec 14, 2019
@tk0miya tk0miya modified the milestones: 2.4.0, 3.0.0 Feb 5, 2020
@tk0miya tk0miya modified the milestones: 3.0.0, 4.0.0 Mar 14, 2020
@francoisfreitag
Contributor

Hi. I’m trying to solve this.

The linkcheck builder starts worker threads, then puts every link it encounters into a queue.Queue. The threads consume that queue, providing parallelization.
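
A simplified sketch of that arrangement (check_link is a stand-in for the real per-link request, and the worker count here is arbitrary):

```python
import queue
import threading

def check_link(uri: str) -> None:
    # Stand-in for the real check: linkcheck issues an HTTP request and records the status.
    print("checking", uri)

link_queue: queue.Queue = queue.Queue()

def worker() -> None:
    while True:
        uri = link_queue.get()
        if uri is None:          # sentinel: no more links, let the thread exit
            break
        check_link(uri)
        link_queue.task_done()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(5)]
for thread in threads:
    thread.start()
```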

I built a solution based on priority queues, detailed below, but I’m not satisfied with it and would like to increase the scope to use asyncio. That has ramifications for the project, so I would like to discuss it before getting too deep into the implementation.

PriorityQueue

Replace the work Queue with a queue.PriorityQueue. The priority is the next timestamp when the link should be checked.

The priority is time.time() for new links.
When a server replies with a 429, the thread receiving that response computes the next priority:

  1. If max_retries for that link is reached, bail out. max_retries is user-controlled, per domain, and can be different from linkcheck_retries.
  2. Compute the new priority as follows (see the sketch below):
     • Retry-After is an integer: priority = time.time() + retry_after
     • Retry-After is an HTTP date: priority = http_date_to_unix_timestamp(retry_after)
     • Retry-After absent or invalid: priority = time.time() + exponential_backoff_starting_at_60
  3. Queue the link back into the PriorityQueue with that priority.

Each worker thread pulls from the queue. If the priority is in the future, the thread re-queues the item with the same priority and goes to sleep.
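
A minimal sketch of that priority computation, assuming the response comes from requests (next_check_time and its parameters are illustrative names, not existing Sphinx API):

```python
import time
from email.utils import parsedate_to_datetime

def next_check_time(response, num_retries: int, base_delay: float = 60.0) -> float:
    """Priority (Unix timestamp) at which a rate-limited link should be re-checked."""
    retry_after = response.headers.get("Retry-After")
    if retry_after is not None:
        try:
            return time.time() + int(retry_after)          # Retry-After given in seconds
        except ValueError:
            try:
                # Retry-After given as an HTTP date
                return parsedate_to_datetime(retry_after).timestamp()
            except (TypeError, ValueError):
                pass
    # Header absent or unparsable: exponential backoff starting at 60 seconds
    return time.time() + base_delay * 2 ** num_retries
```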

Issues

  • (existing) The number of threads limits concurrency. With a reasonable number of threads (e.g. 3), all threads may be waiting for a response and the work queue does not get consumed. Asynchronously checking links would increase concurrency by keeping the CPU busy.

  • (existing) Use of multiple threads requires thread-safe operations and data structures, which adds complexity. For example, the results writer functions keep opening and closing the result files because each thread needs to write to them.

  • (new) With the need to sleep to honor rate-limits, threads exhibit two sub-optimal behaviors:

    1. waking up when there is nothing to do, only to go back to sleep
    2. sleeping when a new link is queued up

    The added scheduling to handle 429 is a poor substitute for what an event loop has built-in. For example, loop.call_at() would be very convenient.

Suggested changes

  • Replace requests with aiohttp.
    1. Use a wrapper that makes async code synchronous for sync use cases (get the event loop, queue the request, run the event loop until complete); see the sketch after this list.
    2. Adapt existing code that expects a requests.Response to use an aiohttp.ClientResponse. Both look pretty similar.
    3. Make a compatibility wrapper for arguments where a requests object was expected but aiohttp expects a different input. Consider REQUESTS_CA_BUNDLE, tls_cacerts, and auth_info for the linkcheck_auth setting.
  • Deprecate and eventually drop linkcheck_workers setting.
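
A possible shape for that synchronous facade, assuming a plain aiohttp GET (check_url, its parameters, and the use of asyncio.run are illustrative; the real adapter would also have to map auth, TLS, and proxy options):

```python
import asyncio
import aiohttp

def check_url(url: str, timeout: float = 30.0) -> int:
    """Blocking helper around aiohttp for call sites that still expect synchronous behavior."""
    async def _get() -> int:
        async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=timeout)) as session:
            async with session.get(url, allow_redirects=True) as response:
                return response.status
    # Spin up an event loop just for this request; fine for sync call sites,
    # replaced by a long-lived loop once the builder itself becomes async.
    return asyncio.run(_get())
```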

Next steps

  1. Replace requests with aiohttp.
  2. Use asynchronous concurrency for linkcheck (sketched below):
     • Run an event loop in a separate thread.
     • The builder queues async functions to check each link onto the event loop with asyncio.run_coroutine_threadsafe().
     • Solve this issue by teaching the check coroutine to sleep when it receives a 429.
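
A bare-bones sketch of that loop-in-a-thread arrangement (check_link here is a placeholder coroutine, not the eventual implementation):

```python
import asyncio
import threading

async def check_link(uri: str) -> str:
    # Placeholder for the real coroutine; on a 429 it would
    # `await asyncio.sleep(delay)` and retry instead of returning immediately.
    await asyncio.sleep(0)
    return f"{uri}: working"

# The event loop lives in its own thread; the builder thread stays synchronous.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

# The builder schedules one coroutine per link and waits on the returned futures.
uris = ["https://github.com/sphinx-doc/sphinx/issues/6629"]
futures = [asyncio.run_coroutine_threadsafe(check_link(uri), loop) for uri in uris]
results = [future.result() for future in futures]

loop.call_soon_threadsafe(loop.stop)
```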

If there’s interest in that plan, I’m happy to break the next steps into separate issues and tackle them.

Possible extension

To squeeze out even more performance for linkcheck, threads could be introduced again: a ThreadPoolExecutor could execute the coroutines. That introduces the complexity of sharing data across threads and is left for future work.

Member

tk0miya commented Nov 8, 2020

If there’s interest in that plan, I’m happy to break the next steps into separate issues and tackle them.

My main concern is who will maintain it. I'm not good at asyncio and aiohttp, so it would be nice if you became the maintainer of the new linkcheck builder. What do you think?

Note: we also have to make sure the new one works fine on Windows.

Contributor

francoisfreitag commented Nov 8, 2020

Thanks for the quick feedback. I don’t mind maintaining linkcheck and helping out with aiohttp. I use linkcheck in factory_boy and think it’s a very useful extension generally.

My day job is as a web developer (mostly Python and Django). I’ve been doing that for about 7 years, so I’m pretty familiar with the Web and Python.

I’m new to async, but eager to work with it. This change is a good opportunity to grow more familiar with async, and fixing the (hopefully few) issues arising from this change will be a great learning experience.

If not being experienced with async beyond a couple of personal test projects is a big concern, I’m okay with sticking to the multi-threaded solution, the priority queue, and requests. It just does not seem like the best way to solve the problem, and async offers exciting possibilities.

Member

tk0miya commented Nov 8, 2020

Sounds good :-) Let's move to the new architecture!

francoisfreitag added commits to francoisfreitag/sphinx that referenced this issue Nov 20–21, 2020
Follow the Retry-After header if present, otherwise use an exponential
back-off.
francoisfreitag added a commit to francoisfreitag/sphinx that referenced this issue Nov 21, 2020
Follow the Retry-After header if present, otherwise use an exponential
back-off. Allow users to decide the wait time between retries and when
to bail out.
francoisfreitag added a commit to francoisfreitag/sphinx that referenced this issue Nov 22, 2020
Follow the Retry-After header if present, otherwise use an exponential
back-off.

Close sphinx-doc#7388
@tk0miya tk0miya modified the milestones: 4.0.0, 3.4.0 Nov 25, 2020
tk0miya added a commit that referenced this issue Nov 25, 2020
Fix #6629: linkcheck: Handle rate-limiting
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021