
pageserver: use adaptive concurrency in secondary layer downloads #7675

Merged
merged 4 commits into main on May 13, 2024

Conversation

jcsp
Contributor

@jcsp jcsp commented May 9, 2024

Problem

Secondary downloads are a low priority task, and intentionally do not try to max out download speeds. This is almost always fine when they are used through the life of a tenant shard as a continuous "trickle" of background downloads.

However, there are sometimes circumstances where we would like to populate a secondary location as fast as we can, within the constraint that we don't want to impact the activity of attached tenants:

  • During node removal, where we will need to create replacements for secondary locations on the node being removed
  • After a shard split, we need new secondary locations for the new shards to populate before the shards can be migrated to their final location.

Summary of changes

  • Add an activity() function to the remote storage interface, enabling callers to query how busy the remote storage backend is
  • In the secondary download code, use a very modest amount of concurrency, driven by the remote storage's state: we only use concurrency above 1 if the remote storage semaphore is at least 75% free, and scale the amount of concurrency within that top range.

This is not a super clever form of prioritization, but it should accomplish the key goals:

  • Enable secondary downloads to happen faster when the system is idle
  • Make secondary downloads a much lower priority than attached tenants when the remote storage is busy.
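The heuristic described above can be sketched roughly as follows. This is an illustrative sketch only, assuming a hypothetical `RemoteStorageActivity` snapshot (as an `activity()`-style accessor might return) and an assumed maximum concurrency of 8; none of these names are the actual pageserver API.

```rust
/// Hypothetical snapshot of how busy the remote storage backend is,
/// as an `activity()`-style accessor might report it.
struct RemoteStorageActivity {
    read_available: usize, // free permits on the download semaphore
    read_total: usize,     // total permits on the download semaphore
}

/// Assumed upper bound on secondary-download concurrency (illustrative).
const MAX_SECONDARY_DOWNLOAD_CONCURRENCY: usize = 8;

/// Stay at concurrency 1 unless at least 75% of the semaphore is free,
/// then scale linearly across the top quarter of availability.
fn pick_concurrency(activity: &RemoteStorageActivity) -> usize {
    if activity.read_total == 0 {
        return 1;
    }
    let free = activity.read_available as f64 / activity.read_total as f64;
    if free < 0.75 {
        1
    } else {
        // Map `free` in [0.75, 1.0] onto [1, MAX].
        let t = (free - 0.75) / 0.25;
        1 + (t * (MAX_SECONDARY_DOWNLOAD_CONCURRENCY - 1) as f64).round() as usize
    }
}

fn main() {
    // Busy backend (50% free): secondary downloads stay serial.
    assert_eq!(
        pick_concurrency(&RemoteStorageActivity { read_available: 50, read_total: 100 }),
        1
    );
    // Fully idle backend: use the maximum.
    assert_eq!(
        pick_concurrency(&RemoteStorageActivity { read_available: 100, read_total: 100 }),
        8
    );
    println!("ok");
}
```

The design point here is that attached-tenant traffic consumes semaphore permits first, so secondary downloads only ramp up when the backend is demonstrably idle.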

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels May 9, 2024

github-actions bot commented May 9, 2024

3060 tests run: 2927 passed, 0 failed, 133 skipped (full report)


Code coverage* (full report)

  • functions: 31.4% (6334 of 20179 functions)
  • lines: 47.3% (47882 of 101230 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
539dcbf at 2024-05-13T17:48:46.323Z

@jcsp jcsp force-pushed the jcsp/secondary-concurrency branch from 5fb008b to 4c2be5a Compare May 12, 2024 17:44
@jcsp jcsp marked this pull request as ready for review May 13, 2024 08:44
@jcsp jcsp requested a review from a team as a code owner May 13, 2024 08:44
@jcsp jcsp requested a review from arpad-m May 13, 2024 08:44
pageserver/src/tenant/secondary/downloader.rs (review thread, resolved)
pageserver/src/tenant/secondary/downloader.rs (review thread, outdated, resolved)
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
@jcsp jcsp enabled auto-merge (squash) May 13, 2024 16:59
@jcsp jcsp merged commit 972470b into main May 13, 2024
50 checks passed
@jcsp jcsp deleted the jcsp/secondary-concurrency branch May 13, 2024 17:38
a-masterov pushed a commit that referenced this pull request May 20, 2024
jcsp added a commit that referenced this pull request May 23, 2024
…oads`

AKA Pull request #7675

This partially reverts commit 5f0fbdf.  We keep the part of that PR that
refactored download_layer into a function.
jcsp added a commit that referenced this pull request May 24, 2024
…m always yield Err after cancel (#7866)

## Problem

Ongoing hunt for secondary location shutdown hang issues.

## Summary of changes

- Revert the functional changes from #7675 
- Tweak a log in secondary downloads to make it more apparent when we
drop out on cancellation
- Modify DownloadStream's behavior to always return an Err after it has
been cancelled. This _should_ not impact anything, but it makes the
behavior simpler to reason about (e.g. even if the poll function somehow
got called again, it could never end up in an un-cancellable state)

Related #neondatabase/cloud#13576
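The "always yield Err after cancel" behavior can be illustrated with a minimal, non-async sketch. `ChunkSource` and its methods below are hypothetical stand-ins for the real async `DownloadStream`; the point is only the latching: once cancelled, every subsequent poll fails, so no code path can observe more data after cancellation.

```rust
/// Hypothetical stand-in for a download stream that latches into a
/// permanently-failing state once cancelled.
struct ChunkSource {
    chunks: Vec<&'static str>,
    pos: usize,
    cancelled: bool,
}

impl ChunkSource {
    fn new(chunks: Vec<&'static str>) -> Self {
        Self { chunks, pos: 0, cancelled: false }
    }

    /// Latch into the cancelled state; there is no way back.
    fn cancel(&mut self) {
        self.cancelled = true;
    }

    /// Once cancelled, every subsequent poll yields Err -- even if the
    /// caller somehow keeps polling, it can never re-enter a
    /// non-cancelled state.
    fn next(&mut self) -> Result<Option<&'static str>, &'static str> {
        if self.cancelled {
            return Err("cancelled");
        }
        let item = self.chunks.get(self.pos).copied();
        self.pos += 1;
        Ok(item)
    }
}

fn main() {
    let mut src = ChunkSource::new(vec!["a", "b", "c"]);
    assert_eq!(src.next(), Ok(Some("a")));
    src.cancel();
    // Every poll after cancellation is an error, permanently.
    assert_eq!(src.next(), Err("cancelled"));
    assert_eq!(src.next(), Err("cancelled"));
    println!("ok");
}
```

Latching like this makes shutdown easier to reason about: correctness no longer depends on callers promising never to poll again after cancellation.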