scrubber: add separate find/purge garbage commands (#5409)
## Problem

The previous garbage cleanup functionality relied on doing a dry run,
inspecting logs, and then doing a deletion. This isn't ideal, because
what one actually deletes might not be the same as what one saw in the
dry run. It's also risky UX to rely on presence/absence of one CLI flag
to control deletion: ideally the deletion command should be totally
separate from the one that scans the bucket.

Related: #5037

## Summary of changes

This is a major re-work of the code, which results in a net decrease in
line count of about 600. The old code for removing garbage was built
around the idea of doing discovery and purging together: a
"delete_batch_producer" sent batches into a deleter. The new code writes
out both procedures separately, in functions that use the async streams
introduced in #5176 to achieve fast concurrent access to S3 while
retaining the readability of a single function.

- Add `find-garbage`, which writes out a JSON file of tenants/timelines
to purge
- Add `purge-garbage` which consumes the garbage JSON file, applies some
extra validations, and does deletions.
- The purge command will refuse to execute if the garbage file indicates
that only garbage was found: this guards against classes of bugs where
the scrubber might incorrectly deem everything garbage.
- The purge command defaults to only deleting tenants that were found in
"deleted" state in the control plane. This guards against the risk that
using the wrong console API endpoint could cause all tenants to appear
to be missing.
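
The two purge guards described above can be sketched roughly as follows. This is a minimal illustration, not the scrubber's actual code: the type names, the `active_tenant_count` field, and `select_for_purge` are all hypothetical, and the real garbage file format is not shown in this commit message.

```rust
// Hypothetical types for illustration only; the real garbage list
// format is defined by the scrubber and is not reproduced here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GarbageReason {
    /// The control plane reports the entity as deleted.
    DeletedInConsole,
    /// The control plane has no record of the entity at all.
    MissingInConsole,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PurgeMode {
    DeletedOnly,
    DeletedAndMissing,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct GarbageEntity {
    name: String,
    reason: GarbageReason,
}

struct GarbageList {
    /// Assumed field: how many non-garbage tenants the scan saw.
    active_tenant_count: usize,
    items: Vec<GarbageEntity>,
}

/// Pick the entities that a purge would delete under the given mode.
/// Refuses outright if the scan saw nothing but garbage.
fn select_for_purge(
    list: &GarbageList,
    mode: PurgeMode,
) -> Result<Vec<&GarbageEntity>, String> {
    if list.active_tenant_count == 0 {
        return Err("refusing to purge: the scan found only garbage".to_string());
    }
    Ok(list
        .items
        .iter()
        .filter(|e| match (mode, e.reason) {
            (_, GarbageReason::DeletedInConsole) => true,
            (PurgeMode::DeletedAndMissing, GarbageReason::MissingInConsole) => true,
            (PurgeMode::DeletedOnly, GarbageReason::MissingInConsole) => false,
        })
        .collect())
}
```

Keeping the selection in a pure function like this would make the safety rules testable without touching S3, which is one plausible reason to separate them from the deletion loop.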

Outstanding work for a future PR:
- Make whatever changes are needed to adapt to the Console/Control Plane
separation.
- Make purge even safer by checking S3 `Modified` times for
index_part.json files (not doing this here, because it will depend on
the generation-aware changes for finding index_part.json files)
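
As a rough idea of what the `Modified`-time check might look like once it is implemented, the sketch below compares an object's last-modified time against a minimum age. The function name `old_enough_to_purge` and the `min_age` threshold are assumptions made for illustration; nothing like this exists in this PR.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical helper: returns true only when the object was last
/// modified at least `min_age` before `now`.
fn old_enough_to_purge(
    last_modified: SystemTime,
    now: SystemTime,
    min_age: Duration,
) -> bool {
    match now.duration_since(last_modified) {
        Ok(age) => age >= min_age,
        // `last_modified` is later than `now` (for example clock skew):
        // be conservative and do not purge.
        Err(_) => false,
    }
}
```

Passing `now` in explicitly, rather than calling `SystemTime::now()` inside, keeps the check deterministic in tests.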

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Co-authored-by: Shany Pozin <shany@neon.tech>
3 people committed Oct 26, 2023
1 parent 39b148b commit 7c16b52
Showing 12 changed files with 727 additions and 1,433 deletions.
71 changes: 39 additions & 32 deletions s3_scrubber/README.md
@@ -25,57 +25,64 @@ _This section is only relevant if using a command that requires access to Neon's

### Commands

#### `tidy`
#### `find-garbage`

Iterate over S3 buckets for storage nodes, checking their contents and removing the data not present in the console. Node S3 data that's not removed is then further checked for discrepancies and, sometimes, validated.

Unless the global `--delete` argument is provided, this command only dry-runs and logs
what it would have deleted.

```
tidy --node-kind=<safekeeper|pageserver> [--depth=<tenant|timeline>] [--skip-validation]
```
Walk an S3 bucket and cross-reference the contents with the Console API to identify data for
tenants or timelines that should no longer exist.

- `--node-kind`: whether to inspect safekeeper or pageserver bucket prefix
- `--depth`: whether to only search for deletable tenants, or also search for
deletable timelines within active tenants. Default: `tenant`
- `--skip-validation`: skip additional post-deletion checks. Default: `false`
- `--output-path`: filename to write garbage list to. Default `garbage.json`

For a selected S3 path, the tool lists the S3 bucket given for either tenants or both tenants and timelines — for every found entry, console API is queried: any deleted or missing in the API entity is scheduled for deletion from S3.
This command outputs a JSON file describing tenants and timelines to remove, for subsequent
processing by the `purge-garbage` subcommand.

If validation is enabled, only the non-deleted tenants' ones are checked.
For pageserver, timelines' index_part.json on S3 is also checked for various discrepancies: no files are removed, even if there are "extra" S3 files not present in index_part.json: due to the way pageserver updates the remote storage, it's better to do such removals manually, stopping the corresponding tenant first.
**Note that the garbage list format is not stable. The output of `find-garbage` is only
intended for use by the exact same version of the tool running `purge-garbage`**

Command examples:
Example:

`env SSO_ACCOUNT_ID=369495373322 REGION=eu-west-1 BUCKET=neon-dev-storage-eu-west-1 CLOUD_ADMIN_API_TOKEN=${NEON_CLOUD_ADMIN_API_STAGING_KEY} CLOUD_ADMIN_API_URL=[url] cargo run --release -- tidy --node-kind=safekeeper`
`env SSO_ACCOUNT_ID=123456 REGION=eu-west-1 BUCKET=my-dev-bucket CLOUD_ADMIN_API_TOKEN=${NEON_CLOUD_ADMIN_API_STAGING_KEY} CLOUD_ADMIN_API_URL=[url] cargo run --release -- find-garbage --node-kind=pageserver --depth=tenant --output-path=eu-west-1-garbage.json`

`env SSO_ACCOUNT_ID=369495373322 REGION=us-east-2 BUCKET=neon-staging-storage-us-east-2 CLOUD_ADMIN_API_TOKEN=${NEON_CLOUD_ADMIN_API_STAGING_KEY} CLOUD_ADMIN_API_URL=[url] cargo run --release -- tidy --node-kind=pageserver --depth=timeline`
#### `purge-garbage`

When dry run stats look satisfying, use `-- --delete` before the `tidy` command to
disable dry run and run the binary with deletion enabled.
Consume a garbage list from `find-garbage`, and delete the related objects in the S3 bucket.

See these lines (and lines around) in the logs for the final stats:
- `--input-path`: filename to read garbage list from. Default `garbage.json`.
- `--mode`: controls whether to purge only garbage that was specifically marked
deleted in the control plane (`deletedonly`), or also to purge tenants/timelines
that were not present in the control plane at all (`deletedandmissing`)

- `Finished listing the bucket for tenants`
- `Finished active tenant and timeline validation`
- `Total tenant deletion stats`
- `Total timeline deletion stats`
This command learns region/bucket details from the garbage file, so it is not necessary
to pass them on the command line

## Current implementation details
Example:

- The tool does not have any peristent state currently: instead, it creates very verbose logs, with every S3 delete request logged, every tenant/timeline id check, etc.
Worse, any panic or early errored tasks might force the tool to exit without printing the final summary — all affected ids will still be in the logs though. The tool has retries inside it, so it's error-resistant up to some extent, and recent runs showed no traces of errors/panics.
`env SSO_ACCOUNT_ID=123456 cargo run --release -- purge-garbage --node-kind=pageserver --depth=tenant --input-path=eu-west-1-garbage.json`

- Instead of checking non-deleted tenants' timelines instantly, the tool attempts to create separate tasks (futures) for that,
complicating the logic and slowing down the process, this should be fixed and done in one "task".
Add the `--delete` argument before `purge-garbage` to enable deletion. This is intentionally
not provided inline in the example above to avoid accidents. Without the `--delete` flag
the purge command will log all the keys that it would have deleted.

- The tool does uses only publicly available remote resources (S3, console) and does not access pageserver/safekeeper nodes themselves.
Yet, its S3 set up should be prepared for running on any pageserver/safekeeper node, using node's S3 credentials, so the node API access logic could be implemented relatively simply on top.
#### `scan-metadata`

## Cleanup procedure:
Walk objects in a pageserver S3 bucket, and report statistics on the contents.

```
env SSO_ACCOUNT_ID=123456 REGION=eu-west-1 BUCKET=my-dev-bucket CLOUD_ADMIN_API_TOKEN=${NEON_CLOUD_ADMIN_API_STAGING_KEY} CLOUD_ADMIN_API_URL=[url] cargo run --release -- scan-metadata
Timelines: 31106
With errors: 3
With warnings: 13942
With garbage: 0
Index versions: 2: 13942, 4: 17162
Timeline size bytes: min 22413312, 1% 52133887, 10% 56459263, 50% 101711871, 90% 191561727, 99% 280887295, max 167535558656
Layer size bytes: min 24576, 1% 36879, 10% 36879, 50% 61471, 90% 44695551, 99% 201457663, max 275324928
Timeline layer count: min 1, 1% 3, 10% 6, 50% 16, 90% 25, 99% 39, max 1053
```

### Pageserver preparations
## Cleaning up running pageservers

If S3 state is altered first manually, pageserver in-memory state will contain wrong data about S3 state, and tenants/timelines may get recreated on S3 (due to any layer upload due to compaction, pageserver restart, etc.). So before proceeding, for tenants/timelines which are already deleted in the console, we must remove these from pageservers.

183 changes: 16 additions & 167 deletions s3_scrubber/src/checks.rs
@@ -1,178 +1,27 @@
use std::collections::{hash_map, HashMap, HashSet};
use std::sync::Arc;
use std::time::Duration;
use std::collections::HashSet;

use anyhow::Context;
use aws_sdk_s3::Client;
use tokio::task::JoinSet;
use tracing::{error, info, info_span, warn, Instrument};
use tracing::{error, info, warn};

use crate::cloud_admin_api::{BranchData, CloudAdminApiClient, ProjectId};
use crate::delete_batch_producer::DeleteProducerStats;
use crate::{download_object_with_retries, list_objects_with_retries, RootTarget, MAX_RETRIES};
use crate::cloud_admin_api::BranchData;
use crate::{download_object_with_retries, list_objects_with_retries, RootTarget};
use pageserver::tenant::storage_layer::LayerFileName;
use pageserver::tenant::IndexPart;
use utils::id::TenantTimelineId;

pub async fn validate_pageserver_active_tenant_and_timelines(
s3_client: Arc<Client>,
s3_root: RootTarget,
admin_client: Arc<CloudAdminApiClient>,
batch_producer_stats: DeleteProducerStats,
) -> anyhow::Result<BranchCheckStats> {
let Some(timeline_stats) = batch_producer_stats.timeline_stats else {
info!("No tenant-only checks, exiting");
return Ok(BranchCheckStats::default());
};

let s3_active_projects = batch_producer_stats
.tenant_stats
.active_entries
.into_iter()
.map(|project| (project.id.clone(), project))
.collect::<HashMap<_, _>>();
info!("Validating {} active tenants", s3_active_projects.len());

let mut s3_active_branches_per_project = HashMap::<ProjectId, Vec<BranchData>>::new();
let mut s3_blob_data = HashMap::<TenantTimelineId, S3TimelineBlobData>::new();
for active_branch in timeline_stats.active_entries {
let active_project_id = active_branch.project_id.clone();
let active_branch_id = active_branch.id.clone();
let active_timeline_id = active_branch.timeline_id;

s3_active_branches_per_project
.entry(active_project_id.clone())
.or_default()
.push(active_branch);

let Some(active_project) = s3_active_projects.get(&active_project_id) else {
error!(
"Branch {:?} for project {:?} has no such project in the active projects",
active_branch_id, active_project_id
);
continue;
};

let id = TenantTimelineId::new(active_project.tenant, active_timeline_id);
s3_blob_data.insert(
id,
list_timeline_blobs(&s3_client, id, &s3_root)
.await
.with_context(|| format!("List timeline {id} blobs"))?,
);
}

let mut branch_checks = JoinSet::new();
for (_, s3_active_project) in s3_active_projects {
let project_id = &s3_active_project.id;
let tenant_id = s3_active_project.tenant;

let mut console_active_branches =
branches_for_project_with_retries(&admin_client, project_id)
.await
.with_context(|| {
format!("Client API branches for project {project_id:?} retrieval")
})?
.into_iter()
.map(|branch| (branch.id.clone(), branch))
.collect::<HashMap<_, _>>();

let active_branches = s3_active_branches_per_project
.remove(project_id)
.unwrap_or_default();
info!(
"Spawning tasks for {} tenant {} active timelines",
active_branches.len(),
tenant_id
);
for s3_active_branch in active_branches {
let console_branch = console_active_branches.remove(&s3_active_branch.id);
let timeline_id = s3_active_branch.timeline_id;
let id = TenantTimelineId::new(tenant_id, timeline_id);
let s3_data = s3_blob_data.remove(&id);
let s3_root = s3_root.clone();
branch_checks.spawn(
async move {
let check_errors = branch_cleanup_and_check_errors(
&id,
&s3_root,
Some(&s3_active_branch),
console_branch,
s3_data,
)
.await;
(id, check_errors)
}
.instrument(info_span!("check_timeline", id = %id)),
);
}
}

let mut total_stats = BranchCheckStats::default();
while let Some((id, analysis)) = branch_checks
.join_next()
.await
.transpose()
.context("branch check task join")?
{
total_stats.add(id, analysis.errors);
}
Ok(total_stats)
}

async fn branches_for_project_with_retries(
admin_client: &CloudAdminApiClient,
project_id: &ProjectId,
) -> anyhow::Result<Vec<BranchData>> {
for _ in 0..MAX_RETRIES {
match admin_client.branches_for_project(project_id, false).await {
Ok(branches) => return Ok(branches),
Err(e) => {
error!("admin list branches for project {project_id:?} query failed: {e}");
tokio::time::sleep(Duration::from_secs(1)).await;
}
}
}

anyhow::bail!("Failed to list branches for project {project_id:?} {MAX_RETRIES} times")
}

#[derive(Debug, Default)]
pub struct BranchCheckStats {
pub timelines_with_errors: HashMap<TenantTimelineId, Vec<String>>,
pub normal_timelines: HashSet<TenantTimelineId>,
}

impl BranchCheckStats {
pub fn add(&mut self, id: TenantTimelineId, check_errors: Vec<String>) {
if check_errors.is_empty() {
if !self.normal_timelines.insert(id) {
panic!("Checking branch with timeline {id} more than once")
}
} else {
match self.timelines_with_errors.entry(id) {
hash_map::Entry::Occupied(_) => {
panic!("Checking branch with timeline {id} more than once")
}
hash_map::Entry::Vacant(v) => {
v.insert(check_errors);
}
}
}
}
}

pub struct TimelineAnalysis {
pub(crate) struct TimelineAnalysis {
/// Anomalies detected
pub errors: Vec<String>,
pub(crate) errors: Vec<String>,

/// Healthy-but-noteworthy, like old-versioned structures that are readable but
/// worth reporting for awareness that we must not remove that old version decoding
/// yet.
pub warnings: Vec<String>,
pub(crate) warnings: Vec<String>,

/// Keys not referenced in metadata: candidates for removal
pub garbage_keys: Vec<String>,
/// Keys not referenced in metadata: candidates for removal, but NOT NECESSARILY: beware
/// of races between reading the metadata and reading the objects.
pub(crate) garbage_keys: Vec<String>,
}

impl TimelineAnalysis {
@@ -185,7 +34,7 @@ impl TimelineAnalysis {
}
}

pub async fn branch_cleanup_and_check_errors(
pub(crate) async fn branch_cleanup_and_check_errors(
id: &TenantTimelineId,
s3_root: &RootTarget,
s3_active_branch: Option<&BranchData>,
@@ -320,21 +169,21 @@ pub async fn branch_cleanup_and_check_errors(
}

#[derive(Debug)]
pub struct S3TimelineBlobData {
pub blob_data: BlobDataParseResult,
pub keys_to_remove: Vec<String>,
pub(crate) struct S3TimelineBlobData {
pub(crate) blob_data: BlobDataParseResult,
pub(crate) keys_to_remove: Vec<String>,
}

#[derive(Debug)]
pub enum BlobDataParseResult {
pub(crate) enum BlobDataParseResult {
Parsed {
index_part: IndexPart,
s3_layers: HashSet<LayerFileName>,
},
Incorrect(Vec<String>),
}

pub async fn list_timeline_blobs(
pub(crate) async fn list_timeline_blobs(
s3_client: &Client,
id: TenantTimelineId,
s3_root: &RootTarget,

1 comment on commit 7c16b52

@github-actions

2409 tests run: 2199 passed, 88 failed, 122 skipped (full report)


Failures on Postgres 16

  • test_ondemand_download_timetravel[real_s3]: debug
  • test_ondemand_download_large_rel[real_s3]: debug
  • test_pageserver_restart[False]: debug
  • test_remote_storage_backup_and_restore[False-real_s3]: debug
  • test_tenant_delete_smoke[real_s3]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-background-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-remove-timelines-dir-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-create-local-mark-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-polling-ongoing-deletions-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-rm-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-create-remote-mark-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-remove-timelines-dir-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-remove-deleted-mark-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm-dir-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-deleted-at-False]: debug
  • test_detach_while_attaching[real_s3]: debug
  • test_emergency_relocate_with_branches_createdb[real_s3]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-deleted-at]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-rm]: release, debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-during-rm]: release, debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm-metadata]: release, debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm-dir]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-during-rm]: release
  • test_timeline_delete_resumed_on_attach[real_s3]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-index-delete]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-delete]: debug
  • test_s3_wal_replay[real_s3]: release

Failures on Postgres 15

  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm-False]: debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-delete]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-rm]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-deleted-at]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-during-rm]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-deleted-at]: release
  • test_timeline_delete_works_for_remote_smoke[real_s3]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm-dir]: release

Failures on Postgres 14

  • test_bulk_insert[neon]: release
  • test_startup: release
  • test_ondemand_download_large_rel[real_s3]: debug
  • test_pageserver_chaos: debug
  • test_remote_storage_backup_and_restore[False-real_s3]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm-False]: release, debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-deleted-at-False]: release
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm-dir-False]: release, debug
  • test_tenant_delete_smoke[real_s3]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-shutdown-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-create-remote-mark-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-create-local-mark-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-remove-timelines-dir-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-rm-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-create-local-mark-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-cleanup-remaining-fs-traces-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-remove-timelines-dir-False]: debug
  • test_delete_tenant_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-remove-tenant-dir-False]: debug
  • test_emergency_relocate_with_branches_createdb[real_s3]: release
  • test_emergency_relocate_with_branches_slow_replay[real_s3]: release, debug
  • test_tenants_many[real_s3]: release, debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-deleted-at]: release, debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-schedule]: release, debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-index-delete]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-during-rm]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-delete]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm]: release, debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-during-rm]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm-metadata]: release
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-deleted-at]: debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-rm]: debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm]: debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-delete]: debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm-metadata]: debug
  • test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-schedule]: debug
  • test_timeline_resurrection_on_attach[True-real_s3]: debug
  • test_s3_wal_replay[real_s3]: release
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_bulk_insert[neon] or test_startup or test_ondemand_download_large_rel[debug-pg14-real_s3] or test_pageserver_chaos[debug-pg14] or test_remote_storage_backup_and_restore[debug-pg14-False-real_s3] or test_delete_tenant_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm-False] or test_delete_tenant_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-deleted-at-False] or test_delete_tenant_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm-dir-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm-dir-False] or test_tenant_delete_smoke[debug-pg14-real_s3] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-shutdown-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-create-remote-mark-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-create-local-mark-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-remove-timelines-dir-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-rm-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-create-local-mark-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-cleanup-remaining-fs-traces-False] or 
test_delete_tenant_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-remove-timelines-dir-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-remove-tenant-dir-False] or test_emergency_relocate_with_branches_createdb[release-pg14-real_s3] or test_emergency_relocate_with_branches_slow_replay[release-pg14-real_s3] or test_emergency_relocate_with_branches_slow_replay[debug-pg14-real_s3] or test_tenants_many[release-pg14-real_s3] or test_tenants_many[debug-pg14-real_s3] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-deleted-at] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-deleted-at] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-schedule] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-schedule] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-index-delete] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-during-rm] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-delete] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm] or 
test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-during-rm] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm-metadata] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-deleted-at] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-rm] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-delete] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm-metadata] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-schedule] or test_timeline_resurrection_on_attach[debug-pg14-True-real_s3] or test_s3_wal_replay[release-pg14-real_s3] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg15-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm-False] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg15-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-delete] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg15-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-rm] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg15-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-deleted-at] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg15-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-during-rm] or 
test_delete_timeline_exercise_crash_safety_failpoints[release-pg15-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg15-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg15-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-deleted-at] or test_timeline_delete_works_for_remote_smoke[release-pg15-real_s3] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg15-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm-dir] or test_ondemand_download_timetravel[debug-pg16-real_s3] or test_ondemand_download_large_rel[debug-pg16-real_s3] or test_pageserver_restart[debug-pg16-False] or test_remote_storage_backup_and_restore[debug-pg16-False-real_s3] or test_tenant_delete_smoke[debug-pg16-real_s3] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-background-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-remove-timelines-dir-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-create-local-mark-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-polling-ongoing-deletions-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-rm-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-tenant-delete-before-create-remote-mark-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-remove-timelines-dir-False] or 
test_delete_tenant_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITH_RESTART-real_s3-tenant-delete-before-remove-deleted-mark-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm-dir-False] or test_delete_tenant_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-index-deleted-at-False] or test_detach_while_attaching[debug-pg16-real_s3] or test_emergency_relocate_with_branches_createdb[release-pg16-real_s3] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-deleted-at] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-rm] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-rm] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-during-rm] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-during-rm] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm-metadata] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm-metadata] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm-dir] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg16-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-before-rm] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg16-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-rm] or 
test_delete_timeline_exercise_crash_safety_failpoints[release-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-after-rm] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg16-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-during-rm] or test_timeline_delete_resumed_on_attach[release-pg16-real_s3] or test_delete_timeline_exercise_crash_safety_failpoints[release-pg16-Check.RETRY_WITH_RESTART-real_s3-timeline-delete-after-index-delete] or test_delete_timeline_exercise_crash_safety_failpoints[debug-pg16-Check.RETRY_WITHOUT_RESTART-real_s3-timeline-delete-before-index-delete] or test_s3_wal_replay[release-pg16-real_s3]"
Flaky tests (4)

Postgres 16

  • test_pageserver_restart[True]: debug
  • test_pageserver_with_empty_tenants[real_s3]: release

Postgres 14

  • test_pageserver_chaos: debug
  • test_pageserver_with_empty_tenants[real_s3]: release

Test coverage report is not available

The comment gets automatically updated with the latest test results
7c16b52 at 2023-10-26T20:30:44.605Z :recycle:
