
vault operator raft snapshot save and restore fail to handle redirection to the active node #15258

Open
maxb opened this issue May 2, 2022 · 8 comments
Labels
bug Used to indicate a potential bug storage/raft

Comments

@maxb
Contributor

maxb commented May 2, 2022

Scenario: A 3-node Vault cluster using Raft storage, accessed via a load-balanced URL which can contact any one of the unsealed nodes.

Attempt to use vault operator raft snapshot save:

If it lands on a standby node, a rather opaque error is produced:

Error taking the snapshot: incomplete snapshot, unable to read SHA256SUMS.sealed file

Attempt to use vault operator raft snapshot restore:

If it lands on a standby node, a rather opaque error is produced:

Error installing the snapshot: redirect failed: Post "http://172.18.0.11:8200/v1/sys/storage/raft/snapshot": read snapshot.tar.gz: file already closed
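
A quick way to confirm which node the load-balanced address actually answered from (a minimal check, assuming the vault CLI and the same VAULT_ADDR as the failing command; each invocation through the load balancer may land on a different node):

# Prints "HA Mode    active" on the active node and "HA Mode    standby" on a standby.
vault status | grep 'HA Mode'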
@hsimon-hashicorp hsimon-hashicorp added bug Used to indicate a potential bug storage/raft labels May 2, 2022
@hsimon-hashicorp
Contributor

Hi there, @maxb! Thanks for this issue report. Our engineering teams are aware of this issue, and we have an item in the backlog to address it. (For my own internal tracking, it's VAULT-4568.) It hasn't been prioritized yet, however, so all I can currently say is to check out future release notes. :)

@dtulnovgg

Same behavior here. Are there any workarounds other than executing snapshot operations on the leader node?

@tcdev0

tcdev0 commented May 6, 2022

I use a small backup script on every node, skipping snapshots on follower nodes.

...
# snapshot if leader
if [ "$(vault operator raft list-peers --format=json | jq --raw-output '.data.config.servers[] | select(.leader==true) | .node_id')" = "$(hostname -a)" ]; then
  echo "make raft snapshot $raft_backup/$time.snapshot ..."
  /usr/local/bin/vault operator raft snapshot save $raft_backup/$time.snapshot
else
  echo "not leader, skipping raft snapshot."
fi

@pmcatominey

Traced the issue to #14269; the result is never updated here with the response of the redirected request.

@maxb
Contributor Author

maxb commented Jan 7, 2023

Although the linked PR #17269 has rightly identified a logic bug which should be fixed, it doesn't wholly fix this issue.

Many people may be running Vault behind a load balancer, without direct access to individual backend nodes. Just making the vault CLI client process the redirection properly won't help at all if it doesn't have network access to the redirected URL!

@hardeepsingh3

I'm also having the same issue while running Vault within AKS and running the raft snapshot save command on the leader raft pod. Any luck on a solution here?

@fancybear-dev

fancybear-dev commented Feb 27, 2023

We had a similar issue as well. I find it odd that there is no real solution for this from HashiCorp (proper redirection?), given that Raft in HA mode is the recommended setup.

We run an HA cluster of 5 VMs with Raft storage, using a MIG in GCP. We hit this issue too: we couldn't reliably create snapshots, because the request only succeeded if it happened to land on the leader, and the load balancer doesn't let you route to a specific VM (which is logical, since that's the whole point of a load balancer).

Our fix was to create a separate backend service with health checks that query /v1/sys/leader and verify that is_self is true. That backend only ever sees a single healthy VM, the leader, and it is used exclusively for the snapshot API call. Since the load balancer only routes to healthy VMs, the request always reaches the leader. Problem solved.

This tactic can also be used in other cloud environments, so perhaps this helps some people.
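
For reference, a minimal sketch of the leader check that such a health-check backend relies on (assuming curl and jq are available on the probing host; the address below is a placeholder for the individual node being probed):

# /v1/sys/leader is unauthenticated; it reports is_self=true only on the active node.
if curl -sf http://127.0.0.1:8200/v1/sys/leader | jq -e '.is_self == true' > /dev/null; then
  echo "active node: report healthy for the snapshot backend"
else
  echo "standby node: report unhealthy so the load balancer skips it"
  exit 1
fi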

@mohsen-abbas

We have consistently encountered the same issue with our Vault HA cluster on Kubernetes. Each time a new leader is elected, we have to update the VAULT_ADDR in our cronjob to point at the new leader. Essentially, we have set up a cronjob to regularly back up the Vault cluster and synchronize it with a GCP bucket.

Is there a way to dynamically determine the runtime leader and direct requests solely to the current leader of the cluster? Below is a snippet of the cronjob for your reference, and we welcome any further suggestions you may have. Your assistance is greatly appreciated.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-snapshot-cronjob
  namespace: vault-secrets-server
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vault-snapshotter
          volumes:
            - name: gcs-credentials
              secret:
                secretName: gcs-credentials
            - name: backup-dir
              emptyDir: {}
          containers:
            - name: backup
              image: vault:1.12.1
              imagePullPolicy: IfNotPresent
              env:
                - name: VAULT_ADDR
                  value: http://vault-server-1.vault-server-internal:8200
              command: ["/bin/sh", "-c"]
              args:
                - |
                  SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
                  export VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login jwt=$SA_TOKEN role=vault-backup);
                  vault operator raft snapshot save /data/vault-raft.snap;
                  sleep 120;
              volumeMounts:
                - name: backup-dir
                  mountPath: /data
            - name: snapshotupload
              image: google/cloud-sdk:latest
              imagePullPolicy: IfNotPresent
              command: ["/bin/sh", "-c"]
              args:
                - |
                  until [ -f /data/vault-raft.snap ]; do sleep 120; done;
                  gcloud auth activate-service-account --key-file=/data/credentials/service-account.json;
                  gsutil cp /data/vault-raft.snap gs://$bucket_name/vault_raft_$(date +"%Y%m%d_%H%M%S").snap;
              volumeMounts:
                - name: gcs-credentials
                  mountPath: /data/credentials
                  readOnly: true
                - name: backup-dir
                  mountPath: /data
          restartPolicy: OnFailure
      ttlSecondsAfterFinished: 900
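
One possible way to avoid hard-coding a node address, sketched below rather than taken from the manifest above: query the unauthenticated /v1/sys/leader endpoint on any reachable node and use its leader_address field as VAULT_ADDR before taking the snapshot. VAULT_SERVICE_ADDR is a hypothetical placeholder for a load-balanced or service address, and this assumes curl and jq are available in the backup image:

# Discover the current leader at runtime instead of pinning VAULT_ADDR to one node.
LEADER_ADDR=$(curl -s "$VAULT_SERVICE_ADDR/v1/sys/leader" | jq -r '.leader_address')
export VAULT_ADDR="$LEADER_ADDR"
vault operator raft snapshot save /data/vault-raft.snap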
