Vault is unable to restore a large snapshot #24245

Closed
benvanstaveren opened this issue Nov 23, 2023 · 3 comments

@benvanstaveren

Describe the bug
Vault is seemingly unable to restore a 29 GB snapshot.

To Reproduce
Get yourself a nice big snapshot, attempt to restore it to a newly initialized cluster (using -force), and watch the errors.

Expected behavior
A snapshot to be restored

Environment:

  • Vault Server Version (retrieve with vault status): 1.15.2
  • Vault CLI Version (retrieve with vault version): 1.15.2
  • Server Operating System/Architecture: Linux AMD64

Vault server configuration file(s):

disable_mlock = true
storage "raft" {
    path    = "/opt/vault/data"
    node_id = "myshinynode"
}
service_registration "consul" {
    address      = "127.0.0.1:8500"
}

ui = "true"
pid_file = "/opt/vault/vault.pid"

listener "tcp" {
    address                             = "10.x.x.2:8200"
    tls_disable                         = "true"
    proxy_protocol_behavior             = "allow_authorized"
    proxy_protocol_authorized_addrs     = "10.x.x.29"
    x_forwarded_for_authorized_addrs    = "10.x.x.29"
    x_forwarded_for_reject_not_authorized = "false"
    x_forwarded_for_reject_not_present  = "false"
    max_request_duration                = "3600s"
    max_request_size                    = -1
}
listener "tcp" {
    address                             = "127.0.0.1:8200"
    tls_disable                         = "true"
    max_request_duration                = "3600s"
    max_request_size                    = -1
}
api_addr = "http://10.x.x.2:8200"
cluster_addr = "https://10.x.x.2:8201"

Additional context
I can't reproduce the exact error messages at the moment because I'm in the middle of a recovery attempt using some filthy methods, but roughly:

attempt #1: "could not read request body"
then increased the Vault client timeout with export VAULT_CLIENT_TIMEOUT=86400s
attempt #2: "could not read request body"
then set max_request_duration and max_request_size in the Vault listener config
attempt #3..n: "broken pipe"

This tells me that the Vault client is attempting to dump the entire 29 GB to the Vault server in one sitting, and the Vault server is obviously not liking this very much.

It's mildly annoying that the "official" backup and restore method isn't actually working to restore the backup I made...

@benvanstaveren
Author

To add: if you switch to using curl instead of the Vault client to restore a (now) 31 GB snapshot, you need a machine with more than 64 GB of memory, because otherwise the OOM killer will get you. I'm concerned this is a problem for Consul and Nomad snapshots as well, and it makes me uneasy about restoring from disastrous outages.

@benvanstaveren
Author

Set up a fresh server with Vault, 128 GB of RAM, and a reasonably freshly taken snapshot. This is the result of attempting to restore said snapshot with curl:

# curl -v --header 'X-Vault-Token: redacted.token' --request POST --data-binary @2023-11-23-22h13.snap http://127.0.0.1:8200/v1/sys/storage/raft/snapshot-force
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:8200...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8200 (#0)
> POST /v1/sys/storage/raft/snapshot-force HTTP/1.1
> Host: 127.0.0.1:8200
> User-Agent: curl/7.68.0
> Accept: */*
> X-Vault-Token: redacted.token
> Content-Length: 33029265974
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Cache-Control: no-store
< Content-Type: application/json
< Strict-Transport-Security: max-age=31536000; includeSubDomains
< Date: Sat, 25 Nov 2023 02:26:24 GMT
< Content-Length: 43
< Connection: close
<
{"errors":["failed to read request body"]}
* we are done reading and this is set to close, stop send
* Closing connection 0

The vault log (with level debug) shows only the following:

Nov 25 03:26:24 mynode vault[13768]: 2023-11-25T02:26:24.941Z [DEBUG] core: completed_request: start_time=2023-11-25T02:25:52Z duration=32749ms client_id="" client_address=127.0.0.1:38232 status_code=500 request_path=/v1/sys/storage/raft/snapshot-force request_method=POST

What do I do? I'm now running our supposed "vault cluster" on a single node, that I can back up to snapshots, but I apparently cannot restore said snapshots. I'd like a solution...

@benvanstaveren
Author

Welp. Solution found: increasing http_read_timeout on the listener did the trick. I do still feel the default timeout is too low for production use; I'm quite sure I'm not the only one with large snapshots to restore. Anyway, I'll close this, but it may be worth documenting somewhere (i.e. large snapshots -> increase the HTTP read timeout).
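
For anyone landing here with the same problem, here is a minimal sketch of what that listener change could look like. It reuses the loopback listener from the configuration earlier in this issue and adds http_read_timeout. The "3600s" value is an assumption picked only to match max_request_duration, not the value actually used for the fix, so size it to however long your snapshot takes to upload. Vault's documented default for http_read_timeout is 30s, which is consistent with the failed request above giving up after roughly 32 seconds.

```hcl
listener "tcp" {
    address              = "127.0.0.1:8200"
    tls_disable          = "true"
    max_request_duration = "3600s"
    max_request_size     = -1

    # Assumed value for illustration only: allow up to an hour for the
    # server to read the full request body (the snapshot upload).
    # Vault's documented default is "30s".
    http_read_timeout    = "3600s"
}
```

The same setting would need to go on whichever listener the restore request actually passes through, for example the 10.x.x.2:8200 listener if the restore is not done over loopback.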
