Missing retry logic when using GCS backend might lead to cluster unavailable #26673

Open
fcrespofastly opened this issue Apr 27, 2024 · 1 comment
Labels: bug, storage/gcs

fcrespofastly commented Apr 27, 2024

Describe the bug
When acquiring leadership, a Vault replica starts restoring leases. If a single restore fails because GCS is unavailable, that Vault instance seals itself. In a fairly large environment with many leases, there is a chance this happens to all the Vault replicas in a short period of time, sealing the whole cluster. This has happened to us in a production cluster when GCS returned just a few (5) 503 responses. A 503 is considered a retryable error, and the Go GCS client used by this Vault version (v1.30.1) should retry it by default unless the context is canceled, which I could not spot.

To Reproduce
Hard to reproduce as it depends on a third party being unavailable.

Expected behavior
The Go GCS client library should retry the request, since the operation (a GET) is idempotent and the status code (503) is retryable.
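
For reference, here is a minimal sketch of how retries can be configured explicitly on the Go GCS client (cloud.google.com/go/storage), assuming the client version bundled with Vault exposes the Retryer/WithPolicy/WithBackoff options; the bucket and object names are placeholders:

package main

import (
	"context"
	"io"
	"log"
	"time"

	"cloud.google.com/go/storage"
	"github.com/googleapis/gax-go/v2"
)

func main() {
	ctx := context.Background()

	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Placeholder bucket name for illustration only.
	bucket := client.Bucket("my-vault-bucket").Retryer(
		// Retry idempotent operations (reads are idempotent) on retryable
		// errors such as 503, with exponential backoff.
		storage.WithPolicy(storage.RetryIdempotent),
		storage.WithBackoff(gax.Backoff{
			Initial:    500 * time.Millisecond,
			Max:        30 * time.Second,
			Multiplier: 2,
		}),
	)

	// Placeholder object key for illustration only.
	r, err := bucket.Object("some/lease/key").NewReader(ctx)
	if err != nil {
		log.Fatalf("read failed after retries: %v", err)
	}
	defer r.Close()

	if _, err := io.ReadAll(r); err != nil {
		log.Fatal(err)
	}
}

RetryIdempotent is the client's documented default policy, so a GET that fails with a 503 should be retried even without this explicit configuration.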

Environment:

  • Vault Server Version (retrieve with vault status): 1.15.4
  • Vault CLI Version (retrieve with vault version): v1.15.6 (irrelevant)
  • Server Operating System/Architecture: a 3-replica Vault cluster running on Kubernetes, using GCS as the storage backend and Cloud KMS to encrypt the keys

Vault server configuration file(s):

            api_addr     = "https://vault.{{ .Release.Namespace }}.svc:8200"
            cluster_addr = "https://$(POD_IP_ADDR):8201"

            log_level = "{{ .Values.logLevel }}"

            plugin_directory = "{{ .Values.pluginDirectory }}"

            ui = true

            seal "gcpckms" {
              project    = "{{ include "elvlib.cluster.gcpProjectId" $ }}"
              region     = "{{ include "elvlib.cluster.gcpRegion" $ }}"
              key_ring   = "{{ include "vault.config.keyRing" $ }}"
              crypto_key = "{{ .Values.cryptoKey }}"
            }

            storage "gcs" {
              bucket     = "{{ include "vault.config.bucket" $ }}"
              ha_enabled = "true"
            }

            telemetry "prometheus" {
              prometheus_retention_time = "60s"
              disable_hostname = true
              usage_gauge_period = "15s"
            }

            listener "tcp" {
              address       = "127.0.0.1:8200"
              tls_cert_file = "/etc/vault/tls/tls.crt"
              tls_key_file  = "/etc/vault/tls/tls.key"

              tls_disable_client_certs = true
            }

            listener "tcp" {
              address       = "$(POD_IP_ADDR):8200"
              tls_cert_file = "/etc/vault/tls/tls.crt"
              tls_key_file  = "/etc/vault/tls/tls.key"

              tls_disable_client_certs = true
              telemetry {
                unauthenticated_metrics_access = true
              }
            }

Additional context
The Go client library for GCS supports retries by default. The response code (503) is a retryable error and the HTTP method (GET) is considered idempotent, so the client should be retrying unless the context is canceled. I could not spot a cancellation (and it does not seem to be the case, as the Restore function would then surface a different error, per the code):

defer func() {
	// Turn off restore mode. We can do this safely without the lock because
	// if restore mode finished successfully, restore mode was already
	// disabled with the lock. In an error state, this will allow the
	// Stop() function to shut everything down.
	atomic.StoreInt32(m.restoreMode, 0)

	switch {
	case retErr == nil:
	case strings.Contains(retErr.Error(), context.Canceled.Error()):
		// Don't run error func because we lost leadership
		m.logger.Warn("context canceled while restoring leases, stopping lease loading")
		retErr = nil
	case errwrap.Contains(retErr, ErrBarrierSealed.Error()):
		// Don't run error func because we're likely already shutting down
		m.logger.Warn("barrier sealed while restoring leases, stopping lease loading")
		retErr = nil
	default:
		m.logger.Error("error restoring leases", "error", retErr)
		if errorFunc != nil {
			errorFunc()
		}
	}
}()
Google Cloud Support confirmed we are not receiving many 503s, and our monitoring confirms the same. During the time window in which each of the three replicas went through the loop of:
- acquire the lock
- restore leases
- fail to read a lease with a 503

we only saw three 503s in total, one per replica.
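
For illustration of the kind of backend-level retry logic the issue title refers to, here is a minimal sketch (not Vault's actual physical/gcs code; the readGCSObject callback, attempt count, and backoff values are assumptions) that retries a read on 503 but gives up as soon as the context is canceled:

package retrysketch

import (
	"context"
	"errors"
	"net/http"
	"time"

	"google.golang.org/api/googleapi"
)

// retryGet retries a GCS read on 503 with exponential backoff, but stops
// immediately once the context is canceled (e.g. on lost leadership).
// readGCSObject is a hypothetical callback standing in for the real read.
func retryGet(ctx context.Context, readGCSObject func(context.Context) ([]byte, error)) ([]byte, error) {
	backoff := 500 * time.Millisecond
	var lastErr error

	for attempt := 0; attempt < 5; attempt++ {
		data, err := readGCSObject(ctx)
		if err == nil {
			return data, nil
		}
		lastErr = err

		// A canceled context should stop retries right away.
		if ctx.Err() != nil {
			return nil, ctx.Err()
		}

		// Only retry when GCS answered with a 503.
		var gerr *googleapi.Error
		if !errors.As(err, &gerr) || gerr.Code != http.StatusServiceUnavailable {
			return nil, err
		}

		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}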

fcrespofastly (Author) commented
Update

I can see the context was canceled, so I believe that is why the Google storage Go client is not retrying the requests.

My next step is to figure out the context cancellation process and how this impacted the entire Vault cluster.
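
To sanity-check that, a minimal sketch along these lines (bucket and object names are placeholders) should fail immediately with a context.Canceled error instead of retrying, which matches the strings.Contains(retErr.Error(), context.Canceled.Error()) branch in the Restore defer above:

package main

import (
	"context"
	"log"
	"strings"

	"cloud.google.com/go/storage"
)

func main() {
	client, err := storage.NewClient(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Simulate the restore context being canceled (e.g. lost leadership).
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	// With a canceled context the call fails right away; no retries are
	// attempted regardless of the configured retry policy.
	_, err = client.Bucket("my-vault-bucket").Object("some/lease/key").NewReader(ctx)
	if err != nil && strings.Contains(err.Error(), context.Canceled.Error()) {
		log.Println("request aborted without retries: context canceled")
	}
}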

stevendpclark added the bug and storage/gcs labels on May 7, 2024