Missing retry logic when using GCS backend might lead to cluster unavailable #26673

Open
fcrespofastly opened this issue Apr 27, 2024 · 1 comment
Labels: bug, storage/gcs

fcrespofastly commented Apr 27, 2024

Describe the bug
When acquiring leadership, a Vault replica starts restoring leases. If a single restore fails because GCS is unavailable, that Vault instance seals itself. In a fairly large environment with many leases, there is a chance this happens to all the Vault replicas in a short period of time, sealing the whole cluster. This has happened to us in a production cluster when GCS returned just a few (5) 503 responses. A 503 is considered a retryable error, and the Go GCS client used by this Vault version (v1.30.1) should retry it by default unless the context is canceled, which I could not spot.

To Reproduce
Hard to reproduce as it depends on a third party being unavailable.

Expected behavior
The Go GCS client library should retry the request, since the operation (a GET) is idempotent and the status code (503) is retryable.
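
For reference, here is a minimal sketch of how retries can be configured explicitly on the Go GCS client (cloud.google.com/go/storage), assuming the client version bundled with Vault exposes the Retryer/WithPolicy/WithBackoff options; the bucket and object names are placeholders:

package main

import (
	"context"
	"io"
	"log"
	"time"

	"cloud.google.com/go/storage"
	"github.com/googleapis/gax-go/v2"
)

func main() {
	ctx := context.Background()

	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Placeholder bucket name for illustration only.
	bucket := client.Bucket("my-vault-bucket").Retryer(
		// Retry idempotent operations (reads are idempotent) on retryable
		// errors such as 503, with exponential backoff.
		storage.WithPolicy(storage.RetryIdempotent),
		storage.WithBackoff(gax.Backoff{
			Initial:    500 * time.Millisecond,
			Max:        30 * time.Second,
			Multiplier: 2,
		}),
	)

	// Placeholder object key for illustration only.
	r, err := bucket.Object("some/lease/key").NewReader(ctx)
	if err != nil {
		log.Fatalf("read failed after retries: %v", err)
	}
	defer r.Close()

	if _, err := io.ReadAll(r); err != nil {
		log.Fatal(err)
	}
}

RetryIdempotent is the client's documented default policy, so a GET that fails with a 503 should be retried even without this explicit configuration.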

Environment:

  • Vault Server Version (retrieve with vault status): 1.15.4
  • Vault CLI Version (retrieve with vault version): v1.15.6 (irrelevant)
  • Server Operating System/Architecture: a 3-replica Vault cluster running on Kubernetes, using GCS as the storage backend and Cloud KMS to encrypt the keys

Vault server configuration file(s):

            api_addr     = "https://vault.{{ .Release.Namespace }}.svc:8200"
            cluster_addr = "https://$(POD_IP_ADDR):8201"

            log_level = "{{ .Values.logLevel }}"

            plugin_directory = "{{ .Values.pluginDirectory }}"

            ui = true

            seal "gcpckms" {
              project    = "{{ include "elvlib.cluster.gcpProjectId" $ }}"
              region     = "{{ include "elvlib.cluster.gcpRegion" $ }}"
              key_ring   = "{{ include "vault.config.keyRing" $ }}"
              crypto_key = "{{ .Values.cryptoKey }}"
            }

            storage "gcs" {
              bucket     = "{{ include "vault.config.bucket" $ }}"
              ha_enabled = "true"
            }

            telemetry "prometheus" {
              prometheus_retention_time = "60s"
              disable_hostname = true
              usage_gauge_period = "15s"
            }

            listener "tcp" {
              address       = "127.0.0.1:8200"
              tls_cert_file = "/etc/vault/tls/tls.crt"
              tls_key_file  = "/etc/vault/tls/tls.key"

              tls_disable_client_certs = true
            }

            listener "tcp" {
              address       = "$(POD_IP_ADDR):8200"
              tls_cert_file = "/etc/vault/tls/tls.crt"
              tls_key_file  = "/etc/vault/tls/tls.key"

              tls_disable_client_certs = true
              telemetry {
                unauthenticated_metrics_access = true
              }
            }

Additional context
The Go client library for GCS supports retries by default. The response code (503) is a retryable error and the HTTP method (GET) is considered idempotent, so the client should be retrying unless the context is canceled. I could not spot a cancellation (and it does not seem to be the case, as the Restore function would then surface a different error, per the code):

defer func() {
	// Turn off restore mode. We can do this safely without the lock because
	// if restore mode finished successfully, restore mode was already
	// disabled with the lock. In an error state, this will allow the
	// Stop() function to shut everything down.
	atomic.StoreInt32(m.restoreMode, 0)

	switch {
	case retErr == nil:
	case strings.Contains(retErr.Error(), context.Canceled.Error()):
		// Don't run error func because we lost leadership
		m.logger.Warn("context canceled while restoring leases, stopping lease loading")
		retErr = nil
	case errwrap.Contains(retErr, ErrBarrierSealed.Error()):
		// Don't run error func because we're likely already shutting down
		m.logger.Warn("barrier sealed while restoring leases, stopping lease loading")
		retErr = nil
	default:
		m.logger.Error("error restoring leases", "error", retErr)
		if errorFunc != nil {
			errorFunc()
		}
	}
}()
Google Cloud Support confirmed we are not receiving many 503s, and our monitoring confirms the same. During the time window in which each of the three replicas went through the loop of:
- acquire the lock
- restore leases
- fail to read a lease with a 503

we only saw three 503s in total, one per replica.
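
For illustration of the kind of backend-level retry logic the issue title refers to, here is a minimal sketch (not Vault's actual physical/gcs code; the readGCSObject callback, attempt count, and backoff values are assumptions) that retries a read on 503 but gives up as soon as the context is canceled:

package retrysketch

import (
	"context"
	"errors"
	"net/http"
	"time"

	"google.golang.org/api/googleapi"
)

// retryGet retries a GCS read on 503 with exponential backoff, but stops
// immediately once the context is canceled (e.g. on lost leadership).
// readGCSObject is a hypothetical callback standing in for the real read.
func retryGet(ctx context.Context, readGCSObject func(context.Context) ([]byte, error)) ([]byte, error) {
	backoff := 500 * time.Millisecond
	var lastErr error

	for attempt := 0; attempt < 5; attempt++ {
		data, err := readGCSObject(ctx)
		if err == nil {
			return data, nil
		}
		lastErr = err

		// A canceled context should stop retries right away.
		if ctx.Err() != nil {
			return nil, ctx.Err()
		}

		// Only retry when GCS answered with a 503.
		var gerr *googleapi.Error
		if !errors.As(err, &gerr) || gerr.Code != http.StatusServiceUnavailable {
			return nil, err
		}

		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}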

fcrespofastly (Author) commented
Update

I can see the context was canceled, so I believe that is why the Google storage Go client is not retrying the requests.

My next step is to figure out the context cancellation process and how this impacted the entire Vault cluster.
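
To sanity-check that, a minimal sketch along these lines (bucket and object names are placeholders) should fail immediately with a context.Canceled error instead of retrying, which matches the strings.Contains(retErr.Error(), context.Canceled.Error()) branch in the Restore defer above:

package main

import (
	"context"
	"log"
	"strings"

	"cloud.google.com/go/storage"
)

func main() {
	client, err := storage.NewClient(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Simulate the restore context being canceled (e.g. lost leadership).
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	// With a canceled context the call fails right away; no retries are
	// attempted regardless of the configured retry policy.
	_, err = client.Bucket("my-vault-bucket").Object("some/lease/key").NewReader(ctx)
	if err != nil && strings.Contains(err.Error(), context.Canceled.Error()) {
		log.Println("request aborted without retries: context canceled")
	}
}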

stevendpclark added the bug and storage/gcs labels on May 7, 2024