Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vault agent certificate renewal troubles #26367

Open
celesteking opened this issue Apr 11, 2024 · 5 comments
Open

vault agent certificate renewal troubles #26367

celesteking opened this issue Apr 11, 2024 · 5 comments
Assignees

Comments

@celesteking
Copy link

celesteking commented Apr 11, 2024

Have decided to replace our CA with Vault but I'm now facing 2 issues:

  1. Vault Agent won't reload the renewed certificate. There's an issue open but it says it was resolved. Apparently, it wasn't or this is a new issue. What's interesting is that it happens randomly and sporadically.
  2. Vault Agent will DoS the server if it can't renew the certificate.

The setup over here is pretty straightforward: Vault pki is used to issue the certificates and Vault agent is used to TLS cert auth against Vault server to request a renewed certificate. It's a chicken-and-egg scenario where initial certificate is issued manually and then gets indefinitely renewed. It's a poor man's certbot alternative that was supposed to work beautifully but it doesn't.

Vault agent host:

Apr 09 08:58:33 host.net vault[25781]: 2024-04-09T08:58:33.501-0500 [INFO]  agent.auth.handler: renewed auth token
Apr 09 18:11:50 host.net vault[25781]: 2024-04-09T18:11:50.890-0500 [INFO]  agent.auth.handler: renewed auth token
Apr 10 02:07:37 host.net vault[25781]: 2024-04-10T02:07:37.389-0500 [WARN]  agent: vault.pki(pki/issue/endpoint-selfrenew->/var/lib/puppet/ssl/endpoint.pem): TTL "2160h0m0s" is longer than permitted maxTTL "720h0m0s", so maxTTL is being used
Apr 10 02:07:37 host.net vault[25781]: 2024-04-10T02:07:37.389-0500 [WARN]  agent: vault.pki(pki/issue/endpoint-selfrenew->/var/lib/puppet/ssl/endpoint.pem): TTL "2160h0m0s" is longer than permitted maxTTL "720h0m0s", so maxTTL is being used
Apr 10 02:07:37 host.net vault[25781]: 2024-04-10T02:07:37.434-0500 [INFO]  agent: (runner) rendered "(dynamic)" => "/var/lib/puppet/ssl/endpoint.pem"
Apr 10 03:25:08 host.net vault[25781]: 2024-04-10T03:25:08.252-0500 [INFO]  agent.auth.handler: renewed auth token
Apr 10 12:38:25 host.net vault[25781]: 2024-04-10T12:38:25.637-0500 [INFO]  agent.auth.handler: renewed auth token
Apr 10 23:58:33 host.net vault[25781]: 2024-04-10T23:58:33.798-0500 [INFO]  agent.auth.handler: lifetime watcher done channel triggered
  << around this time server's /var overflowed >>
Apr 10 23:58:33 host.net vault[25781]: 2024-04-10T23:58:33.814-0500 [INFO]  agent.auth.handler: authenticating
Apr 10 23:58:33 host.net vault[25781]: 2024-04-10T23:58:33.865-0500 [ERROR] agent.auth.handler: error authenticating:
Apr 10 23:58:33 host.net vault[25781]: error=
Apr 10 23:58:33 host.net vault[25781]: | Error making API request.
Apr 10 23:58:33 host.net vault[25781]: |
Apr 10 23:58:33 host.net vault[25781]: | URL: PUT https://vault.srv:8200/v1/auth/cert/login
Apr 10 23:58:33 host.net vault[25781]: | Code: 500. Errors:
Apr 10 23:58:33 host.net vault[25781]: |
Apr 10 23:58:33 host.net vault[25781]: | * internal error
Apr 10 23:58:33 host.net vault[25781]: backoff=980ms
Apr 10 23:58:34 host.net vault[25781]: 2024-04-10T23:58:34.849-0500 [INFO]  agent.auth.handler: authenticating
Apr 10 23:58:34 host.net vault[25781]: 2024-04-10T23:58:34.988-0500 [ERROR] agent.auth.handler: error authenticating:
Apr 10 23:58:34 host.net vault[25781]: error=
Apr 10 23:58:34 host.net vault[25781]: | Error making API request.
Apr 10 23:58:34 host.net vault[25781]: |
Apr 10 23:58:34 host.net vault[25781]: | URL: PUT https://vault.srv:8200/v1/auth/cert/login
Apr 10 23:58:34 host.net vault[25781]: | Code: 500. Errors:
Apr 10 23:58:34 host.net vault[25781]: |
Apr 10 23:58:34 host.net vault[25781]: | * internal error
Apr 10 23:58:34 host.net vault[25781]: backoff=980ms
Apr 10 23:58:36 host.net vault[25781]: 2024-04-10T23:58:36.590-0500 [INFO]  agent.auth.handler: authenticating
Apr 10 23:58:36 host.net vault[25781]: 2024-04-10T23:58:36.712-0500 [ERROR] agent.auth.handler: error authenticating:
Apr 10 23:58:36 host.net vault[25781]: error=
Apr 10 23:58:36 host.net vault[25781]: | Error making API request.
Apr 10 23:58:36 host.net vault[25781]: |
Apr 10 23:58:36 host.net vault[25781]: | URL: PUT https://vault.srv:8200/v1/auth/cert/login
Apr 10 23:58:36 host.net vault[25781]: | Code: 500. Errors:
Apr 10 23:58:36 host.net vault[25781]: |
Apr 10 23:58:36 host.net vault[25781]: | * internal error
Apr 10 23:58:36 host.net vault[25781]: backoff=1.6s
Apr 10 23:58:39 host.net vault[25781]: 2024-04-10T23:58:39.138-0500 [INFO]  agent.auth.handler: authenticating
Apr 10 23:58:39 host.net vault[25781]: 2024-04-10T23:58:39.279-0500 [ERROR] agent.auth.handler: error authenticating:
Apr 10 23:58:39 host.net vault[25781]: error=
Apr 10 23:58:39 host.net vault[25781]: | Error making API request.
Apr 10 23:58:39 host.net vault[25781]: |
Apr 10 23:58:39 host.net vault[25781]: | URL: PUT https://vault.srv:8200/v1/auth/cert/login
Apr 10 23:58:39 host.net vault[25781]: | Code: 500. Errors:
Apr 10 23:58:39 host.net vault[25781]: |
Apr 10 23:58:39 host.net vault[25781]: | * internal error
Apr 10 23:58:39 host.net vault[25781]: backoff=2.42s
... and so on each ~5min

But on Vault server (mind the TZ):

#... | jq [.time, .auth.metadata.common_name, .request.operation, .request.path, .error]' 
["2024-04-09T13:58:33.4973504Z","host.net","update","auth/token/renew-self",null]
["2024-04-09T23:11:50.889135802Z","host.net","update","auth/token/renew-self",null]
["2024-04-10T07:07:37.378823826Z","host.net","update","pki/issue/endpoint-selfrenew",null]
["2024-04-10T08:25:08.250091454Z","host.net","update","auth/token/renew-self",null]
["2024-04-10T17:38:25.635681677Z","host.net","update","auth/token/renew-self",null]
["2024-04-11T02:51:42.975081291Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T21:51:42-05:00 is after 2024-04-10T21:45:31Z\n\n"]
["2024-04-11T02:51:43.041792099Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T21:51:43-05:00 is after 2024-04-10T21:45:31Z\n\n"]
["2024-04-11T02:51:43.067654351Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T21:51:43-05:00 is after 2024-04-10T21:45:31Z\n\n"]
["2024-04-11T02:51:43.086646745Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T21:51:43-05:00 is after 2024-04-10T21:45:31Z\n\n"]
.............
["2024-04-11T03:14:00.809214144Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>
["2024-04-11T03:14:00.829559363Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>
["2024-04-11T03:14:00.848438908Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>
["2024-04-11T03:14:00.868140783Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>
["2024-04-11T03:14:00.888379511Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>
["2024-04-11T03:14:00.909593973Z","host.net","update","auth/token/renew-self","1 error occurred:\n\t* failed to renew entry: failed to verify client's certificate: x509: certificate has expired or is not yet valid: current time 2024-04-10T22:14:00-05:00 is after 2>

It goes on and on, each 0.02 sec! As you see, it won't slow down, it hammers the server as much as it can and pollutes the audit log. This is a DoS in its pure form. That's just one host, imagine what happens when several agents decide go nuts.

disable_idle_connections = ["auto-auth", "templating"]
vault {
}
auto_auth {
  method {
    type   = "cert"
    config = {
      reload      = true
      client_cert = "/var/lib/puppet/ssl/endpoint.pem"
      client_key  = "/var/lib/puppet/ssl/endpoint.pem"
    }
  }
}
template_config {
  exit_on_retry_failure = true
}
template {
  error_on_missing_key = true
  create_dest_dirs     = false
  destination          = "/var/lib/puppet/ssl/endpoint.pem"
  contents             = <<-EOF
    {{- with pkiCert "pki/issue/endpoint-selfrenew" (print "common_name=" (env "HOSTNAME" | toLower ))  "ttl=90d" -}}
    {{- .Data.Cert }}
    {{- .Data.Key }}
    {{- end -}}
  EOF
}

# openssl x509 -dates  < /var/lib/puppet/ssl/endpoint.pem
notBefore=Apr 10 07:07:07 2024 GMT
notAfter=May 10 07:07:37 2024 GMT

I won't show server config as there's nothing interesting there, it's a dull almost empty config.
v1.16.1 both on server and clients

Please help.

@celesteking
Copy link
Author

Another point I'd like to bring up is the renewal trigger time. As per docs:

Vault Agent renews the certificate at half the lease duration of the original.

But it waited (2024-04-10T07:07:37) until almost expiration (2024-04-10T21:45:31Z) of the cert. As you see, they're issued for 1 month, so the cutoff time should've been 2024-03-25 or so.

@kitography kitography self-assigned this Apr 11, 2024
@Legogris
Copy link

Do you get different behavior if changing template_config.lease_renewal_threshold (PR) from default 0.9 to, say, 0.5?

There is also template_config.static_secret_render_interval, but I am not sure if that is relevant here.

@celesteking
Copy link
Author

That PR hasn't made into RPM yet thus I can't try. All this Vault PKI thing isn't Enterprise ready. It looks like there are no Enterprise customers who are using Vault in that fashion (private pki/certbot replacement) otherwise we would've seen tons of complaints.

As regarding the lease_renewal_threshold which you say is 90%, I wish it were 90%. Looking at logs, it stays around ~1% till cert expiration when the renewal kicks off -- that's way too close to the cutoff.

@VioletHynes
Copy link
Contributor

Hi there! I just wanted to let you know that I linked this to a bug we have open in our backlog we're trying to prioritize soon. Renewal retries without backoff is obviously something we should address, and we're planning to and hoping to soon. I can't promise more than that at this point, but I wanted to let you know that this is on our mind.

@celesteking
Copy link
Author

celesteking commented Apr 19, 2024

That's absolutely good news. Bullet 2) can be taken off the table then. But the first one is still relevant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants