[feature] allow reload of Consul config stanza and Consul client #4593

Open
jrasell opened this issue Aug 17, 2018 · 17 comments

Comments

@jrasell
Member

jrasell commented Aug 17, 2018

When using TLS for Consul connectivity, it is preferable to use low-TTL certificates with frequent renewal. Currently, renewing the Consul certificates requires restarting the Nomad client. It would be preferable if the Nomad client could use SIGHUP to reload the Consul client/config, as it already does for the Nomad TLS certificates. This would be a much less intrusive operation and involve less risk.

@rkettelerij
Contributor

I assume this only applies if you configured the Nomad agent to talk with the Consul agent over HTTPS (https://www.nomadproject.io/docs/agent/configuration/consul.html#ssl)? If you use plain HTTP (over localhost) and you've configured Consul with short-lived certificates then you'll only need to restart the Consul agent when certificate renewal occurs.

@jrasell
Member Author

jrasell commented Aug 30, 2018

@rkettelerij you're correct.

@nvx
Contributor

nvx commented Feb 20, 2019

Allowing reloading of the Consul config stanza would also allow for refreshing the Consul ACL token, which I suspect would be a more common use case.

@nagraskal

I'd like to see both the ACL token and the client certificates reloaded when the Nomad agent is reloaded. We are obligated to use verify_incoming on the Consul agents with short-lived certificates, and restarting the Nomad agent at every cert and/or token renewal is painful.

@Xopherus
Contributor

I'd really like this feature as well. Nomad is really lacking in support for reloading TLS configurations. Right now you can only update the tls configuration for the nomad agents themselves. That doesn't help you when your entire cluster (i.e., Nomad + Vault + Consul) uses the same root CA. You end up in a situation where you might not lose your quorum, but you can't actually schedule any work. There are many issues that I've run into which are all related to this core problem (#3247, #3746, #4413, #4593, #6052).

I've been doing a little bit of digging and it seems like the reloading logic is scattered across the agent, client, and server code. So the reloading logic is very inconsistent across the board. From what I can gather, we seem to be in this state:

Agent

  • Reload tls configuration? YES
  • Reload vault configuration? N/A - vault client is tied to servers + clients, not agent.
  • Reload consul configuration? NO. Not supported.

Server

  • Reload tls configuration? YES
  • Reload vault configuration? NO* - Servers will reload ONLY if you change the path to your CAFile, CertFile, or KeyFile. If you reload a new cert to the same file, it does not reload. I opened #6677 ("Reload VaultConfig if CAFile, CertFile, KeyFile have changed") to try and take a stab at fixing it. I tried testing with it, but I'm still running into vault integration issues so it makes me feel like there's more that is missing.
  • Reload consul configuration? N/A - consul client is tied to agent, not server.

Clients

  • Reload tls configuration? YES
  • Reload vault configuration? NO. Not supported. Seems like the client's vault integration is much different than the servers, so it seems like it would take some refactoring to support this.
  • Reload consul configuration? N/A - consul client is tied to agent, not client.
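
The "same path, new contents" gap in the server's Vault reload could be closed by comparing file contents instead of paths. A minimal sketch, assuming a hypothetical `tlsFiles` struct and `fingerprint` helper (neither is a Nomad type or function, and this is not how #6677 is implemented):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
	"path/filepath"
)

// tlsFiles is a hypothetical stand-in for the CAFile/CertFile/KeyFile
// settings of a TLS-enabled client config.
type tlsFiles struct {
	CAFile, CertFile, KeyFile string
}

// fingerprint hashes each path together with the bytes currently on disk,
// so a renewed certificate written to the same path changes the result.
func fingerprint(f tlsFiles) ([32]byte, error) {
	h := sha256.New()
	for _, p := range []string{f.CAFile, f.CertFile, f.KeyFile} {
		data, err := os.ReadFile(p)
		if err != nil {
			return [32]byte{}, fmt.Errorf("read %s: %w", p, err)
		}
		// Length-prefix each field so adjacent values cannot run together.
		fmt.Fprintf(h, "%d:%s|%d:", len(p), p, len(data))
		h.Write(data)
	}
	var sum [32]byte
	copy(sum[:], h.Sum(nil))
	return sum, nil
}

// demo writes three placeholder files, overwrites the cert in place, and
// reports whether the fingerprint noticed the change.
func demo() (bool, error) {
	dir, err := os.MkdirTemp("", "tls-fingerprint")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	files := tlsFiles{
		CAFile:   filepath.Join(dir, "ca.pem"),
		CertFile: filepath.Join(dir, "cert.pem"),
		KeyFile:  filepath.Join(dir, "key.pem"),
	}
	for _, p := range []string{files.CAFile, files.CertFile, files.KeyFile} {
		if err := os.WriteFile(p, []byte("old"), 0o600); err != nil {
			return false, err
		}
	}
	before, err := fingerprint(files)
	if err != nil {
		return false, err
	}
	// Simulate a renewal: same path, different contents.
	if err := os.WriteFile(files.CertFile, []byte("renewed"), 0o600); err != nil {
		return false, err
	}
	after, err := fingerprint(files)
	if err != nil {
		return false, err
	}
	return before != after, nil
}

func main() {
	changed, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("reload needed although paths are unchanged:", changed)
}
```

A reload handler could store the last fingerprint and rebuild the client whenever it differs, regardless of whether the configured paths moved.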

The other downside to all of this is that because Nomad has partial support for SIGHUP reloading, you'd think that you could use some combination of reloads + a full restart to refresh all of your tls configuration - but if you don't orchestrate it right, you run into #3885. This is a major problem which really hurts the operational side of things. It's a shame because other HashiCorp tools like Vault and consul-template already support reloading tls configurations via SIGHUP. I know that Nomad has far more configurations to update, but honestly this has been a big problem for at least the last 3 years I've used Nomad. Rolling your own PKI with Vault and using that in your hashistack cluster should be a best practice!

I would really like to help address this problem but I think this may require some significant refactoring to enable this. Any help would be greatly appreciated.

@quinndiggity

Couple notes - for those using:

  • full mTLS, process<->process
  • Consul client certificate requirement/validation
  • remote Consul api (for a variety of reasons)
  • etc

It is crucial that Nomad reload these properties on a SIGHUP (as otherwise it requires an unacceptably risky restart vs what should be a simple reload):

consul {
  ca_file   = "/etc/consul.d/tls/tls.ca.d/vault.ca.example.internal.chain.pem"
  cert_file = "/etc/consul.d/tls/tls.crt.pem"
  key_file  = "/etc/consul.d/tls/tls.key.pem"
}

Seeing as Consul Connect by design/default uses a CA which itself rotates frequently and autonomously, all three of key_file, cert_file, and ca_file should absolutely be reloaded.
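
For illustration, here is roughly what reloading the client keypair on SIGHUP can look like in Go. This is a sketch under assumptions, not Nomad's code: `keypairReloader` and its methods are hypothetical names, and only `cert_file` and `key_file` are covered. Refreshing `ca_file` would need a separate mechanism, such as rebuilding the client with a new `RootCAs` pool.

```go
package main

import (
	"crypto/tls"
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// keypairReloader is a hypothetical helper that keeps the most recently
// loaded client certificate behind a lock.
type keypairReloader struct {
	mu       sync.RWMutex
	certFile string
	keyFile  string
	cert     *tls.Certificate
}

// newKeypairReloader loads the keypair once so startup fails fast on bad files.
func newKeypairReloader(certFile, keyFile string) (*keypairReloader, error) {
	r := &keypairReloader{certFile: certFile, keyFile: keyFile}
	if err := r.reload(); err != nil {
		return nil, err
	}
	return r, nil
}

// reload re-reads both files; on error the previous keypair stays in use.
func (r *keypairReloader) reload() error {
	cert, err := tls.LoadX509KeyPair(r.certFile, r.keyFile)
	if err != nil {
		return err
	}
	r.mu.Lock()
	r.cert = &cert
	r.mu.Unlock()
	return nil
}

// getClientCertificate matches tls.Config.GetClientCertificate, which the
// TLS stack calls during a handshake when the server requests a client cert.
func (r *keypairReloader) getClientCertificate(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.cert, nil
}

func main() {
	r, err := newKeypairReloader("/etc/consul.d/tls/tls.crt.pem", "/etc/consul.d/tls/tls.key.pem")
	if err != nil {
		log.Fatal(err)
	}
	// This config would be handed to the HTTP transport used for Consul.
	cfg := &tls.Config{GetClientCertificate: r.getClientCertificate}
	_ = cfg

	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	for range hup {
		if err := r.reload(); err != nil {
			log.Printf("keeping previous keypair: %v", err)
		}
	}
}
```

Because the callback runs per handshake, only new connections pick up the renewed keypair. Connections that are already established keep the certificate they negotiated with.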

@quinndiggity

For our use case, we bootstrap Consul/Nomad client agents via a preinstalled root CA cert (air-gapped, with several Vault clusters' signed as intermediaries w/ depth: 0 for Consul Connect, etc) in the trust chain, along with AppRole credentials to request their initial client certificate keypairs via vault-agent. Once up, these clients are responsible for keeping their issued client keypairs renewed via vault-agent which performs the following after successful signs:

sudo systemctl reload nomad    #### nomad ALL=(root) NOPASSWD: /usr/bin/systemctl reload nomad

This is all well and good, until Nomad completely fails on an expired Consul client certificate, despite having a valid keypair available and configured (and ignoring it on reload).

@quinndiggity

Otherwise, is there an intention in the future to perhaps have Nomad work with Consul's auto encryption feature?

auto_encrypt = {
  tls = true # clients
}

☝🏻 this, but on the Nomad end of things, would be pretty great 👍🏻

@quinndiggity

Any updates on intentions around this issue?

@quinndiggity

Would be great not to need to coordinate multiple services' safe restarts solely because nomad doesn't reload client tls keypair on sighup as it should.

@nvx
Contributor

nvx commented Sep 7, 2021

> Would be great not to need to coordinate multiple services' safe restarts solely because nomad doesn't reload client tls keypair on sighup as it should.

Something that was not immediately obvious to me at first is that restarting the Nomad agent on a client does not interrupt existing allocs as long as the agent comes back up sufficiently quickly (perhaps within the heartbeat time which defaults to 10s). So gracefully reloading a nomad client becomes less important.

@quinndiggitypolymath

> Something that was not immediately obvious to me at first is that restarting the Nomad agent on a client does not interrupt existing allocs as long as the agent comes back up sufficiently quickly (perhaps within the heartbeat time which defaults to 10s). So gracefully reloading a nomad client becomes less important.

the several services in this case aren't the workload, but rather Nomad/Vault/Consul sidecars, since these all have issues with selectively reloading client tls agent credentials (whether it's nomad->consul, vault->consul, or vault agent not reloading credentials at all hashicorp/vault#8216 )

yes, the workloads aren't usually directly affected, but it becomes a much more difficult problem when coordinating a restart of nomad + vault + consul sidecars + mesh gateways + vault agent instances

@nvx
Contributor

nvx commented Sep 13, 2021

> the several services in this case aren't the workload, but rather Nomad/Vault/Consul sidecars, since these all have issues with selectively reloading client tls agent credentials (whether it's nomad->consul, vault->consul, or vault agent not reloading credentials at all hashicorp/vault#8216 )
>
> yes, the workloads aren't usually directly affected, but it becomes a much more difficult problem when coordinating a restart of nomad + vault + consul sidecars + mesh gateways + vault agent instances

This sounds very similar to what I'm doing on my Nomad workers.

Consul reloads TLS certificates on SIGHUP fine (or consul reload against the local agent) which is important because this would cause services to disappear briefly if consul required a restart. So no coordination needed here.

Restarting Vault Agent will cause secrets to be reissued. I could see this being an issue if Vault Agent has a listener running acting as a proxy, but if you don't use this pattern (I don't, I don't think Nomad supports it, at least not for clients anyway) then no coordination needed. In practice I very rarely need to restart the Vault Agent since it's the piece that's getting credentials automatically for everything else, so there would be a bit of a dependency problem if it was trying to arrange its own credentials too. Maybe if you were using TLS client certificates for Vault (I'm using AppRole these days) I could see this being more likely, but I'd be curious to know how you're managing that TLS certificate at that point.

And finally Nomad, which for clients needs a restart, but since it doesn't interrupt running allocs and reattaches fine also no coordination needed.

I think some better docs could be useful (especially something like "here's how you can run nomad+consul+vault together with security options all turned on with Vault managing secrets", bonus points if Terraform provisions it all too) since I know I fell into some incorrect assumptions early on, so I imagine others would too, but I'm not seeing any hard blockers in today's capabilities?

@quinndiggity

Ended up having to abandon vault agent, as there are too many instances across the product suite where client tls credentials aren't reloaded.

For those discovering this ticket when encountering these issues, I've found a periodic nomad batch job making use of consul lock to be effective in mitigating the issues that stem from not coordinating restarts with their effect on raft leadership status across Vault/Consul/Nomad.

In an ideal world, all of these would reload their client tls keypairs, but currently don't:

  • Vault: doesn't reload Consul client tls keypair; results in Vault failing to update service status (has valid keypair on disk, uses pair loaded on last restart anyways, fails to connect to Consul, will report Consul service registration as healthy forever as a result, new leaders are not elected)
  • Nomad: doesn't reload Consul client tls keypair; results in Nomad failing to register/deregister/update Consul service registration status (has valid keypair on disk, uses pair loaded on last restart anyways, fails to connect to Consul, jobs will hang without any feedback/debug information)
  • Vault Agent: doesn't reload client tls credentials at all
  • Connect Sidecars+Mesh Gateways: similar deal; will fail to talk to Consul gRPC, and the wrapper consul process doesn't reload the envoy process on SIGHUP

Long and short of it, as it is today, there are several areas which prevent full mTLS process<->process when using Consul/Vault/Nomad, forcing restarts which require coordination - even with coordination, there are still issues users will encounter; for example, mesh gateways will fail and be restarted/rescheduled, batch jobs can be killed/restarted, etc

Would reeeeeeeeally appreciate it if every process which uses client TLS keypairs also reloads those keypairs; biggest headache is when it's a 50/50 chance whether a reload actually reloads everything it should. If a client TLS keypair is needed, that TLS keypair comes with an expiry, so it is completely nonsensical to allow the use of an expired keypair to continue on when it has been reloaded with new credentials in several other areas using the same keypair.
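
Since every client keypair carries an expiry, a process could at least detect that the certificate it holds is close to expiring and treat that as a cue to re-read it from disk. A minimal sketch, assuming hypothetical helpers `expiresWithin` and `selfSigned` (the latter only produces a throwaway certificate for the demo):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// expiresWithin reports whether the first certificate in certPEM reaches its
// NotAfter time at or before now+window.
func expiresWithin(certPEM []byte, now time.Time, window time.Duration) (bool, error) {
	block, _ := pem.Decode(certPEM)
	if block == nil || block.Type != "CERTIFICATE" {
		return false, errors.New("no CERTIFICATE PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return false, err
	}
	return !now.Add(window).Before(cert.NotAfter), nil
}

// selfSigned creates a throwaway self-signed certificate that expires at
// notAfter. It exists only to give the demo something to inspect.
func selfSigned(notAfter time.Time) ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "demo"},
		NotBefore:    notAfter.Add(-24 * time.Hour),
		NotAfter:     notAfter,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

// demo checks a certificate valid for one more hour against a two-hour
// window and a ten-minute window.
func demo() (soon, later bool, err error) {
	now := time.Now()
	certPEM, err := selfSigned(now.Add(time.Hour))
	if err != nil {
		return false, false, err
	}
	if soon, err = expiresWithin(certPEM, now, 2*time.Hour); err != nil {
		return false, false, err
	}
	later, err = expiresWithin(certPEM, now, 10*time.Minute)
	return soon, later, err
}

func main() {
	soon, later, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("expires within 2h:", soon)
	fmt.Println("expires within 10m:", later)
}
```

This only detects the condition. It complements, rather than replaces, reloading the keypair on SIGHUP.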

@quinndiggity

Cascading failures from Vault not reloading the Consul client TLS keypair can then cause every other downstream system to fail, miss renewals, and force a bootstrap of the TLS keypair before anything can go green again. Massively annoying when graceful reload/shutdown are common sense, particularly in long-lived services which are dependent on TLS and involve leadership/quorum considerations when requiring a restart.

Why reload only 1/3 of the places valid TLS keypairs are required? Implementing a code path which requires TLS, but doesn't reload the keypair, means every other implementation which DOES support reloads is dead code, because they are useless when 2/3 of connections fail entirely without a restart.

@quinndiggity

If 100% of TLS keypairs in use were properly reloaded, end users wouldn't be needing to consider all of the following areas and whether or not they will need a full restart and how they affect each other:

  • vault
  • vault's raft cluster
  • vault's client connection to consul (service registration)
  • consul connect sidecars
  • consul connect mesh gateways
  • consul
  • consul's client connection to vault (connect ca)
  • nomad
  • nomad's client connection to consul (service discovery/registration)
  • nomad's client connection to vault
  • vault agent's connection to vault
  • several others I'm forgetting because, as you can see, this is a problem which affects a lot of systems when you consider downstream services which consume the above services in some manner

@quinndiggitypolymath

This is still a problem; would be great not having to fully restart nomad across a ridiculous number of instances every day just because tls keypairs are only partially reloaded

@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation May 16, 2022
@jrasell jrasell moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage May 16, 2022

8 participants