[feature] Support graceful reloading of TLS configuration #3247

Closed
SoMuchToGrok opened this issue Sep 19, 2017 · 9 comments


@SoMuchToGrok

SoMuchToGrok commented Sep 19, 2017

In the TLS configuration stanza for Nomad servers and clients, any change to cert_file, ca_file, or key_file requires a full service restart. This is cumbersome in an environment that rotates its PKI frequently.

tls {
  http = true
  rpc  = true

  ca_file   = "/usr/local/share/ca-certificates/vault.crt"
  cert_file = "/secrets/nomad/server.crt"
  key_file  = "/secrets/nomad/server.key"

  verify_server_hostname = true
}

Would it be feasible to apply these changes on the fly by sending a SIGHUP? For context, this is the pattern the Vault server uses (see PR here). This would be an invaluable feature for us and would drastically reduce the complexity of maintaining highly-available systems.
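
To make the request concrete, here is a minimal Go sketch of the general SIGHUP-driven reload pattern, built on the standard library's tls.Config.GetCertificate callback. It is an illustration under stated assumptions, not Nomad's or Vault's actual implementation: certReloader, writeSelfSigned, and rotateDemo are hypothetical names, the demo generates throwaway self-signed certificates in a temp directory instead of reading /secrets/nomad, and sending SIGHUP to the current process via syscall.Kill assumes a Unix-like system.

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"os"
	"os/signal"
	"path/filepath"
	"sync"
	"syscall"
	"time"
)

// certReloader is a hypothetical helper for illustration, not Nomad's or Vault's code.
type certReloader struct {
	mu                sync.RWMutex
	certFile, keyFile string
	cert              *tls.Certificate
}

// Reload re-reads the key pair from disk; on error the previous pair stays active.
func (r *certReloader) Reload() error {
	cert, err := tls.LoadX509KeyPair(r.certFile, r.keyFile)
	if err != nil {
		return fmt.Errorf("reload key pair: %w", err)
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cert = &cert
	return nil
}

// GetCertificate matches the tls.Config.GetCertificate callback, e.g.
// &tls.Config{GetCertificate: r.GetCertificate}, so each new handshake is
// served the most recently loaded pair.
func (r *certReloader) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.cert, nil
}

func check(err error) {
	if err != nil {
		panic(err)
	}
}

// writeSelfSigned writes a throwaway self-signed pair; demo-only, panics on error.
func writeSelfSigned(certFile, keyFile string) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	check(err)
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "demo"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	check(err)
	keyDER, err := x509.MarshalECPrivateKey(key)
	check(err)
	check(os.WriteFile(certFile, pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), 0o600))
	check(os.WriteFile(keyFile, pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER}), 0o600))
}

// rotateDemo rotates a pair on disk, sends SIGHUP to this process (Unix only),
// reloads, and reports whether the served certificate changed.
func rotateDemo() bool {
	dir, err := os.MkdirTemp("", "tls-reload")
	check(err)
	defer os.RemoveAll(dir)
	r := &certReloader{certFile: filepath.Join(dir, "server.crt"), keyFile: filepath.Join(dir, "server.key")}
	writeSelfSigned(r.certFile, r.keyFile)
	check(r.Reload())
	old, _ := r.GetCertificate(nil)
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	defer signal.Stop(hup)
	writeSelfSigned(r.certFile, r.keyFile) // simulate a PKI rotation
	check(syscall.Kill(os.Getpid(), syscall.SIGHUP))
	<-hup // a long-running agent would loop over this channel instead
	check(r.Reload())
	cur, _ := r.GetCertificate(nil)
	return !bytes.Equal(old.Certificate[0], cur.Certificate[0])
}

func main() {
	fmt.Println("certificate changed after SIGHUP:", rotateDemo())
}
```

The point of routing through a callback rather than a static Certificates slice is that the listener never has to be rebuilt: only handshakes that start after the reload see the new pair, and a failed reload leaves the old pair in place. Reloading ca_file would need a separate mechanism that this sketch does not cover.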

Currently, we're using triggers from consul-template combined with consul locks to maintain server quorum. Adapting this pattern to the Nomad clients is a little more challenging, as consul lock timeouts (as well as consul-template timeouts) need to scale proportionally to the number of clients in the cluster. On the client side, we've accepted the fact that running jobs will occasionally have momentary blips due to PKI rotations.

It would be fantastic to avoid these workarounds and limitations. How complex would this be? I'm trying to understand whether this is something we can help implement.

@schmichael
Member

While Nomad currently supports SIGHUP-based config reloading, it's unfortunately limited to log level and Vault. Thanks for the good writeup and use case. This seems like something Nomad should handle.

@SoMuchToGrok
Author

SoMuchToGrok commented Nov 6, 2017

I think this might be it :)

#3492

Will be paying attention to that pull request in the coming days.

@dadgar
Contributor

dadgar commented Nov 17, 2017

@SoMuchToGrok This PR actually adds the ability to SIGHUP and reload certs: #3479

The PR you linked will be a bit more comprehensive and will allow adding or removing TLS altogether with a SIGHUP.

I am closing as this has been merged into master and will land in 0.7.1 👍

@dadgar closed this as completed Nov 17, 2017

@SoMuchToGrok
Author

This feature will make my life so much easier. Keep up the great work Hashi team!

@SoMuchToGrok
Author

There may have been a regression in this logic sometime after v0.7.1, specifically around reloading the RPC TLS config. After upgrading from v0.7.1 to v0.8.3, it appears that all RPC communication fails after sending a SIGHUP to a process when the contents of the TLS certificates have changed (just the data itself, not the location of the certs on the filesystem). I'll open a new ticket if I can confirm this.

Nomad client logs

Jun 12 14:28:30 nomad-c-stag-2-187 nomad[7210]: nomad: "Node.UpdateStatus" RPC failed to server 10.3.2.10:4647: rpc error: EOF
Jun 12 14:28:30 nomad-c-stag-2-187 nomad[7210]: client: heartbeating failed. Retrying in 1.269142916s: failed to update status: rpc error: EOF
Jun 12 14:29:21 nomad-c-stag-2-187 nomad[7210]:     2018/06/12 14:29:21.122561 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 10.3.2.10:4647: rpc error: EOF
Jun 12 14:29:21 nomad-c-stag-2-187 nomad[7210]:     2018/06/12 14:29:21.122633 [ERR] client: heartbeating failed. Retrying in 1.526904628s: failed to update status: rpc error: EOF

Task state logs (retrieved via Nomad API)

vault: failed to derive token: DeriveVaultToken RPC failed: rpc error: EOF

@schmichael
Member

@SoMuchToGrok I tested certificate reloading in 0.8.4 and ran into issues that I mentioned in #4408. I didn't see RPC issues, but I was only running a single dev agent. Feel free to add comments to that issue or open up a new one. Sorry for the trouble!

@schmichael
Member

@SoMuchToGrok The team looked into this, and I wanted to mention two things we found:

  1. The EOFs are to be expected: existing connections using the old certificates are closed. Everything should reconnect and use the new certificates and not cause any further errors.
  2. Issue #4408 ("Certificate not reloaded on SIGHUP for dev agents") only affects dev agents. SIGHUP works properly for regular clients and servers.
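
The expected behavior in point 1 (an EOF on a connection made under the old certificates, followed by a reconnect that picks up the new ones) can be sketched roughly as follows in Go. This is only an illustration of the redial-on-EOF idea, not Nomad's RPC code: rpcConn, dialFunc, callWithRedial, fakeConn, and redialDemo are all hypothetical names, and the fake connection stands in for a real TLS connection.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// rpcConn and dialFunc are hypothetical stand-ins, not Nomad types.
type rpcConn interface {
	Call(method string) error
}

type dialFunc func() (rpcConn, error)

// callWithRedial issues one call. If the connection was closed underneath it
// (io.EOF), it dials a fresh connection and tries exactly once more. With a
// real TLS dialer, the fresh connection would perform a new handshake and so
// see the reloaded certificate.
func callWithRedial(conn rpcConn, dial dialFunc, method string) (rpcConn, error) {
	err := conn.Call(method)
	if !errors.Is(err, io.EOF) {
		return conn, err
	}
	fresh, derr := dial()
	if derr != nil {
		return conn, fmt.Errorf("redial after EOF: %w", derr)
	}
	return fresh, fresh.Call(method)
}

// fakeConn returns io.EOF once closed, mimicking a server that dropped the
// connections established under the old certificate.
type fakeConn struct{ closed bool }

func (c *fakeConn) Call(string) error {
	if c.closed {
		return io.EOF
	}
	return nil
}

// redialDemo starts from a stale connection and reports how many redials were
// needed and the final error (nil when the retry succeeded).
func redialDemo() (int, error) {
	dials := 0
	dial := func() (rpcConn, error) {
		dials++
		return &fakeConn{}, nil
	}
	_, err := callWithRedial(&fakeConn{closed: true}, dial, "Node.UpdateStatus")
	return dials, err
}

func main() {
	dials, err := redialDemo()
	fmt.Println("redials:", dials, "err:", err)
}
```

If the redialed connection also returned EOF, the error would surface to the caller instead of looping, which is the situation where EOFs that never stop would point at something other than stale connections.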

@SoMuchToGrok
Author

@schmichael appreciate you and the team taking time to look into it.

I tried to reproduce #4408 and I can confirm that I'm not experiencing that.

I'm still trying to make sense of the issue I'm seeing. From what I've seen so far, the EOF errors never stop. Additionally, once the EOF errors start, all Vault interactions fail with an RPC EOF error. I've been able to reproduce this in 2 different environments now; the only resolution so far has been restarting the nomad service.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Nov 29, 2022