[feature] Support graceful reloading of TLS configuration #3247

SoMuchToGrok · 2017-09-19T11:59:00Z

Within the TLS configuration stanza for nomad servers and clients, changes to cert_file, ca_file, or key_file require a full service restart. This can be cumbersome in an environment that frequently rotates its PKI.

tls {
  http = true
  rpc  = true

  ca_file   = "/usr/local/share/ca-certificates/vault.crt"
  cert_file = "/secrets/nomad/server.crt"
  key_file  = "/secrets/nomad/server.key"

  verify_server_hostname = true
}

Is it feasible to implement on-the-fly changes by sending a SIGHUP? For context, this is the pattern that Vault server utilizes, see PR here. This would be an invaluable feature for us and would drastically reduce the complexity of maintaining highly-available systems.
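To make the request concrete, here is a minimal Go sketch of the SIGHUP pattern: the key pair is re-read from the same paths when the signal arrives, and new handshakes pick it up through `tls.Config.GetCertificate`. The `certReloader` type, its method names, and the listen address are hypothetical and chosen for illustration. This is not Nomad's or Vault's actual implementation, and it does not cover reloading `ca_file`.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// certReloader is a hypothetical helper (not Nomad or Vault code) that keeps
// the current key pair in memory and swaps it when asked to reload.
type certReloader struct {
	mu       sync.RWMutex
	certFile string
	keyFile  string
	cert     *tls.Certificate
}

// newCertReloader loads the key pair once so that startup fails fast on bad files.
func newCertReloader(certFile, keyFile string) (*certReloader, error) {
	r := &certReloader{certFile: certFile, keyFile: keyFile}
	if err := r.reload(); err != nil {
		return nil, err
	}
	return r, nil
}

// reload re-reads the key pair from disk. On failure the previous pair is kept.
func (r *certReloader) reload() error {
	cert, err := tls.LoadX509KeyPair(r.certFile, r.keyFile)
	if err != nil {
		return err
	}
	r.mu.Lock()
	r.cert = &cert
	r.mu.Unlock()
	return nil
}

// getCertificate has the signature expected by tls.Config.GetCertificate, so
// every new handshake is served with whatever pair was loaded most recently.
func (r *certReloader) getCertificate(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.cert, nil
}

// watchSIGHUP reloads the key pair each time the process receives SIGHUP.
func (r *certReloader) watchSIGHUP() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGHUP)
	go func() {
		for range ch {
			if err := r.reload(); err != nil {
				fmt.Fprintln(os.Stderr, "tls reload failed, keeping old cert:", err)
			}
		}
	}()
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: reload <cert_file> <key_file>")
		return
	}
	r, err := newCertReloader(os.Args[1], os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, "initial load failed:", err)
		os.Exit(1)
	}
	r.watchSIGHUP()

	// The address is an arbitrary choice for this sketch.
	srv := &http.Server{
		Addr:      "127.0.0.1:8443",
		TLSConfig: &tls.Config{GetCertificate: r.getCertificate},
	}
	// Empty file arguments are allowed because GetCertificate is set.
	fmt.Fprintln(os.Stderr, srv.ListenAndServeTLS("", ""))
}
```

With something like this, rotating the files in place and then running `kill -HUP <pid>` would be enough for new connections to use the new certificate. Connections that are already established keep the certificate they negotiated with.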

Currently, we're using triggers from consul-template combined with consul locks to maintain server quorum. Adapting this pattern to the nomad clients is a little more challenging, as consul lock timeouts (as well as consul-template timeouts) need to scale proportionally to the number of clients in the cluster. On the client side, we've accepted the fact that running jobs will occasionally have momentary blips due to PKI rotations.

It would be fantastic to avoid these workarounds and limitations. How complex would this be? I'm trying to get an understanding if this is something we can help implement.

schmichael · 2017-09-25T18:28:03Z

While Nomad currently supports SIGHUP-based config reloading, it's unfortunately limited to log level and Vault. Thanks for the good writeup and use case. This seems like something Nomad should handle.

SoMuchToGrok · 2017-11-06T17:26:53Z

I think this might be it :)

#3492

Will be paying attention to that pull request in the coming days.

dadgar · 2017-11-17T00:39:31Z

@SoMuchToGrok This PR actually adds the ability to SIGHUP and reload certs: #3479

The PR you linked will be a bit more comprehensive and will allow adding or removing TLS altogether with a SIGHUP.

I am closing as this has been merged into master and will land in 0.7.1 👍

SoMuchToGrok · 2017-11-17T13:03:35Z

This feature will make my life so much easier. Keep up the great work Hashi team!

SoMuchToGrok · 2018-06-12T15:03:09Z

There may have been a regression with this logic sometime after v0.7.1, specifically around reloading the RPC TLS config. After upgrading from v0.7.1 to v0.8.3, it appears that all RPC communication fails after sending a SIGHUP to a process when the contents of the TLS certificates have changed (just the data itself, not the location of the certs on the filesystem). I'll open up a new ticket if I can confirm this.

Nomad client logs

Jun 12 14:28:30 nomad-c-stag-2-187 nomad[7210]: nomad: "Node.UpdateStatus" RPC failed to server 10.3.2.10:4647: rpc error: EOF
Jun 12 14:28:30 nomad-c-stag-2-187 nomad[7210]: client: heartbeating failed. Retrying in 1.269142916s: failed to update status: rpc error: EOF
Jun 12 14:29:21 nomad-c-stag-2-187 nomad[7210]:     2018/06/12 14:29:21.122561 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 10.3.2.10:4647: rpc error: EOF
Jun 12 14:29:21 nomad-c-stag-2-187 nomad[7210]:     2018/06/12 14:29:21.122633 [ERR] client: heartbeating failed. Retrying in 1.526904628s: failed to update status: rpc error: EOF

Task state logs (retrieved via Nomad API)

vault: failed to derive token: DeriveVaultToken RPC failed: rpc error: EOF

schmichael · 2018-06-12T18:58:35Z

@SoMuchToGrok I tested certificate reloading in 0.8.4 and ran into issues that I mentioned in #4408. I didn't see RPC issues, but I was only running a single dev agent. Feel free to add comments to that issue or open up a new one. Sorry for the trouble!

schmichael · 2018-06-12T20:52:06Z

@SoMuchToGrok The team looked into this, and I wanted to mention two things we found:

The EOFs are to be expected: existing connections using the old certificates are closed. Everything should reconnect and use the new certificates and not cause any further errors.
#4408 (Certificate not reloaded on SIGHUP for dev agents) only affects dev agents. SIGHUP works properly for regular clients and servers.

SoMuchToGrok · 2018-06-13T11:42:56Z

@schmichael appreciate you and the team taking time to look into it.

I tried to reproduce #4408 and I can confirm that I'm not experiencing that.

I'm still trying to make sense of what issue I'm seeing. From what I've seen so far, the EOF errors never stop and continue endlessly. Additionally, once the EOF errors do start, all Vault interactions fail with an RPC EOF error. I've been able to reproduce this in 2 different environments now - the only resolution so far has been restarting the nomad service.

github-actions · 2022-11-29T02:18:49Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

schmichael added theme/config type/enhancement theme/tls labels Sep 25, 2017

dadgar closed this as completed Nov 17, 2017

Xopherus mentioned this issue Nov 13, 2019

[feature] allow reload of Consul config stanza and Consul client #4593

Open

github-actions bot locked as resolved and limited conversation to collaborators Nov 29, 2022
