
Vault (Consul, HA, behind ELB) no longer works after a while #4578

Closed
dictvm opened this issue May 16, 2018 · 6 comments

Comments

@dictvm
Contributor

dictvm commented May 16, 2018

Environment:

  • Vault Version: Vault v0.10.1 ('756fdc4587350daf1c65b93647b2cc31a6f119cd')
  • Operating System/Architecture: Ubuntu 16.04.4 LTS

Vault Config File:

api_addr = "https://vault.ivx.cloud:8200"
ui = 1

storage "consul" {
  address = "localhost:8500" # local agent
  cluster_address = "https://vault.ivx.cloud:8201"
  api_addr = "https://consul.ivx.cloud:443"
  path    = "vault"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault.d/Vault.crt"
  tls_key_file = "/etc/vault.d/Vault.key"
}

listener "tcp" {
  address     = "0.0.0.0:8210" # local vault cli/api access
  tls_disable = 1
}

telemetry {
}

disable_mlock = "true"

Startup Log Output:

May 16 12:08:12 ip-172-21-71-220 vault[4495]: ==> Vault server configuration:
May 16 12:08:12 ip-172-21-71-220 vault[4495]:              Api Address: https://vault.ivx.cloud:8200
May 16 12:08:12 ip-172-21-71-220 vault[4495]:                      Cgo: disabled
May 16 12:08:12 ip-172-21-71-220 vault[4495]:          Cluster Address: https://vault.ivx.cloud:8201
May 16 12:08:12 ip-172-21-71-220 vault[4495]:               Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", tls: "enabled")
May 16 12:08:12 ip-172-21-71-220 vault[4495]:               Listener 2: tcp (addr: "0.0.0.0:8210", cluster address: "0.0.0.0:8211", tls: "disabled")
May 16 12:08:12 ip-172-21-71-220 vault[4495]:                Log Level: info
May 16 12:08:12 ip-172-21-71-220 vault[4495]:                    Mlock: supported: true, enabled: false
May 16 12:08:12 ip-172-21-71-220 vault[4495]:                  Storage: consul (HA available)
May 16 12:08:12 ip-172-21-71-220 vault[4495]:                  Version: Vault v0.10.1
May 16 12:08:12 ip-172-21-71-220 vault[4495]:              Version Sha: 756fdc4587350daf1c65b93647b2cc31a6f119cd
May 16 12:08:12 ip-172-21-71-220 vault[4495]: ==> Vault server started! Log data will stream in below:
May 16 12:08:12 ip-172-21-71-220 vault[4495]: 2018-05-16T12:08:12.950Z [WARN ] storage.consul: appending trailing forward slash to path

Output from Node #1:

May 16 11:58:55 ip-172-21-71-220 vault[4300]: 2018-05-16T11:58:55.637Z [ERROR] core: handleRequestForwarding: error forwarding request: error="error during forwarding RPC request"
May 16 11:58:55 ip-172-21-71-220 vault[4300]: 2018-05-16T11:58:55.637Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
May 16 11:58:55 ip-172-21-71-220 vault[4300]: 2018-05-16T11:58:55.637Z [ERROR] core: handleRequestForwarding: error forwarding request: error="error during forwarding RPC request"
May 16 11:58:55 ip-172-21-71-220 vault[4300]: 2018-05-16T11:58:55.637Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
May 16 11:58:55 ip-172-21-71-220 vault[4300]: 2018-05-16T11:58:55.637Z [ERROR] core: handleRequestForwarding: error forwarding request: error="error during forwarding RPC request"
May 16 11:58:55 ip-172-21-71-220 vault[4300]: 2018-05-16T11:58:55.637Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
May 16 11:58:55 ip-172-21-71-220 vault[4300]: 2018-05-16T11:58:55.637Z [ERROR] core: handleRequestForwarding: error forwarding request: error="error during forwarding RPC request"
May 16 12:00:53 ip-172-21-71-220 vault[4300]: 2018-05-16T12:00:53.095Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
May 16 12:00:53 ip-172-21-71-220 vault[4300]: 2018-05-16T12:00:53.095Z [ERROR] core: handleRequestForwarding: error forwarding request: error="error during forwarding RPC request"
May 16 12:00:53 ip-172-21-71-220 vault[4300]: 2018-05-16T12:00:53.095Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
May 16 12:00:53 ip-172-21-71-220 vault[4300]: 2018-05-16T12:00:53.095Z [ERROR] core: handleRequestForwarding: error forwarding request: error="error during forwarding RPC request"

Output from Node #2:

May 16 12:08:13 ip-172-21-72-34 vault[3990]: 2018-05-16T12:08:13.004Z [INFO ] core: post-unseal setup complete
May 16 12:08:13 ip-172-21-72-34 vault[3990]: 2018-05-16T12:08:13.007Z [INFO ] core: core/startClusterListener: starting listener: listener_address=0.0.0.0:8201
May 16 12:08:13 ip-172-21-72-34 vault[3990]: 2018-05-16T12:08:13.007Z [INFO ] core: core/startClusterListener: serving cluster requests: cluster_listen_address=[::]:8201
May 16 12:08:13 ip-172-21-72-34 vault[3990]: 2018-05-16T12:08:13.008Z [INFO ] core: core/startClusterListener: starting listener: listener_address=0.0.0.0:8211
May 16 12:08:13 ip-172-21-72-34 vault[3990]: 2018-05-16T12:08:13.008Z [INFO ] core: core/startClusterListener: serving cluster requests: cluster_listen_address=[::]:8211
May 16 12:08:21 ip-172-21-72-34 vault[3990]: http: TLS handshake error from 172.21.62.245:36785: EOF
May 16 12:09:04 ip-172-21-72-34 vault[3990]: http: TLS handshake error from 172.21.62.245:36821: EOF
May 16 12:09:04 ip-172-21-72-34 vault[3990]: http: TLS handshake error from 172.21.62.245:36823: EOF

Expected Behavior:
Standby nodes should keep forwarding requests from the ELB to the active HA node, even after the cluster has been unsealed for more than a few minutes.

Actual Behavior:
Once all Vault nodes have been unsealed for more than a few minutes, standby nodes receive requests from the ELB that they can no longer forward to the active node.

Steps to Reproduce:
Set up 3 Vault nodes using Consul as the storage backend with the provided config.

Important Factoids:
All 3 Vault nodes are behind an AWS ELB. While this isn't atypical, it was once recommended not to place all instances behind a load balancer. However, I don't want to be forced to use Consul DNS for every component that needs to talk to Vault.

References:
none

@jefferai
Member

Your config file has two different API address values; the one at the top level, which is where it should be, uses port 8200, not 443. I'm guessing that's not the expected address through your load balancer. Additionally, you have a cluster_address field in your storage configuration; that's not a valid key. You probably want cluster_addr, but that should also be at the top level. Once you fix these things, hopefully it will just work.
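
Put concretely, a config along those lines might look like this (a sketch only, reusing the hostnames from the report; adjust the ports to whatever the ELB actually exposes):

api_addr     = "https://vault.ivx.cloud:8200" # or the address/port clients reach through the ELB
cluster_addr = "https://vault.ivx.cloud:8201" # top-level cluster_addr, not cluster_address

storage "consul" {
  address = "localhost:8500" # local agent
  path    = "vault"
}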

@dictvm
Contributor Author

dictvm commented May 22, 2018

@jefferai thanks!

My load balancer isn't supposed to MITM TLS traffic, so the Vault service should stay on port 8200 while the ELB listens on 443 and TCP-proxies the traffic to port 8200.
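
For reference, a passthrough listener of that shape might look like the following Terraform sketch (the resource name and subnet are placeholders; the point is "tcp" on both sides, so the ELB never terminates TLS):

resource "aws_elb" "vault" {      # hypothetical name
  name    = "vault-elb"           # placeholder
  subnets = ["subnet-xxxxxxxx"]   # placeholder

  listener {
    lb_port           = 443       # clients connect to 443
    lb_protocol       = "tcp"     # plain TCP, no TLS termination on the ELB
    instance_port     = 8200      # Vault's API listener
    instance_protocol = "tcp"
  }
}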

My Vault config.hcl currently looks like this:

api_addr = "https://vault.corp.tld:8200"
cluster_addr = "https://vault.corp.tld:8201"
ui = 1

storage "consul" {
  address = "localhost:8500" # local agent
  path    = "vault"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault.d/Vault.crt"
  tls_key_file = "/etc/vault.d/Vault.key"
}

listener "tcp" {
  address     = "0.0.0.0:8210" # local vault cli/api access
  tls_disable = 1
}

However, now my ELB can no longer reach Vault, even though my instances are still InService according to AWS. I'm confused.

This is my understanding of how Vault should behave behind an ELB:

  • Do TLS on the Vault cluster itself
  • Use the ELB as a TCP Proxy with source port 443 and target port 8201
  • Point the ELB to the cluster address being advertised upon starting Vault, e.g. https://vault.corp.tld:8201
  • Each standby node getting traffic should forward the query to the active node
  • I should reach the Vault API and UI by setting VAULT_ADDR to https://vault.corp.tld

Am I just not getting something?

Thanks!

@jefferai
Member

Use the ELB as a TCP Proxy with source port 443 and target port 8201

This is likely the issue.

Vault uses two ports: normally, port 8200 serves the API and UI, and port 8201 carries cluster traffic only. If you are pointing your ELB at 8201, clients can't reach the API, so you want 443 on the ELB to point to 8200, not 8201.

This probably comes down to what you want going through the ELB. In general I think it's a bad idea for cluster traffic to go through the ELB, and you should probably have nodes talk to each other directly -- but in that case you need each node to advertise its own address, not the ELB's, in its cluster_addr setting. If you do want cluster traffic to go through the ELB (again, not recommended), pick a port on the ELB like 444 and set cluster_addr on all nodes to that.

On the API side, if you have 443 going to Vault port 8200, you really want your api_addr set to port 443.
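
So, per this advice, each node's top-level addresses might end up looking something like this (a sketch; the per-node hostnames are hypothetical):

# Node 1: API traffic goes through the ELB, cluster traffic goes node-to-node
api_addr     = "https://vault.corp.tld:443"        # the ELB address
cluster_addr = "https://vault-node1.corp.tld:8201" # this node's own address

# Node 2: same api_addr, its own cluster_addr
api_addr     = "https://vault.corp.tld:443"
cluster_addr = "https://vault-node2.corp.tld:8201"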

@dictvm
Contributor Author

dictvm commented May 24, 2018

@jefferai thanks so much for your patience. I just ripped out the ELB and I'm now using DNS round robin to reach the Vault cluster. It's much simpler and thus less error-prone, and all of my issues have disappeared. 👍
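
For anyone reading later: a round-robin setup of that shape can be as simple as one A record with all node IPs behind it. A hypothetical Terraform sketch (the hosted zone ID and IPs are placeholders):

resource "aws_route53_record" "vault" {
  zone_id = "Z0000000000000"   # placeholder hosted zone
  name    = "vault.corp.tld"
  type    = "A"
  ttl     = 30                 # short TTL to limit client-side caching
  records = ["10.0.1.10", "10.0.2.10", "10.0.3.10"] # the three Vault nodes
}

With this, VAULT_ADDR would point at https://vault.corp.tld:8200, and any standby that receives a request forwards it to the active node.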

@dictvm dictvm closed this as completed May 24, 2018
@jefferai
Member

Cool! Glad it's working. And request forwarding from standbys is there specifically to make things like this easier :-)

@h0ppyf33t

Is it possible to run the Vault UI on 443?
