
Standby nodes should proxy traffic, not issue a redirect #443

Closed
bkrodgers opened this issue Jul 21, 2015 · 28 comments

@bkrodgers
Contributor

In issue #389, I asked for the ability to set the health check page to return 200 for standby nodes, so that they live in the load balancer and are ready to take over for a failed node as soon as a new election is made. Thank you for quickly getting that change made. However, now that 0.2 is out and I've tried it, I discovered it doesn't actually work the way I need it to.

It looks like Vault standby nodes don't proxy their traffic to the active node; they simply issue a redirect to the client. That doesn't work, since I have all of my Vault nodes behind a load balancer and not directly accessible from outside a private subnet. Could we consider having Vault proxy instead? As with my earlier request, what I'm trying to do is reduce the downtime when a node fails. If only the active node is in the load balancer, Vault has to first detect the failure and elect a new leader, and then a minimum of 2 health checks (Amazon ELB's minimum) have to pass before the new node comes online.

Ultimately I'd still like to see Active/Active be a possibility (#390), but for now proxying instead of redirecting would help.

@armon
Member

armon commented Jul 22, 2015

Proxying is something we considered, but it's a bit of a nightmare in many other ways. It opens up a lot of new attack vectors that we were hesitant to deal with. Redirecting keeps things as a simple client-to-server connection with no intermediary. It's something we may consider in the longer term, but it's not a simple short-term enhancement.

@OWSM

OWSM commented Feb 9, 2016

@armon Does this mean that DNS-based load balancing is the recommended method?

@jefferai
Member

jefferai commented Feb 9, 2016

@OWSM No; it is a supported method, but there are multiple ways you could load balance. It depends on the constraints of your setup.

@OWSM

OWSM commented Feb 9, 2016

@jefferai Do you have any documentation on the best practices here? I'm in the same boat as @bkrodgers, with my typical model being to toss stuff behind an ELB.

@jefferai
Member

jefferai commented Feb 9, 2016

Minimal downtime using AWS ELB is difficult because of the minimum times required before switching backends -- health checks run on an interval (5 or more seconds) and require consecutive failed/healthy statuses (2 minimum), for an absolute minimum of 10 seconds to switch nodes. ELB simply doesn't have the capabilities of other load balancers to better handle standby situations and direct traffic appropriately.

If you want minimum downtime, our recommendation is to simply not use ELB with Vault. ELB should only be used in TCP mode with Vault, so Vault should be handling the TLS part of the communication with a client already anyways.

This does lead to the question of how to pick which node to talk to. One obvious way is to use service discovery (for instance, via Consul) either via DNS or HTTP to discover the current node. Another is to simply use round robin DNS to let your client connect to any node but be redirected to the proper node; this would make that particular connection take longer, but in a failure scenario it will recover much more quickly.
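
For illustration, a minimal sketch of the HTTP discovery variant (not an official client; it assumes a local Consul agent on its default port, Vault registered under the service name "vault", and the active/standby service tags that later comments in this thread note newer Vault versions register):

```python
# Hedged sketch: ask Consul's catalog HTTP API for the "vault" service and
# pick out the node tagged "active".
import requests  # third-party: pip install requests

def find_active_vault(consul_addr="http://127.0.0.1:8500"):
    resp = requests.get(consul_addr + "/v1/catalog/service/vault", timeout=2)
    resp.raise_for_status()
    for entry in resp.json():
        if "active" in (entry.get("ServiceTags") or []):
            addr = entry["ServiceAddress"] or entry["Address"]
            return "https://{0}:{1}".format(addr, entry["ServicePort"])
    raise RuntimeError("no active Vault node registered in Consul")

print(find_active_vault())
```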

If you still want connections to go via ELB, a third possibility is to have ELB direct traffic to a TCP-proxying-capable service like haproxy or nginx. These can be used in various ways, but two possibilities are using consul-template to keep config files updated and issue a reload as the service statuses change; or, use something like nginx's resolve parameter with a low timeout (such as 1 second) to monitor the current active node via DNS and switch as appropriate.
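
As a rough sketch of the consul-template option (hedged: the "active.vault" tag-filtered service name assumes the active tag mentioned later in this thread, and the haproxy stanza is illustrative rather than a recommended production config):

```
# vault.cfg.tpl -- consul-template input rendering an haproxy backend that
# lists only the node currently tagged "active" in Consul.
backend vault
    mode tcp{{ range service "active.vault" }}
    server {{ .Node }} {{ .Address }}:{{ .Port }} check{{ end }}
```

consul-template would then be run with something like -template "vault.cfg.tpl:/etc/haproxy/vault.cfg:service haproxy reload" so that the backend is rewritten and haproxy reloaded whenever the active node changes.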

@OWSM

OWSM commented Feb 9, 2016

@jefferai I am personally OK with not using ELB as my primary load balancer. I was mostly wondering what the recommended method is.

I'm thinking about having 2 nodes behind separate ELBs, and using failover routing to route traffic appropriately. The problem with this of course is that when the original primary is healthy again it will slow everyone down with redirects.

I don't have any experience with Consul, so I don't really know how that solution would work.

@OWSM

OWSM commented Feb 9, 2016

Actually, I suppose if I only use ?standbyok on the failover, it won't switch back to the primary unless I reset things

@jefferai
Member

jefferai commented Feb 9, 2016

standbyok can help because it keeps all of the nodes InService, but the problem there is indicating to ELB which one to prefer (I really don't know ELB's capabilities there, so it may be simple). You don't want to have ELB pointing every single request to a standby because it's getting 200 and keeps assuming it's a great node to send traffic to. In that scenario you'd still need your clients to be able to access Vault directly, so that if ELB sends them to a standby, the standby doesn't redirect right back to the ELB -- the standby would have to give out (via the active node's advertise_addr) the direct address of the active node.

This does help in terms of dealing with downtime, so long as ELB -- even if it thinks a node is InService -- will stop routing traffic to it if it is seeing 500s or timeouts from a backend and prefer one of the other InService nodes while it goes through the health check timeout period. It also lets you keep a single DNS point of origin without setting up a load balancer behind your load balancer. But you'd definitely want to know the behavior of ELB in those two respects (node preference selection and failures on an InService node before the health check triggers), as depending on the answers to those you may not actually shorten your downtime.
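
For reference, a hedged sketch of probing the health endpoint under discussion (status codes per /v1/sys/health behavior of this era: 200 for the unsealed active node, 429 for an unsealed standby unless ?standbyok=true is passed, and a 500-series code for sealed/uninitialized nodes; the hostname is hypothetical):

```python
# Hedged sketch: classify a Vault node by hitting /v1/sys/health directly.
import requests

def vault_role(node_addr):
    resp = requests.get(node_addr + "/v1/sys/health", timeout=2)
    if resp.status_code == 200:
        return "active"
    if resp.status_code == 429:
        return "standby"
    return "unhealthy"  # sealed or uninitialized

print(vault_role("https://vault-node-1.internal:8200"))  # hypothetical address
```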

As for Consul, it's a service discovery and key/value store (and one of HashiCorp's projects). You can find out more about it at https://www.consul.io/ -- besides providing DNS and HTTP based ways to discover active/alive services, it can also power other tools like consul-template to rewrite config files and reload services as needed.

@OWSM

OWSM commented Feb 9, 2016

What I'm thinking of doing is the following:

           Route53 DNS
           /          \
   ELBPrimary       ELBFailover
       /                  \
PrimaryInstance      StandbyInstance

Route53 would use failover routing to direct traffic at ELBPrimary if it has healthy instances and at ELBFailover if Primary does not have healthy instances and Failover does.

The primary instance can be on the Primary ELB. This ELB only lists an instance as healthy if it is the active node.

Only the Standby instance would be in the Failover ELB. This ELB uses the ?standbyok flag.

In the case of a failover event where the original primary goes down, the ELB will mark it as unhealthy and Route53 will redirect traffic to the Failover ELB, which has a healthy instance in standby. This limits the downtime to only as long as it takes for that instance to elect itself the new primary.

I can then get a new node into ELBPrimary, and seal+unseal the Vault on the Standby instance to force the primary role back to the instance in the correct ELB, resetting the system.

Have I missed anything that would make this system fail?

@jefferai
Member

jefferai commented Feb 9, 2016

In the case of a failover event where the original primary goes down, the ELB will mark it as unhealthy and Route53 will redirect traffic to the Failover ELB, which has a healthy instance in standby. This limits the downtime to only as long as it takes for that instance to elect itself the new primary.

I believe it actually still takes as long as ELB takes to mark it unhealthy (minimum of ten seconds), whereas depending on how HA in Vault is set up and the nature of Vault becoming unavailable, the new active node should take over in < 1 second.

The proposed architecture has benefits with blue/green deployment strategies or if you want your Vault nodes in different availability zones, but still has higher theoretical downtime than need be.

Of course, everything is theory and what matters is what you can tolerate. If you want absolute minimum downtime possible, relying on ELB health checks will not allow for this. If you want minimum downtime within the constraints of AWS, ELB is fine so long as health check settings are tuned appropriately.

@OWSM

OWSM commented Feb 9, 2016

@jefferai You are correct, I hadn't thought of that. It sounds like until Vault supports active/active nodes (if it ever does), the best options are going to be DNS round robin or Consul. I've spent 2 days reading the Vault documentation, and now it looks like I'm going to have to spend some quality time with the Consul docs as well.

@bkrodgers
Contributor Author

For the time being, I've accepted that I'll have a brief outage. I seal the active node, and within about 10 seconds one of my standbys takes over. As discussed, there are 3 stages of the outage:

  1. After sealing, but before the ELB sees that node fail the health check, I get "vault is sealed," as it is still routing to that node. This lasts for 5-10 seconds, since it has to fail twice on a 5 second check interval.
  2. Once it fails the health check, there may be a period where no servers are in the ELB until step 3 completes.
  3. Once one of the new nodes takes over from the Vault side (which typically should be very quick), it still won't start taking traffic for 5-10 seconds, since it has to pass twice on a 5 second check interval.

These steps aren't sequential though -- the health checks run against all servers on the same 5 second interval, so steps 1 and 3 largely overlap. Step 2 usually doesn't last long, or happen at all, but there is often a small gap between moving from step 1 to step 3.

?standbyok ended up not being helpful for my situation, since I don't expose the Vault nodes except through the ELB. I didn't realize that when I requested the feature. :)

Obviously I'd still like a better solution that gets me to 0, but a 5-10 second blip has been workable thus far. I'm trying to make sure that everyone coding against Vault considers retry logic anyway.
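
As an illustration of that retry suggestion, a minimal sketch (the backoff values and the use of the requests library are assumptions, not a prescribed client):

```python
# Hedged sketch: retry a Vault read on 5xx or connection errors so that a
# brief leadership transition shows up as added latency, not a hard failure.
import time
import requests

def vault_get(url, token, attempts=5, backoff=1.0):
    for attempt in range(attempts):
        try:
            resp = requests.get(url, headers={"X-Vault-Token": token}, timeout=5)
            if resp.status_code < 500:
                return resp  # success, or a 4xx the caller should handle
        except (requests.ConnectionError, requests.Timeout):
            pass  # node may be mid-failover; fall through and retry
        time.sleep(backoff * (attempt + 1))  # simple linear backoff
    raise RuntimeError("Vault unavailable after %d attempts" % attempts)
```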

@jefferai
Member

jefferai commented Feb 9, 2016

Obviously I'd still like a better solution that gets me to 0, but a 5-10 second blip has been workable thus far. I'm trying to make sure that everyone coding against Vault considers retry logic anyway.

Good to hear, and good idea to suggest that to your developers.

I know a former ELB product manager who has told me that ELB was meant to be cheap and reliable, not flexible. Which is totally sane from Amazon's side, but it also means that the downtime is higher than it could be with more modern LB features. Bypassing ELB can get you lower downtime, but at a cost of more work/configuration/monitoring/etc. on your end. In the end (like most things!) it just comes down to what set of tradeoffs you can live with.

@OWSM

OWSM commented Feb 9, 2016

I may have to do the same, @bkrodgers.

Of course there is also the necessary consideration that adding a node to the cluster is a manual process, which is something I would typically avoid. Unfortunately the only other solution to the problem I have of issuing AWS keys through LDAP is OneLogin, and I don't particularly want to pay for that service.

@OWSM

OWSM commented Feb 9, 2016

@jefferai ELB was also only designed for active/active configurations.

@bkrodgers
Contributor Author

Well, as long as you adhere to the nuclear key unlock principle that is central to Vault, yeah, it'll always be manual to some extent. I've got everything scripted to start up, but we do have to go in and manually unlock things.

Of course, if you don't feel a need to use that principle, you can always store the keys together and script the unlock process... you'd also need a place to put those keys where they can be secure, so only your automation script can get them. I suppose you could put those in Vault itself, so that as long as one Vault node is unlocked the others can read from Vault to unlock themselves. But the idea behind the distributed unlock process is a sound one from a security perspective. Whether or not you feel that outweighs the manual aspect of the process is a question you and your security team need to consider.
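
For concreteness, a hedged sketch of what such a scripted unlock could look like against the unseal API (illustrative only; it deliberately gives up the split-key protection described above, and securing the stored shares is the hard part):

```python
# Hedged sketch: submit stored unseal key shares to a node's sys/unseal
# endpoint until the node reports sealed=false.
import requests

def unseal(node_addr, key_shares):
    for share in key_shares:
        resp = requests.put(node_addr + "/v1/sys/unseal", json={"key": share})
        resp.raise_for_status()
        if not resp.json().get("sealed", True):
            return True  # threshold reached; node is unsealed
    return False  # ran out of shares without crossing the threshold
```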

@OWSM

OWSM commented Feb 9, 2016

@bkrodgers No, it absolutely makes sense. Just thinking about alternative tools like credstash.

I think I like credstash better for generic secrets, but it can't do any of the nice dynamic secrets stuff, and I don't want two different tools.

@jefferai
Member

jefferai commented Feb 9, 2016

@OWSM Yeah, it's designed for active/active, but...that's what I was saying about "not flexible"...Vault isn't the only active/standby process out there. Most front-end web apps are easily active-active, though, because they're not dealing with state or actual data management. So it's a 95% thing: they're designing to the 95% (or maybe even 99%), but not all applications can fit into that model.

@OWSM

OWSM commented Feb 10, 2016

@jefferai No, it's not the only active/standby process out there, but I continue to be surprised by anyone calling that HA in this modern age (though it absolutely was just 5 years ago)

@jefferai
Member

@OWSM < 1 second failover is not long, and with a client that performs a retry on a 5xx error, there should never be client-observable downtime. I don't think calling it HA is far-fetched at all.

@OWSM

OWSM commented Feb 10, 2016

We're talking about <1s for the new master to take over, sure, but you are passing on to the user the implementation of all of the required logic to properly direct traffic unless the nodes are directly addressable (which is less and less common), and even then only if you want to accept the (admittedly minimal) latency introduced by a redirect to the master node.

I like a lot of your implementation, but IMO warm failover != HA.

@jefferai
Member

If you want to use an ELB, that comes with pros and cons, like anything else. One con is going to be a longer failover time -- something that, by the way, ELB imposes on any service: so long as ELB continues routing to a failed node until its health check fails, users will see errors. Even if ELB takes 500 errors into account as a sign to stop routing there, a non-responsive node (as opposed to one that merely returns errors) will certainly incur a delay before ELB decides it's gone. So even being active-active isn't a guarantee that clients won't see problems when a node runs into issues.

However, I take issue with your base assumption. Most users of Vault don't have Vault accessible to the public Internet, and as such, having the Vault nodes directly accessible is completely possible -- in fact, HC's internal Vault was on an internal ELB for historical reasons but is switching over to direct access, because ELB doesn't provide us any benefit here (anymore) and can slow down failover. It also opens up possibilities like service discovery (of which Consul is one but not the only option out there) that can reduce that time significantly.

And, even if you want access to Vault from the public Internet, ELB is not a slam-dunk case. Vault should only be used with any load balancer in TCP mode, so Vault is already handling TLS. ELB doesn't really shield you from DoS attacks. ELB can provide a single point of ingress, but so can other load balancers -- ones that can be updated to reflect a new active node far faster than ELB.

Given the above, I don't think Vault nodes being directly accessible is that big an issue; if you can handle the latency from redirects, you can still use ELB as a single point of ingress, use standbyok so that any node can handle traffic, and let Vault figure out who the current active node is instead of ELB. Purely in terms of availability that's pseudo active-active, with a <1s leadership transition.
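
Concretely, that pattern relies on the client following the standby's redirect (a sketch: Vault standbys answer with a 307 whose Location points at the active node's advertise_addr, and the requests library follows 307s by default; the ELB hostname is hypothetical):

```python
# Hedged sketch: hit the ELB, which may route to a standby kept InService via
# ?standbyok; the client then follows the standby's 307 redirect straight to
# the directly reachable active node.
import requests

resp = requests.get(
    "https://vault.example.internal:8200/v1/secret/myapp",  # hypothetical ELB
    headers={"X-Vault-Token": "..."},  # placeholder token
    allow_redirects=True,  # the default; follows the standby's 307
)
print(resp.status_code, resp.url)  # resp.url shows which node answered
```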

@OWSM

OWSM commented Feb 10, 2016

@jefferai it's fair to say that in most cases vault would be entirely internal, but I'm not sure how that is relevant to my points.

Also, ELBs do have multiple layers of built-in DoS protection, and using TLS on the ELB means I can use Amazon's managed certificates (new) and just use self-signed certificates on the actual server.

@jefferai
Member

it's fair to say that in most cases vault would be entirely internal, but I'm not sure how that is relevant to my points

Most people would find less of a reason to prevent direct access to the Vault nodes if all access is internal, although as I pointed out, I think it's generally fine anyways.

Also, ELBs do have multiple layers of built-in DoS protection

Maybe, but I sure have seen a lot of guides over the years about how to prevent DoS attacks when using ELB.

using TLS on the ELB means I can use Amazon's managed certificates (new) and just use self-signed on the actual server

This is not our recommended configuration, and as such, it gives you a different set of constraints to work with. As I mentioned earlier, we only ever recommend using TCP proxying on a load balancer with Vault, and my assertions have been made with this scenario in mind.

@seeder

seeder commented Aug 15, 2016

I am having an interesting problem when using Consul/traefik to load balance Vault access.

How do I prevent Vault from registering unsealed standby nodes as healthy?
It would be nice to have an option for standby nodes to either add a custom tag to their service registration or not mark themselves as healthy, leaving only the leader marked as such.

@jefferai
Member

They already do register appropriate tags (active, standby) although I forget if that's in 0.6 or the 0.6.1 RCs.

@seeder

seeder commented Aug 15, 2016

Yes they do; however, they are not settable from config and are therefore hard to use with the traefik LB, as LBs usually require some custom tags like traefik.enable=false to get them to do something.

It would be nice to have a health check for standby status, as Vault already registers one in Consul for sealed/unsealed status.

@jefferai
Member

@seeder if you can do a check against the HTTP API (rather than DNS) then it's easy to figure out the status/state of any of the Vault nodes from the output. It's the DNS API that is hard to use simply because of limitations with DNS.
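
For example, a hedged sketch against Consul's health HTTP API (field names per the /v1/health/service endpoint; the tag check assumes the active/standby tags discussed above):

```python
# Hedged sketch: list Vault nodes with their role tag and check status --
# the information that is awkward to surface through the DNS interface.
import requests

resp = requests.get("http://127.0.0.1:8500/v1/health/service/vault", timeout=2)
resp.raise_for_status()
for entry in resp.json():
    svc = entry["Service"]
    role = "active" if "active" in (svc.get("Tags") or []) else "standby"
    passing = all(check["Status"] == "passing" for check in entry["Checks"])
    print(svc["Address"] or entry["Node"]["Address"], svc["Port"], role, passing)
```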
