etcd Catalog Disaster Recovery is broken #7310
Hi @schuylr, it looks like metadata and possibly other infrastructure services are unstable. Since you mention containers flapping and Docker daemons locking up, I am guessing there is some sort of resource exhaustion occurring. I would recommend you tear down whatever user stacks you can and stabilize the infrastructure services first. Unfortunately, the etcd (v2.3.7-6) in community-catalog is an ephemeral version - if you've lost all of the data containers, all state is lost forever. Our supported Kubernetes stack uses a version which persists state to the host, FYI.
@LLParse In my opinion, it was more the known bugs stemming from moby/moby#13885 and coreos/bugs#1654 (I'm running CoreOS 1185.3.0) that caused it. I have the data backed up from the survivors already and I'm starting from scratch. My workaround is to stop the service, drop in the backed-up folder, and set the disaster recovery flag. It would be nice to see the community etcd support an option to let us select either the ephemeral or the host-mapped version. I'm heavily invested in Cattle at this point and don't want to change orchestration methods.
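For anyone following along, that workaround reduces to something like the sketch below. The container name and paths are assumptions, not values from this thread; only the `--force-new-cluster` flag is the standard etcd 2.x disaster-recovery mechanism.

```bash
# Sketch of the manual workaround, assuming a host-mapped data directory.

# Stop the flapping etcd container so Rancher stops cycling it.
docker stop etcd-node-1                    # hypothetical container name

# Drop the backed-up data into the host path the container mounts.
cp -a /backup/etcd-data/. /var/lib/etcd/   # hypothetical paths

# Restart etcd on that data with --force-new-cluster, which discards the
# old membership and brings it up as a single-node cluster.
etcd --data-dir /var/lib/etcd --force-new-cluster
```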
Ok, so a brand new etcd cluster breaks entirely.
The issue with the community-supported etcd cluster lies in how the endpoint name is inferred. It looks like you fixed it in rancher/catalog-dockerfiles#85.
Yep, you're right, nice catch! It looks like this is fallout from the RFC compliance refactor. I will open up a PR to bring over the latest version.
@LLParse Thanks! I'm doing further testing with this build as well. I think disaster recovery won't work with the data containers, and I'm also partially worried that it won't work with mounted volumes either. In both scenarios, Rancher is still killing the recovered node and rescheduling it on another host :(
@schuylr Maybe disaster recovery wasn't working very well back in that version. NFSv4 and etcd 2.x don't mix, so if you are mounting NFS shares on the underlying host for use by etcd, well, don't do that. You can, however, send the backups to an NFS share. See the etcd wiki for operational instructions pertaining to disasters, backup creation/restoration, etc.
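For reference, the etcd 2.x backup/restore procedure that wiki covers boils down to roughly the following; the paths here are illustrative:

```bash
# Take a backup: `etcdctl backup` copies the data directory and rewrites
# node-specific metadata so the copy can seed a fresh cluster.
etcdctl backup \
  --data-dir /var/lib/etcd \
  --backup-dir /mnt/share/etcd-backup   # shipping backups to NFS is fine

# Restore: start etcd on the backup with --force-new-cluster, which drops
# the old membership and comes up as a single-node cluster.
etcd --data-dir /mnt/share/etcd-backup --force-new-cluster
```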
@LLParse Nope, no NFS here; it's data containers or host volumes. I'm already learning a bunch about etcd disaster recovery. I may have figured out the issue, though. When Rancher 1.2.0 did the upgrade, it left a slew of old Stopped Rancher agents, and it keeps trying to start them back up. It's insanity, and the process queue is constantly full of hung processes. I'm going through all of the Stopped containers and removing them (see the sketch below), which should hopefully free up the process queue so there are no flapping containers while the disaster node tries to restart. Will keep you posted.
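A sweep like the following is one way to do that cleanup; the name filter is an assumption about how the old agents are labeled, so inspect the list before deleting anything:

```bash
# List stopped containers whose names look like Rancher agents
# (the "rancher-agent" name filter is an assumption).
docker ps -a --filter "status=exited" --filter "name=rancher-agent" \
  --format '{{.ID}}\t{{.Names}}\t{{.Status}}'

# Once the list looks right, remove them to unclog the process queue.
docker rm $(docker ps -aq --filter "status=exited" --filter "name=rancher-agent")
```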
Ugh, very odd. I pushed out the updated etcd template that will work with Rancher 1.2; it comes with periodic backups and functional DR automation.
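Without having looked at the template internals, a periodic-backup sidekick presumably reduces to a loop along these lines; the interval, paths, and retention policy below are all made up for illustration:

```bash
# Hypothetical periodic-backup loop; the real template's mechanism,
# paths, and schedule may differ.
while true; do
  ts=$(date +%Y%m%d-%H%M%S)
  etcdctl backup --data-dir /data --backup-dir "/data-backup/$ts"
  # Keep only the five most recent backups (made-up retention policy).
  ls -1dt /data-backup/* | tail -n +6 | xargs -r rm -rf
  sleep 900   # every 15 minutes
done
```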
Thanks! I built your image myself and it works great now that I managed to get some proper backed-up data into the etcd cluster, so I'll switch it over to your catalog entry once I have a chance.
@LLParse I switched back to the community catalog using the API in Rancher and it's not recognizing the new template version as an upgrade.
@schuylr Looks like we deprecated our homegrown version comparison logic in favor of github.com/blang/semver. I'll also have to remove the max_rancher_version restriction from the latest template to enable the upgrade path. That seems to be the only option in this case.
Thanks - I'll wait on this. If I've already mapped the
You can map both to
Rancher Version: 1.2.2
Docker Version: 1.11.2
OS and where are the hosts located? (cloud, bare metal, etc): AWS EC2
Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB) Single Node External DB
Environment Type: (Cattle/Kubernetes/Swarm/Mesos) Cattle
Steps to Reproduce:
This happened when I did the 1.2.0 environment upgrade - a lot of the Docker daemons locked up from a massive number of container additions/removals during the upgrade, causing Rancher to kill a majority of the etcd nodes due to failed health checks.
Kill more than N/2 hosts to invoke a disaster. Try to start disaster recovery on a surviving node by executing a shell and typing `disaster` (see the sketch below).
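Concretely, triggering the recovery looks something like this; the container name is a placeholder, not the real one:

```bash
# Open a shell in a surviving etcd container (name is hypothetical).
docker exec -it r-etcd_etcd_1 sh

# Inside the container, run the catalog's disaster-recovery command.
disaster
```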
Results:
Disaster flag gets set, the container shuts down and tries to start, but then gets killed by Rancher before the start can even complete. The corresponding data container also gets killed :(
Expected:
A running disaster-recovery etcd node.
I have 2/9 etcd instances now running and no way to recover the remaining 7.
Update
I sacrificed one more node and captured the logs from the `disaster` run. I'm pretty sure that Rancher deletes the data container before disaster recovery completes, causing the behavior described above.