etcd Catalog Disaster Recovery is broken #7310

Closed
schuylr opened this issue Jan 5, 2017 · 14 comments

schuylr commented Jan 5, 2017

Rancher Version: 1.2.2

Docker Version: 1.11.2

OS and where are the hosts located? (cloud, bare metal, etc): AWS EC2

Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB) Single Node External DB

Environment Type: (Cattle/Kubernetes/Swarm/Mesos) Cattle

Steps to Reproduce:

This happened when I did the 1.2.0 environment upgrade: many of the Docker daemons locked up under the massive number of container additions/removals during the upgrade, causing Rancher to kill a majority of the etcd nodes due to failed health checks.

Kill more than N/2 hosts to invoke a disaster. Try to start disaster recovery on a surviving node by executing a shell in the container and typing disaster.
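
(Concretely, from the Docker host, that step looks roughly like this - the container name is a placeholder, and per my workaround later in this thread the disaster command appears to amount to creating the /data/DR flag directory:)

    # Sketch only: run the DR command inside a surviving etcd container.
    docker exec <surviving-etcd-container> disaster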

Results:

The disaster flag gets set, the container shuts down and tries to restart, but then gets killed by Rancher before the start can even complete. The corresponding data container also gets killed :(

[screenshot attached: 2017-01-05 at 9:57 AM]

Expected:

A running disaster-recovery etcd node.

I have 2/9 etcd instances now running and no way to recover the remaining 7.

Update

I sacrificed one more node and got the following logs from the disaster recovery attempt:

2017-01-05 15:42:37.111372 N | osutil: received terminated signal, shutting down...
2017-01-05 15:42:37.114198 E | rafthttp: failed to read d309b12a4796d448 on stream MsgApp v2 (net/http: request canceled)
2017-01-05 15:42:37.114221 I | rafthttp: the connection with d309b12a4796d448 became inactive
/run.sh: line 67:   296 Terminated              etcd --name ${NAME} --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://${IP}:2379 --listen-peer-urls http://0.0.0.0:2380 --initial-advertise-peer-urls http://${IP}:2380 --initial-cluster-state existing --initial-cluster $cluster
time="2017-01-05T15:43:14Z" level=fatal msg="Error 404 accessing /self/stack/services/etcd path" 
time="2017-01-05T15:43:14Z" level=fatal msg="Failed to find IP: Error 404 accessing /self/container path" 
wget: server returned error: HTTP/1.1 404 Not Found
wget: server returned error: HTTP/1.1 404 Not Found
wget: server returned error: HTTP/1.1 404 Not Found
Creating a DR backup...
Sanitizing DR backup...
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
2017-01-05 15:43:15.474856 E | etcdmain: error verifying flags, '/data/data.20170105.154314.DR' is not a valid flag. See 'etcd --help'.
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused

I'm pretty sure that Rancher deletes the data container before disaster recovery completes, causing the above to happen.


LLParse commented Jan 5, 2017

Hi @schuylr, it looks like metadata and possibly other infrastructure services are unstable. Since you mention containers flapping and Docker daemons locking up, I'm guessing there is some sort of resource exhaustion occurring. I would recommend you tear down whatever user stacks you can and stabilize the infrastructure services first.

Unfortunately, the etcd template (v2.3.7-6) in community-catalog is an ephemeral version - if you've lost all of the data containers, all state is lost forever. FYI, our supported Kubernetes stack uses a version which persists state to the host.


schuylr commented Jan 5, 2017

@LLParse It was more of the known bugs that stem from moby/moby#13885 and coreos/bugs#1654 (I'm running CoreOS 1185.3.0) that caused it, in my opinion.

I have the data backed up from the survivors already and I'm starting from scratch. My workaround is to stop the service, drop in the backed-up folder, set the disaster recovery flag with mkdir -p /data/DR, and then start the container again, though I'm running into some issues with unequal member counts right now.
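
Roughly, as shell (the container name and backup path below are placeholders, not my actual ones):

    # Sketch of the workaround; <etcd-container> shares its /data volume with
    # the data container, and docker cp works against a stopped container.
    docker stop <etcd-container>                                   # stop the service
    docker cp <backup-dir>/data.current <etcd-container>:/data/    # drop in the backed-up folder
    mkdir -p DR && docker cp DR <etcd-container>:/data/DR          # set the disaster recovery flag
    docker start <etcd-container>                                  # start the container again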

It would be nice to see the community etcd template offer an option to choose between the ephemeral and host-mapped versions. I'm heavily invested in Cattle at this point and don't want to change orchestration methods.


schuylr commented Jan 5, 2017

OK, so a brand-new etcd cluster breaks entirely. etcd-ha-etcd-1 works fine, but the second through Nth nodes all crap out with this:

Waiting for lower index nodes to all be active
OK
No etcd nodes available
2017-01-05 18:33:07.082108 I | flags: recognized and used environment variable ETCD_DATA_DIR=/data/data.current
2017-01-05 18:33:07.082280 I | etcdmain: etcd Version: 2.3.7
2017-01-05 18:33:07.082293 I | etcdmain: Git SHA: fd17c91
2017-01-05 18:33:07.082299 I | etcdmain: Go Version: go1.6.2
2017-01-05 18:33:07.082303 I | etcdmain: Go OS/Arch: linux/amd64
2017-01-05 18:33:07.082309 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2017-01-05 18:33:07.082390 I | etcdmain: listening for peers on http://0.0.0.0:2380
2017-01-05 18:33:07.082426 I | etcdmain: listening for client requests on http://0.0.0.0:2379
2017-01-05 18:33:07.126500 I | etcdmain: stopping listening for client requests on http://0.0.0.0:2379
2017-01-05 18:33:07.126535 I | etcdmain: stopping listening for peers on http://0.0.0.0:2380
2017-01-05 18:33:07.126546 C | etcdmain: error validating peerURLs {ClusterID:e6327abfd2941b19 Members:[&{ID:28caae647f0cad1e RaftAttributes:{PeerURLs:[http://10.42.98.64:2380]} Attributes:{Name:etcd-ha-etcd-1 ClientURLs:[http://10.42.98.64:2379]}}] RemovedMemberIDs:[]}: member count is unequal
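
For reference, the members registered in the surviving cluster can be inspected from the working first node with etcdctl v2 (endpoint taken from the log above):

    # List the registered members; a stale entry can then be removed with
    # "member remove <member-id>" before the join is retried.
    etcdctl --endpoints http://10.42.98.64:2379 member list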


schuylr commented Jan 5, 2017

The issue with the community-supported etcd cluster lies here in run.sh:

etcdctln() {
    target=0
    for j in $(seq 1 5); do
        for i in $(seq 1 $SCALE); do
            giddyup probe http://${STACK_NAME}_etcd_${i}:2379/health &> /dev/null
            if [ "$?" == "0" ]; then
                target=$i
                break
            fi
        done
        if [ "$target" != "0" ]; then
            break
        fi
        sleep 1
    done
    if [ "$target" == "0" ]; then
        echo No etcd nodes available
    else
        etcdctl --endpoints http://${STACK_NAME}_etcd_$target:2379 $@
    fi
}

As you can see, the endpoint name is assumed to be ${STACK_NAME}_etcd_$target, when it should follow the new Rancher 1.2.x naming convention of ${STACK_NAME}-etcd-$target. Because of this, giddyup probe fails due to a non-existent hostname.
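
In other words, the fix amounts to switching those two lines to the hyphenated names (just a sketch; the actual change is the one referenced below):

    # Rancher 1.2.x service DNS uses hyphens, not underscores:
    giddyup probe http://${STACK_NAME}-etcd-${i}:2379/health &> /dev/null
    # ...and, correspondingly:
    etcdctl --endpoints http://${STACK_NAME}-etcd-$target:2379 $@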

It looks like you fixed it in rancher/catalog-dockerfiles#85


LLParse commented Jan 5, 2017

Yep, you're right... nice catch! It looks like this is fallout from the RFC compliance refactor.

I will open up a PR to bring the latest rancher/etcd:v2.3.7-11 to the community catalog; it will be available to you shortly.


schuylr commented Jan 5, 2017

@LLParse Thanks! I'm doing further testing with this build as well. I think disaster recovery won't work with the data containers, and I'm also partially worried that it won't work with mounted volumes either. In both scenarios, Rancher is still killing the recovered node and rescheduling it on another host :(


LLParse commented Jan 5, 2017

@schuylr Maybe disaster recovery wasn't working very well back in v2.3.7-6... :)

In v2.3.7-11, the data container only exists so that users with a functional ephemeral deployment could upgrade to the persistent version. It actually has no use beyond that point, and we will be removing it in a future release.

NFSv4 and etcd 2.x don't mix, so if you are mounting NFS shares on the underlying host for use by etcd, well, don't do that. You can, however, send the backups to an NFS share. See this wiki for etcd operational instructions pertaining to disaster recovery, backup creation/restoration, etc.


schuylr commented Jan 5, 2017

@LLParse Nope, no NFS here. It's the data container or host volumes from here. I'm already learning a bunch of etcd disaster recovery stuff.

I may have figured out the issue, though. When Rancher 1.2.0 did the upgrade, it left a slew of old Stopped Rancher agents, and it keeps trying to start them back up. It's insanity, and the process queue is constantly full of hung processes. I'm going through all of the Stopped containers and removing them, which should hopefully free up the process queue so there are no flapping containers while the disaster node tries to restart.

Will keep you posted.


LLParse commented Jan 5, 2017

Ugh, very odd.

I pushed out the updated etcd template that will work with Rancher 1.2; it comes with periodic backups and functional DR automation. If you have a data container that hasn't been deleted, you'll want to docker cp the data dir from the container to the host ASAP. Once the Rancher infrastructure services stabilize, the outlined backup process is what you'll want to follow.
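
Something like this (the container name and destination path are placeholders; /data/data.current is the data dir from your logs):

    # Copy the etcd data dir out of the (possibly stopped) data container to
    # the host before Rancher removes it.
    mkdir -p /var/etcd-rescue
    docker cp <etcd-data-container>:/data/data.current /var/etcd-rescue/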


schuylr commented Jan 6, 2017

Thanks! I built your image myself and it works great now that I managed to get some proper backed-up data into the etcd cluster, so I'll switch it over to your catalog entry once I have a chance.


schuylr commented Jan 9, 2017

@LLParse I switched back to the community catalog using the Rancher API and it's not recognizing 2.3.7 as an upgrade path. Rancher doesn't consider a dash suffix an upgrade either (e.g. 2.3.7-rancher2) - can we use a different version name so it doesn't collide with previous versions?


LLParse commented Jan 9, 2017

@schuylr Looks like we deprecated our homegrown version comparison logic in favor of github.com/blang/semver. I'll also have to remove the max_rancher_version restriction for the latest template to enable the upgrade path. That seems to be the only option in this case.

rancher/community-catalog#397


schuylr commented Jan 9, 2017

Thanks - I'll wait on this. If I've already mapped the /pdata folder to /var/etcd on my host machines, can I set BACKUP_LOCATION to /var/etcd to have the upgrade reliably use data.current?


LLParse commented Jan 9, 2017

You can map both to /var/etcd, yes. The backups will not interfere with subsequent upgrades. Each backup gets its own timestamped directory. For example, in your setup you'll see /var/etcd/<timestamp>_etcd_<service_index> periodically created on one of the hosts.
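
If you want to sanity-check that on a host, roughly:

    # List the periodic backup directories landing under the mapped location.
    ls -d /var/etcd/*_etcd_*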

schuylr closed this as completed Mar 2, 2017