etcd Catalog Disaster Recovery is broken #7310

Closed
schuylr opened this issue Jan 5, 2017 · 14 comments

schuylr commented Jan 5, 2017

Rancher Version: 1.2.2

Docker Version: 1.11.2

OS and where are the hosts located? (cloud, bare metal, etc): AWS EC2

Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB) Single Node External DB

Environment Type: (Cattle/Kubernetes/Swarm/Mesos) Cattle

Steps to Reproduce:

This happened when I did the 1.2.0 environment upgrade: many of the Docker daemons locked up under the massive number of container additions/removals during the upgrade, causing Rancher to kill a majority of the etcd nodes due to failed health checks.

Kill more than N/2 hosts to invoke a disaster. Try to start disaster recovery on a surviving node by executing a shell in the container and typing disaster.
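
(Concretely, from the Docker host, that step looks roughly like this - the container name is a placeholder, and per my workaround later in this thread the disaster command appears to amount to creating the /data/DR flag directory:)

    # Sketch only: run the DR command inside a surviving etcd container.
    docker exec <surviving-etcd-container> disaster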

Results:

The disaster flag gets set, the container shuts down and tries to restart, but then gets killed by Rancher before the start can even complete. The corresponding data container also gets killed :(

[screenshot attached: 2017-01-05 at 9:57 AM]

Expected:

A running disaster-recovery etcd node.

I have 2/9 etcd instances now running and no way to recover the remaining 7.

Update

I sacrificed one more node and got the following logs from the disaster recovery attempt:

2017-01-05 15:42:37.111372 N | osutil: received terminated signal, shutting down...
2017-01-05 15:42:37.114198 E | rafthttp: failed to read d309b12a4796d448 on stream MsgApp v2 (net/http: request canceled)
2017-01-05 15:42:37.114221 I | rafthttp: the connection with d309b12a4796d448 became inactive
/run.sh: line 67:   296 Terminated              etcd --name ${NAME} --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://${IP}:2379 --listen-peer-urls http://0.0.0.0:2380 --initial-advertise-peer-urls http://${IP}:2380 --initial-cluster-state existing --initial-cluster $cluster
time="2017-01-05T15:43:14Z" level=fatal msg="Error 404 accessing /self/stack/services/etcd path" 
time="2017-01-05T15:43:14Z" level=fatal msg="Failed to find IP: Error 404 accessing /self/container path" 
wget: server returned error: HTTP/1.1 404 Not Found
wget: server returned error: HTTP/1.1 404 Not Found
wget: server returned error: HTTP/1.1 404 Not Found
Creating a DR backup...
Sanitizing DR backup...
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
2017-01-05 15:43:15.474856 E | etcdmain: error verifying flags, '/data/data.20170105.154314.DR' is not a valid flag. See 'etcd --help'.
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused
Get http://127.0.0.1:2379/health: dial tcp 127.0.0.1:2379: getsockopt: connection refused

I'm pretty sure that Rancher deletes the data container before disaster recovery completes, causing the above to happen.


LLParse commented Jan 5, 2017

Hi @schuylr, it looks like metadata and possibly other infrastructure services are unstable. Since you mention containers flapping and Docker daemons locking up, I'm guessing there is some sort of resource exhaustion occurring. I would recommend you tear down whatever user stacks you can and stabilize the infrastructure services first.

Unfortunately, the etcd template (v2.3.7-6) in community-catalog is an ephemeral version - if you've lost all of the data containers, all state is lost forever. FYI, our supported Kubernetes stack uses a version which persists state to the host.


schuylr commented Jan 5, 2017

@LLParse It was more of the known bugs that stem from moby/moby#13885 and coreos/bugs#1654 (I'm running CoreOS 1185.3.0) that caused it, in my opinion.

I have the data backed up from the survivors already and I'm starting from scratch. My workaround is to stop the service, drop in the backed-up folder, set the disaster recovery flag with mkdir -p /data/DR, and then start the container again, though I'm running into some issues with unequal member counts right now.
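
Roughly, as shell (the container name and backup path below are placeholders, not my actual ones):

    # Sketch of the workaround; <etcd-container> shares its /data volume with
    # the data container, and docker cp works against a stopped container.
    docker stop <etcd-container>                                   # stop the service
    docker cp <backup-dir>/data.current <etcd-container>:/data/    # drop in the backed-up folder
    mkdir -p DR && docker cp DR <etcd-container>:/data/DR          # set the disaster recovery flag
    docker start <etcd-container>                                  # start the container again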

It would be nice to see the community etcd template offer an option to choose between the ephemeral and host-mapped versions. I'm heavily invested in Cattle at this point and don't want to change orchestration methods.


schuylr commented Jan 5, 2017

OK, so a brand-new etcd cluster breaks entirely. etcd-ha-etcd-1 works fine, but the second through Nth nodes all crap out with this:

Waiting for lower index nodes to all be active
OK
No etcd nodes available
2017-01-05 18:33:07.082108 I | flags: recognized and used environment variable ETCD_DATA_DIR=/data/data.current
2017-01-05 18:33:07.082280 I | etcdmain: etcd Version: 2.3.7
2017-01-05 18:33:07.082293 I | etcdmain: Git SHA: fd17c91
2017-01-05 18:33:07.082299 I | etcdmain: Go Version: go1.6.2
2017-01-05 18:33:07.082303 I | etcdmain: Go OS/Arch: linux/amd64
2017-01-05 18:33:07.082309 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2017-01-05 18:33:07.082390 I | etcdmain: listening for peers on http://0.0.0.0:2380
2017-01-05 18:33:07.082426 I | etcdmain: listening for client requests on http://0.0.0.0:2379
2017-01-05 18:33:07.126500 I | etcdmain: stopping listening for client requests on http://0.0.0.0:2379
2017-01-05 18:33:07.126535 I | etcdmain: stopping listening for peers on http://0.0.0.0:2380
2017-01-05 18:33:07.126546 C | etcdmain: error validating peerURLs {ClusterID:e6327abfd2941b19 Members:[&{ID:28caae647f0cad1e RaftAttributes:{PeerURLs:[http://10.42.98.64:2380]} Attributes:{Name:etcd-ha-etcd-1 ClientURLs:[http://10.42.98.64:2379]}}] RemovedMemberIDs:[]}: member count is unequal
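
For reference, the members registered in the surviving cluster can be inspected from the working first node with etcdctl v2 (endpoint taken from the log above):

    # List the registered members; a stale entry can then be removed with
    # "member remove <member-id>" before the join is retried.
    etcdctl --endpoints http://10.42.98.64:2379 member list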


schuylr commented Jan 5, 2017

The issue with the community-supported etcd cluster lies here in run.sh:

etcdctln() {
    target=0
    for j in $(seq 1 5); do
        for i in $(seq 1 $SCALE); do
            giddyup probe http://${STACK_NAME}_etcd_${i}:2379/health &> /dev/null
            if [ "$?" == "0" ]; then
                target=$i
                break
            fi
        done
        if [ "$target" != "0" ]; then
            break
        fi
        sleep 1
    done
    if [ "$target" == "0" ]; then
        echo No etcd nodes available
    else
        etcdctl --endpoints http://${STACK_NAME}_etcd_$target:2379 $@
    fi
}

As you can see, the endpoint name is assumed to be ${STACK_NAME}_etcd_$target, when it should follow the new Rancher 1.2.x naming convention of ${STACK_NAME}-etcd-$target. Because of this, giddyup probe fails due to a non-existent hostname.
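
In other words, the fix amounts to switching those two lines to the hyphenated names (just a sketch; the actual change is the one referenced below):

    # Rancher 1.2.x service DNS uses hyphens, not underscores:
    giddyup probe http://${STACK_NAME}-etcd-${i}:2379/health &> /dev/null
    # ...and, correspondingly:
    etcdctl --endpoints http://${STACK_NAME}-etcd-$target:2379 $@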

It looks like you fixed it in rancher/catalog-dockerfiles#85


LLParse commented Jan 5, 2017

Yep, you're right... nice catch! It looks like this is fallout from the RFC compliance refactor.

I will open up a PR to bring the latest rancher/etcd:v2.3.7-11 to the community catalog; it will be available to you shortly.


schuylr commented Jan 5, 2017

@LLParse Thanks! I'm doing further testing with this build as well. I think disaster recovery won't work with the data containers, and I'm also partially worried that it won't work with mounted volumes either. In both scenarios, Rancher is still killing the recovered node and rescheduling it on another host :(


LLParse commented Jan 5, 2017

@schuylr Maybe disaster recovery wasn't working very well back in v2.3.7-6... :)

In v2.3.7-11, the data container only exists so that users with a functional ephemeral deployment could upgrade to the persistent version. It actually has no use beyond that point, and we will be removing it in a future release.

NFSv4 and etcd 2.x don't mix, so if you are mounting NFS shares on the underlying host for use by etcd, well, don't do that. You can, however, send the backups to an NFS share. See this wiki for etcd operational instructions pertaining to disaster recovery, backup creation/restoration, etc.


schuylr commented Jan 5, 2017

@LLParse Nope, no NFS here. It's the data container or host volumes from here. I'm already learning a bunch of etcd disaster recovery stuff.

I may have figured out the issue, though. When Rancher 1.2.0 did the upgrade, it left a slew of old Stopped Rancher agents, and it keeps trying to start them back up. It's insanity, and the process queue is constantly full of hung processes. I'm going through all of the Stopped containers and removing them, which should hopefully free up the process queue so there are no flapping containers while the disaster node tries to restart.

Will keep you posted.


LLParse commented Jan 5, 2017

Ugh, very odd.

I pushed out the updated etcd template that will work with Rancher 1.2; it comes with periodic backups and functional DR automation. If you have a data container that hasn't been deleted, you'll want to docker cp the data dir from the container to the host ASAP. Once the Rancher infrastructure services stabilize, the outlined backup process is what you'll want to follow.
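
Something like this (the container name and destination path are placeholders; /data/data.current is the data dir from your logs):

    # Copy the etcd data dir out of the (possibly stopped) data container to
    # the host before Rancher removes it.
    mkdir -p /var/etcd-rescue
    docker cp <etcd-data-container>:/data/data.current /var/etcd-rescue/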


schuylr commented Jan 6, 2017

Thanks! I built your image myself and it works great now that I managed to get some proper backed-up data into the etcd cluster, so I'll switch it over to your catalog entry once I have a chance.


schuylr commented Jan 9, 2017

@LLParse I switched back to the community catalog using the Rancher API and it's not recognizing 2.3.7 as an upgrade path. Rancher doesn't consider a dash suffix an upgrade either (e.g. 2.3.7-rancher2) - can we use a different version name so it doesn't collide with previous versions?


LLParse commented Jan 9, 2017

@schuylr Looks like we deprecated our homegrown version comparison logic in favor of github.com/blang/semver. I'll also have to remove the max_rancher_version restriction for the latest template to enable the upgrade path. That seems to be the only option in this case.

rancher/community-catalog#397


schuylr commented Jan 9, 2017

Thanks - I'll wait on this. If I've already mapped the /pdata folder to /var/etcd on my host machines, can I set BACKUP_LOCATION to /var/etcd to have the upgrade reliably use data.current?


LLParse commented Jan 9, 2017

You can map both to /var/etcd, yes. The backups will not interfere with subsequent upgrades. Each backup gets its own timestamped directory. For example, in your setup you'll see /var/etcd/<timestamp>_etcd_<service_index> periodically created on one of the hosts.
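
If you want to sanity-check that on a host, roughly:

    # List the periodic backup directories landing under the mapped location.
    ls -d /var/etcd/*_etcd_*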

schuylr closed this as completed Mar 2, 2017