session.Done() returns even when all etcd servers in a cluster are online #8181

Closed
sangleganesh opened this issue Jun 27, 2017 · 12 comments

@sangleganesh

We followed the code in #6699 and are using concurrency.Session to detect loss of connectivity to all etcd servers.

The way we do it is to start a watch and, in the watch function, start a Session.
The assumption is that if all etcd servers in the cluster go down, Session.Done() fires, indicating loss of connection to etcd, and we exit the process.

In the following code, sometimes (not always) session.Done() fires almost as soon as the process restarts (within about 3-8 seconds), despite the fact that all servers in the etcd cluster are alive.
How can I log more information about why Done() was triggered?
Can it be triggered if the etcd client switches its connection to another etcd server in the cluster?

Note that we moved the etcd client code to the 3.2.0 release so that it can load balance to other etcd servers (and automatically move connections without disrupting the watches) when an etcd server goes down.

Any help is greatly appreciated.

The etcd server version is: etcdv3.03
client library version: 3.2.0

func watchFn(key string) error {
    session, err := concurrency.NewSession(
        kvClient,
        concurrency.WithTTL(120))
    if err != nil {
        logrus.Errorf("Failed to establish session for etcd client watch: %v", err)
        return err
    }

    ctx, watchCancel := context.WithCancel(context.Background())
    watchRet := make(chan error, 1)
    watchChan := kvClient.Watch(ctx, key, opts...)
    go func() {
        for wresp := range watchChan {
            if wresp.Created {
                // Creation notification, nothing to process yet.
                continue
            }
            if wresp.Canceled {
                // Watch is canceled. Notify the watcher and stop.
                logrus.Errorf("Watch on key %v cancelled. Error: %v", key,
                    wresp.Err())
                watchRet <- wresp.Err()
                return
            }
            for _, ev := range wresp.Events {
                // handle watch update
                _ = ev
            }
        }
    }()

    select {
    case <-session.Done(): // closed by etcd
        // Cancel the watch context
        watchCancel()
        // Indicate to the caller that the watch has been canceled
        logrus.Errorf("Watch closing session")
    case err := <-watchRet: // error in watcher
        watchCancel()
        logrus.Errorf("Watch for %v stopped: %v", key, err)
    }

    return nil
}

Opening an issue as suggested in this forum:
https://groups.google.com/forum/?hl=en#!topic/etcd-dev/0qXtLqRfLTk

@heyitsanthony
Contributor

How can I log more information about why Done() was triggered?

The channel for Done() closes if the client does not receive a keepalive response from the etcd server within the TTL limit.
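There is no built-in way to report why the channel closed, but one way to get more detail is to query the session's lease when Done() fires and log what the cluster says about it. A minimal sketch, reusing kvClient and session from the snippet above (imports and error handling trimmed to match the snippet):

// Diagnostic sketch: when the session expires, ask the cluster about its
// lease so the log shows whether the server also considers it gone.
func logSessionExpiry(kvClient *clientv3.Client, session *concurrency.Session) {
    <-session.Done()
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    resp, err := kvClient.TimeToLive(ctx, session.Lease())
    if err != nil {
        logrus.Errorf("session lease %x done; TimeToLive query failed: %v", session.Lease(), err)
        return
    }
    // resp.TTL == -1 means the server reports the lease as expired or revoked.
    logrus.Errorf("session lease %x done; server TTL=%d, granted TTL=%d",
        session.Lease(), resp.TTL, resp.GrantedTTL)
}

Run it in a goroutine next to the select in watchFn; if the server still reports a positive TTL when Done() fires, the expiry was most likely decided on the client side (missed keepalive responses) rather than by the cluster.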

Can it be triggered if the etcd client switches its connection to another etcd server in the cluster?

If it switches to a member that is partitioned from the rest of the cluster, then it could possibly time out waiting for keepalives.

The etcd server version is: etcdv3.03

3.0.3? There have been some lease fixes to 3.0.x since then:
395bd23 (possibly hitting this one)
2e92779
07e421d
842145e

@sangleganesh
Author

thanks @heyitsanthony

  1. Done() fired within 3-8 seconds (and the TTL was 120 seconds).
  2. The etcd cluster was healthy when the session was closed.
  3. Will retry with the latest versions of the etcd server.

@heyitsanthony heyitsanthony self-assigned this Jul 18, 2017
@xiang90 xiang90 added this to the v3.4.0 milestone Nov 26, 2017
@gyuho
Contributor

gyuho commented Dec 22, 2017

@sangleganesh Any updates?

@sangleganesh
Author

We just worked around the issue. Closing the bug.

@xiang90 xiang90 reopened this Dec 22, 2017
@xiang90
Contributor

xiang90 commented Dec 22, 2017

@sangleganesh I am not happy with the workaround. If this is an issue, we should get it fixed.

@gyuho
Contributor

gyuho commented Dec 22, 2017

@sangleganesh Have you tried the latest etcd?

@dvonthenen

dvonthenen commented Feb 15, 2018

@sangleganesh @gyuho @xiang90 I have verified the issue with the reported server version v3.0.3 and client library version v3.2.0.

I have also retested with server and client version v3.3.1, and the issue is reproducible in that version as well. It falls into the case <-ctx.Done(): // user cancel branch, where the error is context deadline exceeded. Looking into this a little more.
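For reference, a minimal reproduction sketch along these lines (endpoints, TTL, and iteration count are illustrative, not the exact test used):

// Reproduction sketch: create many sessions against a healthy cluster and
// report any whose Done() channel closes long before the 120s TTL.
// Import paths match the 3.2/3.3 client (github.com/coreos/etcd/...).
package main

import (
    "fmt"
    "time"

    "github.com/coreos/etcd/clientv3"
    "github.com/coreos/etcd/clientv3/concurrency"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"127.0.0.1:2379", "127.0.0.1:22379", "127.0.0.1:32379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    for i := 0; i < 1000; i++ {
        start := time.Now()
        s, err := concurrency.NewSession(cli, concurrency.WithTTL(120))
        if err != nil {
            fmt.Printf("#%d: NewSession failed: %v\n", i, err)
            continue
        }
        select {
        case <-s.Done():
            // With a healthy cluster this should never happen this early.
            fmt.Printf("#%d: session expired after %v (TTL was 120s)\n", i, time.Since(start))
        case <-time.After(10 * time.Second):
            s.Close()
        }
    }
}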

@xiang90
Contributor

xiang90 commented Feb 15, 2018

@dvonthenen

great. thanks for looking into this.

@tydra-wang

Maybe the same problem.

I have a three-node etcd cluster, and session.Done() in the client often fires when I kill one of the etcd nodes.
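For context, the setup that should survive a single-node kill is a client configured with every member endpoint, so the lease keepalive stream can fail over to a live member well within the session TTL. A minimal config sketch (addresses are placeholders, imports omitted):

// Config sketch: list every member so lease keepalives can fail over
// when a single node is killed. Addresses are placeholders.
cli, err := clientv3.New(clientv3.Config{
    Endpoints:   []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"},
    DialTimeout: 5 * time.Second,
})
if err != nil {
    log.Fatal(err)
}
defer cli.Close()

session, err := concurrency.NewSession(cli, concurrency.WithTTL(120))
if err != nil {
    log.Fatal(err)
}
// The expectation under discussion: this should not fire when only one
// member of a three-node cluster is down.
<-session.Done()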

@sakateka
Contributor

I have investigated this issue.
It comes down to a classic data race in a distributed system.

The problem is reproducible even in the latest 3.5.0 release. To reproduce it, it is enough to create a client and acquire a sufficiently large number of sessions (each of which grants a lease); the problem may appear even on the first lease.
These steps are automated here.

At some point a situation occurs in the cluster in which the leader has sent the lease-grant change to be committed to the lease store, but has not yet applied that change itself.
Before the leader applies the change, a request arrives from the client asking for the TTL of this lease (which the leader has not yet added to its own state), and this eventually leads to an error: lease not found.
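A small probe sketch of the window being described, assuming the behaviour above (cli is an ordinary clientv3 client, standard-library log, imports trimmed; whether the miss shows up as an error or as TTL == -1 depends on the server version):

// Probe sketch: if the grant has been committed but not yet applied on the
// member answering the follow-up request, the lease briefly looks missing.
leaseResp, err := cli.Grant(context.TODO(), 5)
if err != nil {
    log.Fatal(err)
}
ttlResp, err := cli.TimeToLive(context.TODO(), leaseResp.ID)
if err != nil {
    // e.g. lease not found inside the bad window
    log.Printf("TimeToLive right after Grant failed: %v", err)
} else if ttlResp.TTL == -1 {
    // the server reports the just-granted lease as unknown/expired
    log.Printf("lease %x granted but reported as not found (TTL = -1)", leaseResp.ID)
}

In the session case the request that hits this window is presumably the first lease keepalive, which would explain Done() closing almost immediately after NewSession returns.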

In my view, there are two solutions to this problem:
1. Good: do not complete the LeaseGrant call until the changes have been applied on the leader.
2. Bad, which would only hide the problem: when creating a session, send the first lease-renewal request only after half of the lease lifetime has passed.

@serathius serathius removed this from the etcd-v3.5 milestone Dec 15, 2021
@serathius
Member

I think this was addressed by @ahrtr. Can you confirm?

@ahrtr
Member

ahrtr commented Jun 13, 2022

This issue should have been resolved in 3.5.3 and the main branch. Please see #13932.

Please feel free to reopen this issue if you can still reproduce it on etcd 3.5.3+.

@ahrtr ahrtr closed this as completed Jun 13, 2022