Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to detect if a watch is active #559

Closed
maxweisel opened this issue Nov 24, 2020 · 22 comments
Closed

Unable to detect if a watch is active #559

maxweisel opened this issue Nov 24, 2020 · 22 comments

Comments

@maxweisel
Copy link

This is the same bug that occurred in the Go client library here: kubernetes/kubernetes#65012

Essentially if you create a watch stream and the connection is severed, the stream is unaware and believes it is open. It will stay open forever and just stop reporting events instead of trying to reconnect. This occurs anytime our k8s control plane is updated. Ideally it would detect the broken connection and reconnect to another replica or at least mark the connection closed.

It seems the solution is to occasionally http2 ping the connection to ensure that it's still open, and close if the ping is not received.

I've dug into the library a little and it seems watch.js uses request which is now deprecated. The watch class does allow specifying my own request implementation in the constructor, which I imagine I could use to add the ping feature, but I imagine this is a bug that will impact everyone who uses the watch event, so I figured I'd file an issue.

Happy to contribute a patch that includes this functionality, I'd love some guidance on what the preferred direction should be. Is there a suitable replacement for request that I could implement while I'm working on this patch?

Max

@brendandburns
Copy link
Contributor

Thanks, please see the discussion of alternatives to request here:

#414

The main problem with switching libraries is that all of this code is generated by the OpenAPI Generator project:
https://github.com/OpenAPITools/openapi-generator

We could change generators, but that would also result in significant breaking API changes in the library. I think that the best option is to fork the request generator in OpenAPI Generator (typescript-node) and switch it to a different HTTP library while maintaining the same generated API signatures.

This is a pretty significant undertaking though, and for now no one has taken it on. If there is interest, I'm happy to advise on what needs to be done. I don't have the cycles for it personally, but I can act as a guide.

Regarding hanging watches, I think the best answer is to set the socket timeout pretty low (e.g. 1 minute) which should cause the TCP connection to reset anyway.

@burn2delete
Copy link
Contributor

@brendandburns I would be willing to attempt migrating to a new generator.

@entropitor
Copy link

@maxweisel Thank you for discovering this! We have been plagued by this and I wasn't sure what the problem was

@jhagestedt It seems 1b313ce might be the culprit. Why is the timeout set to 0?

(Although I'm using 0.11.1 and I also seem to have the problem)

@entropitor
Copy link

I also wonder whether https://github.com/kubernetes-client/javascript/blame/master/src/cache.ts#L86 should maybe be before calling the error handlers so that if the error handlers immediately restart, it might end up in a stopped state because we first started (stopped = false) and then the error handler sets it back to "stopped"

@maxweisel
Copy link
Author

@brendanburns fwiw, I don't think setting the socket timeout to one minute is an acceptable solution here. Our control plane runs a very large number of pods and when the stream initially connects it will send the entire list of all running pods and their current states.

Our application fields requests by spinning up pods to do work. If our watch stream goes down, our service grinds to a halt as it can't confirmation that work for the request has started. For our service to run properly, I need to detect that the stream has gone down and reconnect within 5-10 seconds. I could set the timeout to 5 seconds, but that means we'll be pulling the entire list of pods every 5 seconds which is also too much data to comb through.

I believe the correct solution here is to support http2 ping and ping every X seconds on the stream which is what the kubernetes go client is doing. The pings are cheap and the stream won't reset if the connection is active.

@brendandburns
Copy link
Contributor

@flyboarder

I'd welcome the help.

The generator is here:

https://github.com/OpenAPITools/openapi-generator/tree/master/modules/openapi-generator/src/main/resources/typescript-node

I don't know if it is easier to clone the generator into a new directory and create a whole new generator, or to introduce a flag and just patch around the calls to request with calls to got or someother modern HTTP library.

Please take a look and let me know what you think.

Let's continue the discussion on #414 since that is the relevant issue and I don't want to hijack this issue.

@brendandburns
Copy link
Contributor

@entropitor that code hasn't been released in any published versions yet, so I don't think it is causing this issue. (we should make it configurable though)

@brendandburns
Copy link
Contributor

brendandburns commented Nov 28, 2020

@maxweisel if you want to re-implement the watch code to use http 2 I'd be glad to take a PR.

fwiw, though, the Kubernetes API was not intended to be an event bus. I think you will find that using a true event bus like Kafka is a more resilient architecture, alternately, you could register an admission controller that gets called whenever a Pod is created, which would get you out of the watch business entirely.

@maxweisel
Copy link
Author

Thanks for the notes. My use case is more complex than I let on and an admission controller is not a reasonable solution for me unfortunately :(

I'd be happy to throw an http2 ping PR together. A quick glance at request looks like I can't easily use http2 there. got looks great, but it uses http2-wrapper which makes accessing the http2session object needed to send a ping a little difficult.

I'm tempted to start with my own RequestInterface object that uses the raw http2 nodejs API, but I don't know if that's something you'd want upstream as it would mean watch would use a different request mechanism than all other requests in the library.

I'm going to experiment a bit and will report back.

@brendandburns
Copy link
Contributor

@maxweisel thanks for being willing to take on a PR. I don't mind if we use a totally separate HTTP2 API for watch, however we should definitely make sure that we can reuse all of the authentication and configuration code.

e.g. if you wanted to re-implement watch.ts with a native node HTTP2 I think that's fine, as long as the interfaces/apis defined in that file stay the same.

@entropitor
Copy link

What is the best way to move forward?

@brendanburns Do you see a way of fixing this bug this without having to go to HTTP2?

@geoffpatehutch
Copy link

geoffpatehutch commented Jan 19, 2021

I found this issue as we're seeing the same problem. Sorry this isn't a full PR fix, but as a slightly fudged workaround, we fixed the issue by wrapping the Watch and Informer in our own class and injecting an HTTP2 request (rather than using the default request implementation) which we can keep open and check the status of the underlying stream. Hope this example code helps someone:

`
const http2 = require("http2");
const k8s = require("@kubernetes/client-node");

function KubernetesPodCache(namespace) {
this.informer = null;
this.namespace = namespace;

const onError = this.onError;
const onAdd = this.onAdd;
const onUpdate = this.onUpdate;
const onDelete = this.onDelete;

const that = this;
const k8sConfig = new k8s.KubeConfig();
k8sConfig.loadFromDefault();

const k8sApi = k8sConfig.makeApiClient(k8s.CoreV1Api);

const listFn = () => { 
    return k8sApi.listNamespacedPod(this.namespace);
};

const http2Request = {
    webRequest(opts, callback) {
        const connectionOptions = {};
        k8sConfig.applyToRequest(connectionOptions);

        const url = new URL(opts.uri);
        const host = `${url.protocol}//${url.host}`;

        const http2ClientSession = http2.connect(host, { ca: connectionOptions.ca });
        http2ClientSession.on("error", (err) => {
            console.error(err);
        });

        let path = `/api/v1/namespaces/${that.namespace}/pods?watch=true`;
        if (opts && opts.qs && opts.qs.resourceVersion) {
            path += `&resourceVersion=${opts.qs.resourceVersion}`;
        }

        const requestHeaders = { ":method": "GET", ":path": path }; 
        requestHeaders.Authorization = connectionOptions.headers.Authorization;

        const requestOptions = {
            "endStream": false
        };
            
        const http2Stream = http2ClientSession.request(requestHeaders, requestOptions);
        let count = 0;
        const pingInterval = setInterval(() => {
            let payload = count.toString().padStart(8, "0");
            payload = payload.slice(payload.length - 8);
            http2ClientSession.ping(Buffer.from(payload), (error, duration, payload) => {
                if ((error || http2Stream.closed) && that.informer && !that.informer.stopped) {
                    console.log(`Stopping Informer and clearing listeners and ping interval... error: ${error}, http2Stream.closed: ${http2Stream.closed}`);
                    that.informer.off("error", onError);
                    that.informer.off("add", onAdd);
                    that.informer.off("update", onUpdate);
                    that.informer.off("delete", onDelete);
                    that.informer.stop();
                    clearInterval(pingInterval);
                    http2ClientSession.close();
                    if (!http2Stream.closed) {
                        http2Stream.emit("close");
                    }

                    console.log("...Informer Stopped. Will auto restart soon");
                }

                count++;
            });
        }, 2000);
        http2Stream.on("end", () => {
            http2Stream.close();
        });
        
        return http2Stream;
    }
};

const watch = new k8s.Watch(k8sConfig, http2Request);
this.informer = new k8s.ListWatch(`/api/v1/namespaces/${this.namespace}/pods`, watch, listFn, false);

this.start(onError, onAdd, onUpdate, onDelete);

setInterval(() => {
    if (this.informer && this.informer.stopped) {
        console.log("Detected Informer stopped - restarting...");
        this.start(onError, onAdd, onUpdate, onDelete);
        console.log("...Informer started.");
    }

}, 30000);

}

KubernetesPodCache.prototype.start = function (onError, onAdd, onUpdate, onDelete) {
this.informer.start();
this.informer.on("error", onError);
this.informer.on("add", onAdd);
this.informer.on("update", onUpdate);
this.informer.on("delete", onDelete);
}

KubernetesPodCache.prototype.onError = function (error) {
console.error(error);
}

KubernetesPodCache.prototype.onAdd = function (obj) {
console.log("add " + obj.metadata.name + ", " + obj.status.phase);
};

KubernetesPodCache.prototype.onUpdate = function (obj) {
console.log("update " + obj.metadata.name + ", " + obj.status.phase);
};

KubernetesPodCache.prototype.onDelete = function (obj) {
console.log("delete " + obj.metadata.name + ", " + obj.status.phase);
};
`

@arsnyder16
Copy link

@geoffpatehutch Have you seen any issues with the code above? I am running into a similar problem where the watch connection closed but the library still thinks its open.

@brendandburns Is there any thought to adjusting the implementation simply for watch?

@arsnyder16
Copy link

@brendandburns I took @geoffpatehutch idea here and made a more general purpose drop in replacement for the request implementation for watches, i am goign to test this out the next few days. I realize there has been work in master changing the signature of this so it no longer takes a callback. what i have below matches the latest release 0.12.1

const http2 = require('http2');

const queryString = require('query-string');


class WatchRequest {
  webRequest(options, callback) {
    const {
      ca, headers, method, qs, uri
    } = options;
    const url = new URL(uri);
    const session = http2.connect(url, { ca });
    let ping = null;
    let error = '';
    session.on('error', err => error += err);
    session.on('close', () => {
      clearInterval(ping);
      callback(error);
    });
    const stream = session.request({ 
      ...headers,
      ':method': method, 
      ':path'  : `${url.pathname}?${queryString.stringify(qs)}`,
      'accept' : 'application/json'
    }, { 'endStream': false });
    stream.setEncoding('utf8');
    ping = setInterval(() => {
      session.ping(error => {
        if (error || stream.closed) {
          clearInterval(ping);
          if (!session.destroyed) {
            session.destroy(error || 'stream was closed');
          } else {
            console.error('session was already destroyed this is unexpected');
          }
        }
      });
    }, 2000);
    stream.on('error', () => {/* no opt this will allow session 'error' to be emitted instead of throwing an exception */});
    stream.on('close', () => {
      clearInterval(ping);
      session.close();
    });
    return stream;
  }
}

usage is just simply

const watch = new k8s.Watch(config, new WatchRequest());

@arsnyder16
Copy link

@brendandburns I was able to reproduce this fairly consistently in an AKS cluster, where the connection would seem to be open but in reality it was not

Curious if you experience the same:

https://github.com/arsnyder16/watch-rst-issue

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 16, 2021
@maxweisel
Copy link
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2021
@maxweisel
Copy link
Author

I got sucked onto another project, but I'm finally getting around to working on a PR for this. @arsnyder16 I'm a big fan of your approach of using vanilla the built-in http2 module. did you happen to have a version of this that works with the new callback-less RequestInterface format?

@maxweisel
Copy link
Author

I've started looking into this one, and it appears this bug should already be fixed now potentially.

The fix comes from PR #630, which uses tcp keepalive to send an empty packet on a regular basis to check if the connection is alive. The net-keepalive dependency was dropped in a3e6c0a, which is awesome too. I'm a huge fan of this approach too if it means we don't need to use http2 ping frames, however, so far it hasn't been that easy.

At the moment, both the net-keepalive version and the updated version do not work for me on mac or linux. Some quick napkin math suggests it can take up to 5 minutes and 30 seconds if there's a default TCP_KEEP_CNT value of 10 and an initial 30 second delay, but even after 10 minutes the connection does not close for me.

I'm going to see if I can get tcp keepalive working on a vanilla socket over here, and once I've got that working, I'll backport it to this and we can finally close this issue.

Max

@maxweisel
Copy link
Author

I've done some more testing and can confirm that 0.14.3 works correctly on GKE! My test connects to a watch stream on a machine that I control. I manually start blocking tcp packets via iptables and in versions before 0.14.2 it will stay connected forever, but in 0.14.2 and 0.14.3 it will disconnect after 30 seconds when using Node 12 LTS and Node 14 LTS (I didn't test any others).

It's worth noting, it seems tcp keepalive mileage can vary wildly depending on your OS / kernel settings, so folks who are not using this with a linux image provided by a cloud provider may have issues with this (my vanilla Ubuntu Desktop VM does not work for example)

I believe we can also close #596 too, but I'll let @jkryl / @brendanburns handle that.

@entropitor
Copy link

@maxweisel Given it doesn't work with e.g. Ubuntu, shouldn't we consider this an open issue?

@maxweisel
Copy link
Author

Ubuntu Server images do work for me, it's purely the Ubuntu Desktop VM that did not work, but it's possible that was also due to my VM's virtual network adapter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants