Unable to detect if a watch is active #559

maxweisel · 2020-11-24T18:52:32Z

This is the same bug that occurred in the Go client library here: kubernetes/kubernetes#65012

Essentially if you create a watch stream and the connection is severed, the stream is unaware and believes it is open. It will stay open forever and just stop reporting events instead of trying to reconnect. This occurs anytime our k8s control plane is updated. Ideally it would detect the broken connection and reconnect to another replica or at least mark the connection closed.

It seems the solution is to occasionally http2 ping the connection to ensure that it's still open, and close if the ping is not received.

I've dug into the library a little and it seems watch.js uses request which is now deprecated. The watch class does allow specifying my own request implementation in the constructor, which I imagine I could use to add the ping feature, but I imagine this is a bug that will impact everyone who uses the watch event, so I figured I'd file an issue.

Happy to contribute a patch that includes this functionality, I'd love some guidance on what the preferred direction should be. Is there a suitable replacement for request that I could implement while I'm working on this patch?

Max

brendandburns · 2020-11-25T16:23:36Z

Thanks, please see the discussion of alternatives to request here:

#414

The main problem with switching libraries is that all of this code is generated by the OpenAPI Generator project:
https://github.com/OpenAPITools/openapi-generator

We could change generators, but that would also result in significant breaking API changes in the library. I think that the best option is to fork the request generator in OpenAPI Generator (typescript-node) and switch it to a different HTTP library while maintaining the same generated API signatures.

This is a pretty significant undertaking though, and for now no one has taken it on. If there is interest, I'm happy to advise on what needs to be done. I don't have the cycles for it personally, but I can act as a guide.

Regarding hanging watches, I think the best answer is to set the socket timeout pretty low (e.g. 1 minute) which should cause the TCP connection to reset anyway.

burn2delete · 2020-11-26T21:29:24Z

@brendandburns I would be willing to attempt migrating to a new generator.

entropitor · 2020-11-27T09:06:54Z

@maxweisel Thank you for discovering this! We have been plagued by this and I wasn't sure what the problem was

@jhagestedt It seems 1b313ce might be the culprit. Why is the timeout set to 0?

(Although I'm using 0.11.1 and I also seem to have the problem)

entropitor · 2020-11-27T09:25:52Z

I also wonder whether https://github.com/kubernetes-client/javascript/blame/master/src/cache.ts#L86 should maybe be before calling the error handlers so that if the error handlers immediately restart, it might end up in a stopped state because we first started (stopped = false) and then the error handler sets it back to "stopped"

maxweisel · 2020-11-28T18:03:17Z

@brendanburns fwiw, I don't think setting the socket timeout to one minute is an acceptable solution here. Our control plane runs a very large number of pods and when the stream initially connects it will send the entire list of all running pods and their current states.

Our application fields requests by spinning up pods to do work. If our watch stream goes down, our service grinds to a halt as it can't confirmation that work for the request has started. For our service to run properly, I need to detect that the stream has gone down and reconnect within 5-10 seconds. I could set the timeout to 5 seconds, but that means we'll be pulling the entire list of pods every 5 seconds which is also too much data to comb through.

I believe the correct solution here is to support http2 ping and ping every X seconds on the stream which is what the kubernetes go client is doing. The pings are cheap and the stream won't reset if the connection is active.

brendandburns · 2020-11-28T21:25:02Z

@flyboarder

I'd welcome the help.

The generator is here:

https://github.com/OpenAPITools/openapi-generator/tree/master/modules/openapi-generator/src/main/resources/typescript-node

I don't know if it is easier to clone the generator into a new directory and create a whole new generator, or to introduce a flag and just patch around the calls to request with calls to got or someother modern HTTP library.

Please take a look and let me know what you think.

Let's continue the discussion on #414 since that is the relevant issue and I don't want to hijack this issue.

brendandburns · 2020-11-28T21:26:20Z

@entropitor that code hasn't been released in any published versions yet, so I don't think it is causing this issue. (we should make it configurable though)

brendandburns · 2020-11-28T21:27:13Z

@maxweisel if you want to re-implement the watch code to use http 2 I'd be glad to take a PR.

fwiw, though, the Kubernetes API was not intended to be an event bus. I think you will find that using a true event bus like Kafka is a more resilient architecture, alternately, you could register an admission controller that gets called whenever a Pod is created, which would get you out of the watch business entirely.

maxweisel · 2020-11-30T18:39:29Z

Thanks for the notes. My use case is more complex than I let on and an admission controller is not a reasonable solution for me unfortunately :(

I'd be happy to throw an http2 ping PR together. A quick glance at request looks like I can't easily use http2 there. got looks great, but it uses http2-wrapper which makes accessing the http2session object needed to send a ping a little difficult.

I'm tempted to start with my own RequestInterface object that uses the raw http2 nodejs API, but I don't know if that's something you'd want upstream as it would mean watch would use a different request mechanism than all other requests in the library.

I'm going to experiment a bit and will report back.

brendandburns · 2020-12-02T01:03:02Z

@maxweisel thanks for being willing to take on a PR. I don't mind if we use a totally separate HTTP2 API for watch, however we should definitely make sure that we can reuse all of the authentication and configuration code.

e.g. if you wanted to re-implement watch.ts with a native node HTTP2 I think that's fine, as long as the interfaces/apis defined in that file stay the same.

entropitor · 2020-12-22T11:04:21Z

What is the best way to move forward?

@brendanburns Do you see a way of fixing this bug this without having to go to HTTP2?

geoffpatehutch · 2021-01-19T10:44:35Z

I found this issue as we're seeing the same problem. Sorry this isn't a full PR fix, but as a slightly fudged workaround, we fixed the issue by wrapping the Watch and Informer in our own class and injecting an HTTP2 request (rather than using the default request implementation) which we can keep open and check the status of the underlying stream. Hope this example code helps someone:

`
const http2 = require("http2");
const k8s = require("@kubernetes/client-node");

function KubernetesPodCache(namespace) {
this.informer = null;
this.namespace = namespace;

const onError = this.onError;
const onAdd = this.onAdd;
const onUpdate = this.onUpdate;
const onDelete = this.onDelete;

const that = this;
const k8sConfig = new k8s.KubeConfig();
k8sConfig.loadFromDefault();

const k8sApi = k8sConfig.makeApiClient(k8s.CoreV1Api);

const listFn = () => { 
    return k8sApi.listNamespacedPod(this.namespace);
};

const http2Request = {
    webRequest(opts, callback) {
        const connectionOptions = {};
        k8sConfig.applyToRequest(connectionOptions);

        const url = new URL(opts.uri);
        const host = `${url.protocol}//${url.host}`;

        const http2ClientSession = http2.connect(host, { ca: connectionOptions.ca });
        http2ClientSession.on("error", (err) => {
            console.error(err);
        });

        let path = `/api/v1/namespaces/${that.namespace}/pods?watch=true`;
        if (opts && opts.qs && opts.qs.resourceVersion) {
            path += `&resourceVersion=${opts.qs.resourceVersion}`;
        }

        const requestHeaders = { ":method": "GET", ":path": path }; 
        requestHeaders.Authorization = connectionOptions.headers.Authorization;

        const requestOptions = {
            "endStream": false
        };
            
        const http2Stream = http2ClientSession.request(requestHeaders, requestOptions);
        let count = 0;
        const pingInterval = setInterval(() => {
            let payload = count.toString().padStart(8, "0");
            payload = payload.slice(payload.length - 8);
            http2ClientSession.ping(Buffer.from(payload), (error, duration, payload) => {
                if ((error || http2Stream.closed) && that.informer && !that.informer.stopped) {
                    console.log(`Stopping Informer and clearing listeners and ping interval... error: ${error}, http2Stream.closed: ${http2Stream.closed}`);
                    that.informer.off("error", onError);
                    that.informer.off("add", onAdd);
                    that.informer.off("update", onUpdate);
                    that.informer.off("delete", onDelete);
                    that.informer.stop();
                    clearInterval(pingInterval);
                    http2ClientSession.close();
                    if (!http2Stream.closed) {
                        http2Stream.emit("close");
                    }

                    console.log("...Informer Stopped. Will auto restart soon");
                }

                count++;
            });
        }, 2000);
        http2Stream.on("end", () => {
            http2Stream.close();
        });
        
        return http2Stream;
    }
};

const watch = new k8s.Watch(k8sConfig, http2Request);
this.informer = new k8s.ListWatch(`/api/v1/namespaces/${this.namespace}/pods`, watch, listFn, false);

this.start(onError, onAdd, onUpdate, onDelete);

setInterval(() => {
    if (this.informer && this.informer.stopped) {
        console.log("Detected Informer stopped - restarting...");
        this.start(onError, onAdd, onUpdate, onDelete);
        console.log("...Informer started.");
    }

}, 30000);

}

KubernetesPodCache.prototype.start = function (onError, onAdd, onUpdate, onDelete) {
this.informer.start();
this.informer.on("error", onError);
this.informer.on("add", onAdd);
this.informer.on("update", onUpdate);
this.informer.on("delete", onDelete);
}

KubernetesPodCache.prototype.onError = function (error) {
console.error(error);
}

KubernetesPodCache.prototype.onAdd = function (obj) {
console.log("add " + obj.metadata.name + ", " + obj.status.phase);
};

KubernetesPodCache.prototype.onUpdate = function (obj) {
console.log("update " + obj.metadata.name + ", " + obj.status.phase);
};

KubernetesPodCache.prototype.onDelete = function (obj) {
console.log("delete " + obj.metadata.name + ", " + obj.status.phase);
};
`

arsnyder16 · 2021-02-11T22:16:38Z

@geoffpatehutch Have you seen any issues with the code above? I am running into a similar problem where the watch connection closed but the library still thinks its open.

@brendandburns Is there any thought to adjusting the implementation simply for watch?

arsnyder16 · 2021-02-12T14:17:02Z

@brendandburns I took @geoffpatehutch idea here and made a more general purpose drop in replacement for the request implementation for watches, i am goign to test this out the next few days. I realize there has been work in master changing the signature of this so it no longer takes a callback. what i have below matches the latest release 0.12.1

const http2 = require('http2');

const queryString = require('query-string');


class WatchRequest {
  webRequest(options, callback) {
    const {
      ca, headers, method, qs, uri
    } = options;
    const url = new URL(uri);
    const session = http2.connect(url, { ca });
    let ping = null;
    let error = '';
    session.on('error', err => error += err);
    session.on('close', () => {
      clearInterval(ping);
      callback(error);
    });
    const stream = session.request({ 
      ...headers,
      ':method': method, 
      ':path'  : `${url.pathname}?${queryString.stringify(qs)}`,
      'accept' : 'application/json'
    }, { 'endStream': false });
    stream.setEncoding('utf8');
    ping = setInterval(() => {
      session.ping(error => {
        if (error || stream.closed) {
          clearInterval(ping);
          if (!session.destroyed) {
            session.destroy(error || 'stream was closed');
          } else {
            console.error('session was already destroyed this is unexpected');
          }
        }
      });
    }, 2000);
    stream.on('error', () => {/* no opt this will allow session 'error' to be emitted instead of throwing an exception */});
    stream.on('close', () => {
      clearInterval(ping);
      session.close();
    });
    return stream;
  }
}

usage is just simply

const watch = new k8s.Watch(config, new WatchRequest());

arsnyder16 · 2021-02-15T02:41:03Z

@brendandburns I was able to reproduce this fairly consistently in an AKS cluster, where the connection would seem to be open but in reality it was not

Curious if you experience the same:

https://github.com/arsnyder16/watch-rst-issue

fejta-bot · 2021-05-16T02:49:16Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

maxweisel · 2021-05-17T13:59:56Z

/remove-lifecycle stale

maxweisel · 2021-05-30T23:43:47Z

I got sucked onto another project, but I'm finally getting around to working on a PR for this. @arsnyder16 I'm a big fan of your approach of using vanilla the built-in http2 module. did you happen to have a version of this that works with the new callback-less RequestInterface format?

maxweisel · 2021-06-03T19:12:52Z

I've started looking into this one, and it appears this bug should already be fixed now potentially.

The fix comes from PR #630, which uses tcp keepalive to send an empty packet on a regular basis to check if the connection is alive. The net-keepalive dependency was dropped in a3e6c0a, which is awesome too. I'm a huge fan of this approach too if it means we don't need to use http2 ping frames, however, so far it hasn't been that easy.

At the moment, both the net-keepalive version and the updated version do not work for me on mac or linux. Some quick napkin math suggests it can take up to 5 minutes and 30 seconds if there's a default TCP_KEEP_CNT value of 10 and an initial 30 second delay, but even after 10 minutes the connection does not close for me.

I'm going to see if I can get tcp keepalive working on a vanilla socket over here, and once I've got that working, I'll backport it to this and we can finally close this issue.

Max

maxweisel · 2021-06-04T16:36:51Z

I've done some more testing and can confirm that 0.14.3 works correctly on GKE! My test connects to a watch stream on a machine that I control. I manually start blocking tcp packets via iptables and in versions before 0.14.2 it will stay connected forever, but in 0.14.2 and 0.14.3 it will disconnect after 30 seconds when using Node 12 LTS and Node 14 LTS (I didn't test any others).

It's worth noting, it seems tcp keepalive mileage can vary wildly depending on your OS / kernel settings, so folks who are not using this with a linux image provided by a cloud provider may have issues with this (my vanilla Ubuntu Desktop VM does not work for example)

I believe we can also close #596 too, but I'll let @jkryl / @brendanburns handle that.

entropitor · 2021-06-14T13:42:13Z

@maxweisel Given it doesn't work with e.g. Ubuntu, shouldn't we consider this an open issue?

maxweisel · 2021-06-14T14:56:55Z

Ubuntu Server images do work for me, it's purely the Ubuntu Desktop VM that did not work, but it's possible that was also due to my VM's virtual network adapter.

dominykas mentioned this issue Feb 25, 2021

Empty catch block in watch stream event handling #593

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 16, 2021

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2021

maxweisel closed this as completed Jun 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to detect if a watch is active #559

Unable to detect if a watch is active #559

maxweisel commented Nov 24, 2020

brendandburns commented Nov 25, 2020

burn2delete commented Nov 26, 2020

entropitor commented Nov 27, 2020

entropitor commented Nov 27, 2020

maxweisel commented Nov 28, 2020

brendandburns commented Nov 28, 2020

brendandburns commented Nov 28, 2020

brendandburns commented Nov 28, 2020 •

edited

maxweisel commented Nov 30, 2020

brendandburns commented Dec 2, 2020

entropitor commented Dec 22, 2020

geoffpatehutch commented Jan 19, 2021 •

edited

arsnyder16 commented Feb 11, 2021

arsnyder16 commented Feb 12, 2021

arsnyder16 commented Feb 15, 2021

fejta-bot commented May 16, 2021

maxweisel commented May 17, 2021

maxweisel commented May 30, 2021

maxweisel commented Jun 3, 2021

maxweisel commented Jun 4, 2021

entropitor commented Jun 14, 2021

maxweisel commented Jun 14, 2021

Unable to detect if a watch is active #559

Unable to detect if a watch is active #559

Comments

maxweisel commented Nov 24, 2020

brendandburns commented Nov 25, 2020

burn2delete commented Nov 26, 2020

entropitor commented Nov 27, 2020

entropitor commented Nov 27, 2020

maxweisel commented Nov 28, 2020

brendandburns commented Nov 28, 2020

brendandburns commented Nov 28, 2020

brendandburns commented Nov 28, 2020 • edited

maxweisel commented Nov 30, 2020

brendandburns commented Dec 2, 2020

entropitor commented Dec 22, 2020

geoffpatehutch commented Jan 19, 2021 • edited

arsnyder16 commented Feb 11, 2021

arsnyder16 commented Feb 12, 2021

arsnyder16 commented Feb 15, 2021

fejta-bot commented May 16, 2021

maxweisel commented May 17, 2021

maxweisel commented May 30, 2021

maxweisel commented Jun 3, 2021

maxweisel commented Jun 4, 2021

entropitor commented Jun 14, 2021

maxweisel commented Jun 14, 2021

brendandburns commented Nov 28, 2020 •

edited

geoffpatehutch commented Jan 19, 2021 •

edited