
Informer stops receiving new events after some time #4781

Closed
gyfora opened this issue Jan 19, 2023 · 16 comments
Labels
Waiting on feedback (Issues that require feedback from User/Other community members)
Milestone
6.5.0
Comments

@gyfora

gyfora commented Jan 19, 2023

Describe the bug

We are running an operator built on top of the Java Operator SDK, which uses fabric8 6.2.0 internally.

In some cases the informers simply stop receiving new events from Kubernetes and therefore the CR changes are not picked up by our operator. There are no errors in the logs and it never recovers from this.

Restarting the operator process always solves the issue.

We saw this more frequently in the past while still using fabric8 5.x.x, but recently we have also seen it with fabric8 6.2.0.

Fabric8 Kubernetes Client version

6.2.0

Steps to reproduce

Cannot be deterministically reproduced.

Expected behavior

We would expect the informers to recover in these cases.

Runtime

other (please specify in additional context)

Kubernetes API Server version

other (please specify in additional context)

Environment

other (please specify in additional context)

Fabric8 Kubernetes Client Logs

No response

Additional context

Kubernetes v1.20.15

@shawkins
Contributor

Please check the logs for an indication that the informer has shut down, or use the stopped CompletableFuture to register an onCompletion handler. Also check the number of informers / watches you have running - okhttp provides no feedback when the concurrent request limit is reached and instead simply stops processing new requests. This is being addressed in 6.4.
I believe the only additional related change in 6.3/6.4 is #4641, which addresses a regression with missing initial events.

If the informer does not seem to be shutting down and you have not exhausted the concurrent request limit, we'll need a debug log or some other way of tracking down what you are seeing.
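
For reference, a minimal sketch of the suggestion above - registering an onCompletion handler on the informer's stopped CompletableFuture so an unexpected shutdown is at least logged. The FlinkDeployment resource class and the namespace are placeholders taken from this thread; the actual setup inside the Java Operator SDK will differ:

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class InformerStopMonitor {

  public static void main(String[] args) {
    KubernetesClient client = new KubernetesClientBuilder().build();

    // FlinkDeployment is the custom resource discussed in this issue; its import is
    // omitted because it depends on the operator project. Substitute your own type.
    SharedIndexInformer<FlinkDeployment> informer = client
        .resources(FlinkDeployment.class)
        .inNamespace("flink-system") // hypothetical namespace
        .inform(new ResourceEventHandler<FlinkDeployment>() {
          @Override
          public void onAdd(FlinkDeployment obj) { /* handle add */ }

          @Override
          public void onUpdate(FlinkDeployment oldObj, FlinkDeployment newObj) { /* handle update */ }

          @Override
          public void onDelete(FlinkDeployment obj, boolean deletedFinalStateUnknown) { /* handle delete */ }
        });

    // The stopped() future completes when the informer terminates; an exceptional
    // completion points at an abnormal shutdown that would otherwise go unnoticed.
    informer.stopped().whenComplete((unused, throwable) -> {
      if (throwable != null) {
        System.err.println("Informer stopped exceptionally: " + throwable);
      } else {
        System.err.println("Informer stopped");
      }
    });
  }
}
```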

@csviri
Contributor

csviri commented Jan 24, 2023

@gyfora can you specify on which resource type this happened?

@gyfora
Author

gyfora commented Jan 24, 2023

@gyfora can you specify on which resource type this happened?

This happened on a custom resource (FlinkDeployment), but in one instance we also observed an informer event source that stopped working for a regular Deployment.

@gyfora
Author

gyfora commented Feb 9, 2023

The debug logs are not very helpful; there are no errors from the informers. The most we see from fabric8 is the following:

io.fabric8.kubernetes.client.informers.impl.cache.Reflector
05:32:40 -  Watch restarting due to http gone for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments
05:32:41 - Listing items (0) for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments at v3097113831
05:32:41 - Starting watcher for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments at v3097113831
05:32:41 - Watch started for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments

I see these every once in a while for different namespaces and resource types.

We don't see other logs when we don't get events; we only see that we didn't get the "event received" logs that we usually do :)

@gyfora
Author

gyfora commented Feb 9, 2023

Also, the above-mentioned logs are from fabric8 6.4.0.

@shawkins
Contributor

shawkins commented Feb 9, 2023

05:32:40 - Watch restarting due to http gone for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments

Likely implies that bookmarks aren't supported.

05:32:41 - Listing items (0) for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments at v3097113831

This should be a full / fresh listing. It doesn't see any flinkdeployments at that time - is that expected / correct?

We don't see other logs when we don't get events; we only see that we didn't get the "event received" logs that we usually do :)

So by the logs the watch is running. Assuming there is another restart due to http gone, do you see the items exist at that time? Or like the previous logs is it always a 0 count?

@gyfora
Author

gyfora commented Feb 9, 2023

Sorry, this log was from a namespace which did not have any flinkdeployment resources. But other namespaces show the count:

Listing items (9) for apps/v1/namespaces/flink/deployments at v21222048381
...
Listing items (9) for flink.apache.org/v1beta1/namespaces/flink/flinkdeployments at v21222045844
...

However, these restarts seem to happen randomly, not for all namespaces: once a day, or sometimes a little more frequently.

@shawkins
Contributor

shawkins commented Feb 9, 2023

Sorry, this log was from a namespace which did not have any flinkdeployment resources. But other namespaces show the count:

And prior to that there were events for those deployments, but they were not seen by the watch?

If that's the case it does sound similar to kubernetes/kubernetes#102464 - where the api server would stop delivering events after a while.

Ideally you should be able to reproduce this for a given resource / kube version outside of the fabric8 client - would it be possible to try something like kubernetes/kubernetes#102464 (comment) for your resource to confirm that there is an api server issue?

Something that should help with this is #4675, which would shorten the life of the watches - the proper fix was hard to implement given our reliance on mock expectations. I'll revisit it - it should be simpler to have our watch manager kill the watches after a while. However, if the issue is not related to the watch timeout, this will not help.

However, these restarts seem to happen randomly, not for all namespaces: once a day, or sometimes a little more frequently.

It will take until the resourceVersion becomes "too old", which generally depends on all modifications happening on the cluster that increment the resourceVersion.

@gyfora
Author

gyfora commented Feb 9, 2023

@shawkins unfortunately it seems like our environment already contains the fix for the issue you linked. It would be really great if we had a simple way to just kill/reset the informers / watches periodically if they don't get any events.

I know that doesn't really help with finding the root cause but this has a pretty big impact on us and we are not really sure how to debug it further.

@shawkins
Contributor

shawkins commented Feb 9, 2023

unfortunately it seems like our environment already contains the fix for the issue you linked.

If you don't see any logs from the AbstractWatchManager or WatcherWebSocketListener about reconnecting in between the resource too old exceptions, then a very similar issue is at play even if it's not the same one.

It would be really great if we had a simple way to just kill/reset the informers / watches periodically if they don't get any events.

#4675 will limit the amount of time a watch runs to 5 - 10 minutes, but obviously won't be immediately available.

You can also try to adjust your api server's watch timeout setting - see https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/ --min-request-timeout. It defaults to 60 minutes, which means that fabric8 watches will run for 1-2 hours until they are terminated by Kubernetes - so again, if the watches are running longer than 2 hours without a server-initiated reconnect and without receiving events, then it's definitely another api-server bug.

Or, if it's not too much downtime, you can always just terminate / restart your entire process every hour.

I know that doesn't really help with finding the root cause but this has a pretty big impact on us and we are not really sure how to debug it further.

Perhaps just alter the Go sample attached to the Kubernetes issue to run without a timeout, and see if / when it stops receiving events.
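
Before #4675 is available, one stop-gap along the lines of the workarounds above is to periodically tear down and re-create the informer so a silently stalled watch cannot outlive the restart interval. A minimal sketch, assuming an existing client and event handler; the ConfigMap resource type, namespace, class name, and one-hour interval are illustrative only:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

/** Hypothetical helper that re-creates an informer on a fixed schedule. */
public class PeriodicInformerRestart {

  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final AtomicReference<SharedIndexInformer<ConfigMap>> current = new AtomicReference<>();

  public void start(KubernetesClient client, ResourceEventHandler<ConfigMap> handler) {
    Runnable restart = () -> {
      SharedIndexInformer<ConfigMap> previous = current.getAndSet(null);
      if (previous != null) {
        previous.stop(); // tear down the possibly stalled watch
      }
      current.set(client.configMaps()
          .inNamespace("flink-system") // hypothetical namespace
          .inform(handler)); // fresh list + watch
    };
    restart.run(); // initial start
    scheduler.scheduleAtFixedRate(restart, 1, 1, TimeUnit.HOURS); // hourly reset
  }
}
```

This is strictly a mitigation; as noted above, it does not address the underlying api-server behaviour.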

@manusa
Member

manusa commented Feb 20, 2023

#4675 should take care of this; as soon as 6.5.0 is available, this issue should be solved. Tomorrow's SNAPSHOT build should contain the fix too.

@manusa manusa added this to the 6.5.0 milestone Feb 20, 2023
@gyfora
Author

gyfora commented Feb 20, 2023

Great! Thank you very much for taking care of this @shawkins @manusa

@manusa
Member

manusa commented Mar 11, 2023

v6.5.0 was just released. Could you please confirm whether this is working now so we can close the issue?

@manusa manusa closed this as completed Mar 15, 2023
@MichaelVoegele

Hello @gyfora
Have you already used v6.5.x and can you confirm that this issue has been fixed?
I am using v6.1.1 and I am experiencing the same issue. I have a simple operator watching a specific kind of ConfigMap (with a specific label), and the informers stop receiving events after some time without any obvious reason (nothing in the logs / no exceptions / ...). It happens very rarely: we have about 15 kube clusters with the operator deployed to each of them; sometimes the operators run for weeks without issues, then suddenly stop receiving events in some cluster, and the pod needs to be restarted.
Thanks in advance for a confirmation.
Regards
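
For context, a minimal sketch of how a label-filtered ConfigMap informer like the one described above can be set up with the fabric8 client; the label key/value are hypothetical placeholders, not the actual selector used in that operator:

```java
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class LabelledConfigMapInformer {

  public static void main(String[] args) {
    KubernetesClient client = new KubernetesClientBuilder().build();

    SharedIndexInformer<ConfigMap> informer = client.configMaps()
        .inAnyNamespace()
        .withLabel("example.com/watched", "true") // hypothetical label selector
        .inform(new ResourceEventHandler<ConfigMap>() {
          @Override
          public void onAdd(ConfigMap cm) {
            System.out.println("ADD " + cm.getMetadata().getName());
          }

          @Override
          public void onUpdate(ConfigMap oldCm, ConfigMap newCm) {
            System.out.println("UPDATE " + newCm.getMetadata().getName());
          }

          @Override
          public void onDelete(ConfigMap cm, boolean deletedFinalStateUnknown) {
            System.out.println("DELETE " + cm.getMetadata().getName());
          }
        });

    // As discussed earlier in this thread, the stopped() future signals when the
    // informer terminates, which helps detect a silent shutdown.
    informer.stopped().whenComplete((unused, t) ->
        System.err.println("ConfigMap informer stopped" + (t != null ? " exceptionally: " + t : "")));
  }
}
```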

@gyfora
Author

gyfora commented Mar 23, 2023

We are in the process of upgrading to the newer version, @MichaelVoegele, so we cannot confirm yet. It may be best if you try it yourself if you can.

@MichaelVoegele

Yep @gyfora, I just decided to do that and use v6.5.1. Cool, looking forward to seeing this fixed :-). Thanks, guys, for your awesome work on this fabric8 client!
