
Informer stops receiving new events after some time #4781

Closed
gyfora opened this issue Jan 19, 2023 · 16 comments
Labels
Waiting on feedback (Issues that require feedback from User/Other community members)
Milestone
6.5.0
Comments

@gyfora

gyfora commented Jan 19, 2023

Describe the bug

We are running an operator built on top of the Java Operator SDK, which uses fabric8 6.2.0 internally.

In some cases the informers simply stop receiving new events from Kubernetes and therefore the CR changes are not picked up by our operator. There are no errors in the logs and it never recovers from this.

Restarting the operator process always solves the issue.

We saw this more frequently in the past while still using fabric8 5.x.x, but recently we have also seen it with fabric8 6.2.0.

Fabric8 Kubernetes Client version

6.2.0

Steps to reproduce

Cannot be deterministically reproduced.

Expected behavior

We would expect the informers to recover in these cases.

Runtime

other (please specify in additional context)

Kubernetes API Server version

other (please specify in additional context)

Environment

other (please specify in additional context)

Fabric8 Kubernetes Client Logs

No response

Additional context

Kubernetes v1.20.15

@shawkins
Contributor

Please check the logs for an indication that the informer has shut down, or use the stopped CompletableFuture to register an onCompletion handler. Also check the number of informers / watches you have running - okhttp provides no feedback when the concurrent request limit is reached and instead simply stops processing new requests. This is being addressed in 6.4.
I believe the only additional related change in 6.3/6.4 is #4641, which addresses a regression with missing initial events.

If the informer does not seem to be shutting down and you have not exhausted the concurrent request limit, we'll need a debug log or some other way of tracking down what you are seeing.
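
For reference, a minimal sketch of the suggestion above - registering an onCompletion handler on the informer's stopped CompletableFuture so an unexpected shutdown is at least logged. The FlinkDeployment resource class and the namespace are placeholders taken from this thread; the actual setup inside the Java Operator SDK will differ:

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class InformerStopMonitor {

  public static void main(String[] args) {
    KubernetesClient client = new KubernetesClientBuilder().build();

    // FlinkDeployment is the custom resource discussed in this issue; its import is
    // omitted because it depends on the operator project. Substitute your own type.
    SharedIndexInformer<FlinkDeployment> informer = client
        .resources(FlinkDeployment.class)
        .inNamespace("flink-system") // hypothetical namespace
        .inform(new ResourceEventHandler<FlinkDeployment>() {
          @Override
          public void onAdd(FlinkDeployment obj) { /* handle add */ }

          @Override
          public void onUpdate(FlinkDeployment oldObj, FlinkDeployment newObj) { /* handle update */ }

          @Override
          public void onDelete(FlinkDeployment obj, boolean deletedFinalStateUnknown) { /* handle delete */ }
        });

    // The stopped() future completes when the informer terminates; an exceptional
    // completion points at an abnormal shutdown that would otherwise go unnoticed.
    informer.stopped().whenComplete((unused, throwable) -> {
      if (throwable != null) {
        System.err.println("Informer stopped exceptionally: " + throwable);
      } else {
        System.err.println("Informer stopped");
      }
    });
  }
}
```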

@csviri
Contributor

csviri commented Jan 24, 2023

@gyfora can you specify on which resource type this happened?

@gyfora
Author

gyfora commented Jan 24, 2023

@gyfora can you specify on which resource type this happened?

This happened on a custom resource (FlinkDeployment), but in one instance we also observed an informer event source that stopped working for a regular Deployment.

@gyfora
Author

gyfora commented Feb 9, 2023

The debug logs are not very helpful; there are no errors from the informers. The most we see from fabric8 is the following:

io.fabric8.kubernetes.client.informers.impl.cache.Reflector
05:32:40 -  Watch restarting due to http gone for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments
05:32:41 - Listing items (0) for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments at v3097113831
05:32:41 - Starting watcher for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments at v3097113831
05:32:41 - Watch started for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments

I see these every once in a while for different namespaces and resource types.

We don't see other logs when we don't get events; we only see that we didn't get the "event received" logs that we usually do :)

@gyfora
Author

gyfora commented Feb 9, 2023

Also, the above-mentioned logs are from fabric8 6.4.0.

@shawkins
Contributor

shawkins commented Feb 9, 2023

05:32:40 - Watch restarting due to http gone for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments

Likely implies that bookmarks aren't supported.

05:32:41 - Listing items (0) for flink.apache.org/v1beta1/namespaces/flink-system/flinkdeployments at v3097113831

This should be a full / fresh listing. It doesn't see any flinkdeployments at that time - is that expected / correct?

We don't see other logs when we don't get events; we only see that we didn't get the "event received" logs that we usually do :)

So by the logs the watch is running. Assuming there is another restart due to http gone, do you see the items exist at that time? Or like the previous logs is it always a 0 count?

@gyfora
Author

gyfora commented Feb 9, 2023

Sorry, this log was from a namespace which did not have any flinkdeployment resources. But other namespaces show the count:

Listing items (9) for apps/v1/namespaces/flink/deployments at v21222048381
...
Listing items (9) for flink.apache.org/v1beta1/namespaces/flink/flinkdeployments at v21222045844
...

However, these restarts seem to happen randomly, not for all namespaces: once a day, or sometimes a little more frequently.

@shawkins
Contributor

shawkins commented Feb 9, 2023

Sorry, this log was from a namespace which did not have any flinkdeployment resources. But other namespaces show the count:

And prior to that there were events for those deployments, but they were not seen by the watch?

If that's the case it does sound similar to kubernetes/kubernetes#102464 - where the api server would stop delivering events after a while.

Ideally you should be able to reproduce this for a given resource / kube version outside of the fabric8 client - would it be possible to try something like kubernetes/kubernetes#102464 (comment) for your resource to confirm that there is an api server issue?

Something that should help with this is #4675, which would shorten the life of the watches - the proper fix was hard to implement given our reliance on mock expectations. I'll revisit it - it should be simpler to have our watch manager kill the watches after a while. However, if the issue is not related to the watch timeout, this will not help.

However, these restarts seem to happen randomly, not for all namespaces: once a day, or sometimes a little more frequently.

It will take until the resourceVersion becomes "too old", which generally depends on all modifications happening on the cluster that increment the resourceVersion.

@gyfora
Author

gyfora commented Feb 9, 2023

@shawkins unfortunately it seems like our environment already contains the fix for the issue you linked. It would be really great if we had a simple way to just kill/reset the informers / watches periodically if they don't get any events.

I know that doesn't really help with finding the root cause but this has a pretty big impact on us and we are not really sure how to debug it further.

@shawkins
Contributor

shawkins commented Feb 9, 2023

unfortunately it seems like our environment already contains the fix for the issue you linked.

If you don't see any logs from the AbstractWatchManager or WatcherWebSocketListener about reconnecting in between the resource too old exceptions, then a very similar issue is at play even if it's not the same one.

It would be really great if we had a simple way to just kill/reset the informers / watches periodically if they don't get any events.

#4675 will limit the amount of time a watch runs to 5 - 10 minutes, but obviously won't be immediately available.

You can also try to adjust your api server's watch timeout setting - see https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/ --min-request-timeout. It defaults to 60 minutes, which means that fabric8 watches will run for 1-2 hours until they are terminated by Kubernetes - so again, if the watches are running longer than 2 hours without a server-initiated reconnect and without receiving events, then it's definitely another api-server bug.

Or, if it's not too much downtime, you can always just terminate / restart your entire process every hour.

I know that doesn't really help with finding the root cause but this has a pretty big impact on us and we are not really sure how to debug it further.

Perhaps just alter the Go sample attached to the Kubernetes issue to run without a timeout, and see if / when it stops receiving events.
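
Before #4675 is available, one stop-gap along the lines of the workarounds above is to periodically tear down and re-create the informer so a silently stalled watch cannot outlive the restart interval. A minimal sketch, assuming an existing client and event handler; the ConfigMap resource type, namespace, class name, and one-hour interval are illustrative only:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

/** Hypothetical helper that re-creates an informer on a fixed schedule. */
public class PeriodicInformerRestart {

  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final AtomicReference<SharedIndexInformer<ConfigMap>> current = new AtomicReference<>();

  public void start(KubernetesClient client, ResourceEventHandler<ConfigMap> handler) {
    Runnable restart = () -> {
      SharedIndexInformer<ConfigMap> previous = current.getAndSet(null);
      if (previous != null) {
        previous.stop(); // tear down the possibly stalled watch
      }
      current.set(client.configMaps()
          .inNamespace("flink-system") // hypothetical namespace
          .inform(handler)); // fresh list + watch
    };
    restart.run(); // initial start
    scheduler.scheduleAtFixedRate(restart, 1, 1, TimeUnit.HOURS); // hourly reset
  }
}
```

This is strictly a mitigation; as noted above, it does not address the underlying api-server behaviour.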

@manusa
Member

manusa commented Feb 20, 2023

#4675 should take care of this; as soon as 6.5.0 is available, this issue should be solved. Tomorrow's SNAPSHOT build should contain the fix too.

@manusa manusa added this to the 6.5.0 milestone Feb 20, 2023
@gyfora
Author

gyfora commented Feb 20, 2023

Great! Thank you very much for taking care of this @shawkins @manusa

@manusa
Member

manusa commented Mar 11, 2023

v6.5.0 was just released. Could you please confirm whether this is working now so we can close the issue?

@manusa manusa closed this as completed Mar 15, 2023
@MichaelVoegele

Hello @gyfora
Have you already used v6.5.x and can you confirm that this issue has been fixed?
I am using v6.1.1 and I am experiencing the same issue. I have a simple operator watching a specific kind of ConfigMap (with a specific label), and the informers stop receiving events after some time without any obvious reason (nothing in the logs / no exceptions / ...). It happens very rarely: we have about 15 kube clusters with the operator deployed to each of them; sometimes the operators run for weeks without issues, then suddenly stop receiving events in some cluster, and the pod needs to be restarted.
Thanks in advance for a confirmation.
Regards
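
For context, a minimal sketch of how a label-filtered ConfigMap informer like the one described above can be set up with the fabric8 client; the label key/value are hypothetical placeholders, not the actual selector used in that operator:

```java
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class LabelledConfigMapInformer {

  public static void main(String[] args) {
    KubernetesClient client = new KubernetesClientBuilder().build();

    SharedIndexInformer<ConfigMap> informer = client.configMaps()
        .inAnyNamespace()
        .withLabel("example.com/watched", "true") // hypothetical label selector
        .inform(new ResourceEventHandler<ConfigMap>() {
          @Override
          public void onAdd(ConfigMap cm) {
            System.out.println("ADD " + cm.getMetadata().getName());
          }

          @Override
          public void onUpdate(ConfigMap oldCm, ConfigMap newCm) {
            System.out.println("UPDATE " + newCm.getMetadata().getName());
          }

          @Override
          public void onDelete(ConfigMap cm, boolean deletedFinalStateUnknown) {
            System.out.println("DELETE " + cm.getMetadata().getName());
          }
        });

    // As discussed earlier in this thread, the stopped() future signals when the
    // informer terminates, which helps detect a silent shutdown.
    informer.stopped().whenComplete((unused, t) ->
        System.err.println("ConfigMap informer stopped" + (t != null ? " exceptionally: " + t : "")));
  }
}
```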

@gyfora
Author

gyfora commented Mar 23, 2023

We are in the process of upgrading to the newer version, @MichaelVoegele, so we cannot confirm yet. It may be best if you try it yourself if you can.

@MichaelVoegele

Yep @gyfora, I just decided to do that and use v6.5.1. Cool, looking forward to seeing this fixed :-). Thanks, guys, for your awesome work on this fabric8 client!
