
Umbrella issue for handling Informer Permission Errors and Probes #1422

Closed
3 tasks done
csviri opened this issue Aug 25, 2022 · 18 comments
Assignees: csviri
Labels: feature, kind/feature (Categorizes issue or PR as related to a new feature.)

Comments

@csviri
Collaborator

csviri commented Aug 25, 2022

This issue covers several related issues and aims to come up with a design for how to handle them, see:

Summary

Informers by definition cannot work if the related RBAC is not configured or is misconfigured:

  1. either when an operator starts up and the proper permissions are missing, or
  2. at runtime, when some permissions are revoked.

This is true for informers of both primary and secondary resources.

When a permission is revoked, the informer (from the fabric8 client) will retry a few times but eventually just stops. (TODO: verify this behavior, and also its configurability.)

Note, however, that the framework supports dynamically changing the watched namespaces, so it can also happen that a new namespace is added for which there are no permissions. It might be desirable to keep the operator functioning on the previously watched namespaces.

There are multiple ways to solve this:

  1. Just stop/restart the operator when an informer does not start properly.
  2. Try to reconnect the informer (indefinitely).
  3. Expose metrics regarding the health of informers so readiness probes can be added that eventually kill the operator (see the sketch after this list).
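
A minimal sketch of what option 3 could look like on the operator side, assuming a hypothetical `informersHealthy()` hook backed by the framework's informer health state; a liveness or readiness probe pointed at this endpoint would then restart the pod once it starts returning 503:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;

public class InformerHealthEndpoint {

  // Hypothetical hook: would be backed by the framework's informer health metrics.
  static boolean informersHealthy() {
    return true;
  }

  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
    server.createContext("/healthz", exchange -> {
      boolean healthy = informersHealthy();
      byte[] body = (healthy ? "ok" : "informer down").getBytes();
      // 200 keeps the pod alive; 503 lets the kubelet restart it via the liveness probe.
      exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
      exchange.getResponseBody().write(body);
      exchange.close();
    });
    server.start();
  }
}
```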

@andreaTP proposed that having a callback function might be useful too.

@csviri csviri self-assigned this Aug 25, 2022
@csviri csviri changed the title from Umbrella issue for handling Informer Permission/Connectivity Errors and Probes to Umbrella issue for handling Informer Permission Errors and Probes Aug 25, 2022
@csviri csviri added the kind/feature, feature and needs-discussion (Issue needs to be discussed more before working on it) labels Aug 25, 2022
@csviri
Collaborator Author

csviri commented Aug 25, 2022

After thinking more about this issue: there is a serious problem when an informer is not working. The caches are not populated with fresh resources, so until the RBAC issue is solved they can be inconsistent. For example, a new primary resource tries to create a secondary resource on reconciliation; if the informer of that secondary resource is not working, the cache will not know about the created resource, so the reconciler will try to recreate it over and over again.
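
To make the stale-cache problem concrete, here is a hedged sketch in the style of a JOSDK reconciler; a ConfigMap is used as a stand-in primary resource and `desiredDeployment` is an illustrative helper, just to keep the example self-contained:

```java
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.api.model.apps.DeploymentBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.javaoperatorsdk.operator.api.reconciler.Context;
import io.javaoperatorsdk.operator.api.reconciler.Reconciler;
import io.javaoperatorsdk.operator.api.reconciler.UpdateControl;

public class StaleCacheReconciler implements Reconciler<ConfigMap> {

  private final KubernetesClient client;

  public StaleCacheReconciler(KubernetesClient client) {
    this.client = client;
  }

  @Override
  public UpdateControl<ConfigMap> reconcile(ConfigMap primary, Context<ConfigMap> context) {
    // getSecondaryResource() reads from the informer cache, not from the API server.
    // If the Deployment informer is not running, this stays empty even after the
    // Deployment has been created, so every reconciliation attempts the create again.
    if (context.getSecondaryResource(Deployment.class).isEmpty()) {
      client.resource(desiredDeployment(primary)).create();
    }
    return UpdateControl.noUpdate();
  }

  private Deployment desiredDeployment(ConfigMap primary) {
    // Illustrative desired state; details omitted.
    return new DeploymentBuilder()
        .withNewMetadata()
          .withName(primary.getMetadata().getName())
          .withNamespace(primary.getMetadata().getNamespace())
        .endMetadata()
        .build();
  }
}
```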

Therefore, in a first phase, I would lean towards just restarting the operator if there is such an error with the informer, maybe allowing users to configure some retry options for the informer before it stops.

There could be a feature flag to not automatically shut down the operator but delegate this to probes and expose the health metrics of informers, as mentioned in point 3.

@csviri
Collaborator Author

csviri commented Aug 25, 2022

For the case of dynamically adding watched namespaces, there could be some optimizations, like reconciling a resource only when all the informers watching resources from its namespace are healthy. But these would be very specific cases; it's questionable whether such a feature is worth the effort.

@gyfora

gyfora commented Aug 25, 2022

I think when the informer for a specific namespace cannot be created, we need to remove that namespace from the config and clear the caches.

The general problem with the restarting/current behaviour on errors is that malicious users (or simple user error) can cripple the operator for everyone in the cluster. If a user deletes a rolebinding in their own namespace, the operator will go into a restart loop and there is nothing anyone else can do to fix it.

The whole cluster and its resources are affected this way, and a restart won't do any good.

@csviri
Collaborator Author

csviri commented Aug 25, 2022

@gyfora
So you are proposing something like: if a new namespace is added dynamically at runtime, and there is an informer (among possibly more new informers) that cannot be started because of permission issues, this operation should basically reject the new namespace (throw an exception; we could check upfront with the k8s API whether the permission exists)?

How should we behave on startup? I guess we should also do these checks on the watched namespaces.
So if I understand correctly, you are saying that we should kind of validate and/or pre-process the configuration?

This could basically be done just for informers which follow the general namespace configuration. For the others (informers not following the namespaces in the primary controller config), I guess we should basically stop the operator too if they fail.

Do you agree that in the case when a permission is revoked, so that from that point on one of the informers won't work and therefore the caches won't function properly, we should restart the operator?
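
A hedged sketch of the upfront permission check mentioned above, using a fabric8 `SelfSubjectAccessReview`; the namespace and resource values are illustrative, and the exact authorization DSL may differ between client versions:

```java
import io.fabric8.kubernetes.api.model.authorization.v1.SelfSubjectAccessReview;
import io.fabric8.kubernetes.api.model.authorization.v1.SelfSubjectAccessReviewBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class PermissionCheck {

  // Asks the API server whether the operator's service account may watch the given
  // resource in the given namespace, before an informer is registered for it.
  static boolean canWatch(KubernetesClient client, String namespace, String resource) {
    SelfSubjectAccessReview review = new SelfSubjectAccessReviewBuilder()
        .withNewSpec()
          .withNewResourceAttributes()
            .withNamespace(namespace)
            .withVerb("watch")
            .withResource(resource)
          .endResourceAttributes()
        .endSpec()
        .build();
    SelfSubjectAccessReview result =
        client.authorization().v1().selfSubjectAccessReview().create(review);
    return result.getStatus() != null && Boolean.TRUE.equals(result.getStatus().getAllowed());
  }

  public static void main(String[] args) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // Illustrative check before adding "userNs1" to the watched namespaces.
      if (!canWatch(client, "userNs1", "configmaps")) {
        System.err.println("Missing watch permission on configmaps in userNs1, rejecting namespace");
      }
    }
  }
}
```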

@csviri
Collaborator Author

csviri commented Aug 25, 2022

However, validating the namespace list and the related permissions, and filtering the namespace list, could be handled outside of the operator.

@gyfora

gyfora commented Aug 25, 2022

No, I am not really proposing anything about newly added namespaces.

What I am proposing is that if at any time any watched namespace becomes inaccessible, we should be able to remove it from the list, clean up the caches, and continue with the rest.

I think the best would be to make this at least a configurable option (the default could be fail-operator/restart, as you proposed).

Basically we need to address the following situation robustly:

  1. watchedNamespaces: userNs1, userNs2
  2. User 1 decides to revoke permissions for userNs1.
  3. User 2, using userNs2, should not be affected by the action of User 1, as they are not aware of each other.
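
Since the framework already supports dynamically changing watched namespaces (as noted in the issue description), the remediation for this scenario could look roughly like the sketch below; the `changeNamespaces` call follows JOSDK's dynamic namespace support, while the surrounding wiring and the decision of when to call it are assumed:

```java
import java.util.Set;

import io.javaoperatorsdk.operator.RegisteredController;

public class NamespaceEviction {

  // Hypothetical reaction to userNs1 becoming inaccessible: shrink the watched set so
  // userNs2 keeps working instead of the whole operator going into a restart loop.
  static void evictBrokenNamespace(RegisteredController<?> controller) {
    controller.changeNamespaces(Set.of("userNs2"));
  }
}
```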

@gyfora

gyfora commented Aug 25, 2022

Unless this problem is addressed, there is a big risk for any operator deployment that watches multiple independent user namespaces. One user can basically send the whole operator into a restart loop; this is not something that many production platforms can afford.

@gyfora

gyfora commented Aug 25, 2022

In some cases it is better to not watch a namespace anymore than to affect other namespaces being watched.

@gyfora

gyfora commented Aug 25, 2022

However, validating the namespace list and the related permissions, and filtering the namespace list, could be handled outside of the operator.

Yes, I guess if there is a liveness probe to restart, then we as users of the framework could validate the permissions ourselves. That would definitely be a step forward.

@csviri
Collaborator Author

csviri commented Aug 25, 2022

ok I understand what you mean. Good point.

However, consider a situation: somebody accidentally misconfigures the RBAC and revokes the permission of just a single resource in a namespace, while the other resources and related informers still work in that namespace. Should we remove all the informers? If not, then we have the problem mentioned above with the stale caches.

But it can also happen that that particular namespace is a special one that holds some configs for the operator (not a team namespace).

So what I mean is that there is currently no such thing in the model of the operator as informers that are bound together, meaning that if one of them is stopped, the other ones should be stopped too.

There is this flag which means that informers should follow the namespace changes of the controller. Theoretically we could say:
stop all the informers that follow the controller namespaces if one of them fails (on a namespace). (Or just stop propagating events from them, and try to reconnect the one that is failing. Start all of them again once all are up and running.)

This would make sense to me for the multi-namespace mode. (Not for the whole-cluster or single-namespace mode; there I think we still need to stop.)

@csviri
Collaborator Author

csviri commented Aug 25, 2022

The other alternative, to solve both issues: just retry starting the informer with some exponential backoff, and log a warning. The assumption is that the resources won't be reconciled anyway, since there won't be new events from that namespace.

And admins will eventually remove the revoked namespace from the configuration.
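
A minimal sketch of the retry-with-backoff idea; `startInformer` is a hypothetical hook standing in for whatever actually (re)starts the failing informer:

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class InformerRestartWithBackoff {

  // Retries the start action with exponential backoff, logging a warning on each failure.
  // Other informers keep running while this one keeps retrying.
  static void retryWithBackoff(BooleanSupplier startInformer) throws InterruptedException {
    Duration delay = Duration.ofSeconds(1);
    Duration maxDelay = Duration.ofMinutes(2);
    while (!startInformer.getAsBoolean()) {
      System.err.println("WARN informer failed to start, retrying in " + delay.toSeconds() + "s");
      Thread.sleep(delay.toMillis());
      delay = delay.multipliedBy(2);
      if (delay.compareTo(maxDelay) > 0) {
        delay = maxDelay;
      }
    }
  }
}
```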

@gyfora

gyfora commented Aug 25, 2022

Yes I think the exponential backoff + warning is a good approach. In that case other informers/namespaces would not be affected.

@csviri
Collaborator Author

csviri commented Aug 25, 2022

Yes, that will cover this case in a relatively simple way.

But I would put a feature flag there; by default, the operator will exit if one of the informers is stopped.

And later we can improve this with metrics and readiness probes.

cc @andreaTP

@gyfora

gyfora commented Aug 25, 2022

sounds great :)

@everettraven

As discussed during the community meeting, here is all the PoC caching stuff I have been working on for how Go operators can handle dynamically changing permissions via RBAC:

@andreaTP
Collaborator

andreaTP commented Sep 5, 2022

Another edge case that is relevant:
fabric8io/kubernetes-client#4388

@csviri
Collaborator Author

csviri commented Oct 25, 2022

see related issue: fabric8io/kubernetes-client#4527
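
For reference, a heavily hedged sketch of what wiring an informer exception hook from the fabric8 client could look like; the `exceptionHandler` name, its signature, and the surrounding calls are assumptions based on the linked discussions, so the actual API may differ:

```java
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class InformerWithExceptionHandler {

  // Assumed hook: the handler decides whether the informer should keep retrying
  // instead of stopping after a (permission) failure.
  static SharedIndexInformer<ConfigMap> startConfigMapInformer(KubernetesClient client) {
    SharedIndexInformer<ConfigMap> informer = client.configMaps()
        .inNamespace("userNs1") // illustrative namespace
        .runnableInformer(0)
        .exceptionHandler((isStarted, throwable) -> {
          System.err.println("WARN informer error (started=" + isStarted + "): " + throwable);
          return true; // keep retrying rather than letting the informer stop
        });
    informer.run();
    return informer;
  }
}
```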

@andreaTP
Collaborator

🎉 great work!
