
Umbrella issue for handling Informer Permission Errors and Probes #1422

Closed
3 tasks done
csviri opened this issue Aug 25, 2022 · 18 comments
Assignees: csviri
Labels: feature, kind/feature (Categorizes issue or PR as related to a new feature.)

Comments

@csviri
Collaborator

csviri commented Aug 25, 2022

This issue covers several related issues and aims to come up with a design for how to handle them, see:

Summary

Informers by definition cannot work if the related RBAC is not configured or is misconfigured:

  1. either when an operator starts up and the proper permissions are missing, or
  2. at runtime, when some permissions are revoked.

This is true for informers of both primary and secondary resources.

When a permission is revoked, the informer (from the fabric8 client) will retry a few times but eventually just stops. (TODO: verify this behavior, and also its configurability.)

Note, however, that the framework supports dynamically changing the watched namespaces, so it can also happen that a new namespace is added for which there are no permissions. It might be desirable to keep the operator functioning on the previously watched namespaces.

There are multiple ways to solve this:

  1. Just stop/restart the operator when an informer does not start properly.
  2. Try to reconnect the informer (indefinitely).
  3. Expose metrics regarding the health of informers so readiness probes can be added that eventually kill the operator (see the sketch after this list).
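
A minimal sketch of what option 3 could look like on the operator side, assuming a hypothetical `informersHealthy()` hook backed by the framework's informer health state; a liveness or readiness probe pointed at this endpoint would then restart the pod once it starts returning 503:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;

public class InformerHealthEndpoint {

  // Hypothetical hook: would be backed by the framework's informer health metrics.
  static boolean informersHealthy() {
    return true;
  }

  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
    server.createContext("/healthz", exchange -> {
      boolean healthy = informersHealthy();
      byte[] body = (healthy ? "ok" : "informer down").getBytes();
      // 200 keeps the pod alive; 503 lets the kubelet restart it via the liveness probe.
      exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
      exchange.getResponseBody().write(body);
      exchange.close();
    });
    server.start();
  }
}
```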

@andreaTP proposed that having a callback function might be useful too.

@csviri csviri self-assigned this Aug 25, 2022
@csviri csviri changed the title from Umbrella issue for handling Informer Permission/Connectivity Errors and Probes to Umbrella issue for handling Informer Permission Errors and Probes Aug 25, 2022
@csviri csviri added the kind/feature, feature and needs-discussion (Issue needs to be discussed more before working on it) labels Aug 25, 2022
@csviri
Collaborator Author

csviri commented Aug 25, 2022

After thinking more about this issue: there is a serious problem when an informer is not working. The caches are not populated with fresh resources, so until the RBAC issue is solved they can be inconsistent. For example, a new primary resource tries to create a secondary resource on reconciliation; if the informer of that secondary resource is not working, the cache will not know about the created resource, so the reconciler will try to recreate it over and over again.
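
To make the stale-cache problem concrete, here is a hedged sketch in the style of a JOSDK reconciler; a ConfigMap is used as a stand-in primary resource and `desiredDeployment` is an illustrative helper, just to keep the example self-contained:

```java
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.api.model.apps.DeploymentBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.javaoperatorsdk.operator.api.reconciler.Context;
import io.javaoperatorsdk.operator.api.reconciler.Reconciler;
import io.javaoperatorsdk.operator.api.reconciler.UpdateControl;

public class StaleCacheReconciler implements Reconciler<ConfigMap> {

  private final KubernetesClient client;

  public StaleCacheReconciler(KubernetesClient client) {
    this.client = client;
  }

  @Override
  public UpdateControl<ConfigMap> reconcile(ConfigMap primary, Context<ConfigMap> context) {
    // getSecondaryResource() reads from the informer cache, not from the API server.
    // If the Deployment informer is not running, this stays empty even after the
    // Deployment has been created, so every reconciliation attempts the create again.
    if (context.getSecondaryResource(Deployment.class).isEmpty()) {
      client.resource(desiredDeployment(primary)).create();
    }
    return UpdateControl.noUpdate();
  }

  private Deployment desiredDeployment(ConfigMap primary) {
    // Illustrative desired state; details omitted.
    return new DeploymentBuilder()
        .withNewMetadata()
          .withName(primary.getMetadata().getName())
          .withNamespace(primary.getMetadata().getNamespace())
        .endMetadata()
        .build();
  }
}
```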

Therefore, in a first phase, I would lean towards just restarting the operator if there is such an error with the informer, maybe allowing users to configure some retry options for the informer before it stops.

There could be a feature flag to not automatically shut down the operator but delegate this to probes and expose the health metrics of informers, as mentioned in point 3.

@csviri
Collaborator Author

csviri commented Aug 25, 2022

For the case of dynamically adding watched namespaces, there could be some optimizations, like reconciling a resource only when all the informers watching resources from its namespace are healthy. But these would be very specific cases; it's questionable whether such a feature is worth the effort.

@gyfora

gyfora commented Aug 25, 2022

I think when the informer for a specific namespace cannot be created, we need to remove that namespace from the config and clear the caches.

The general problem with the restarting/current behaviour on errors is that malicious users (or simple user error) can cripple the operator for everyone in the cluster. If a user deletes a rolebinding in their own namespace, the operator will go into a restart loop and there is nothing anyone else can do to fix it.

The whole cluster and its resources are affected this way, and a restart won't do any good.

@csviri
Collaborator Author

csviri commented Aug 25, 2022

@gyfora
So you are proposing something like: if a new namespace is added dynamically at runtime, and there is an informer (among possibly more new informers) that cannot be started because of permission issues, this operation should basically reject the new namespace (throw an exception; we could check upfront with the k8s API whether the permission exists)?

How should we behave on startup? I guess we should also do these checks on the watched namespaces.
So if I understand correctly, you are saying that we should kind of validate and/or pre-process the configuration?

This could basically be done just for informers which follow the general namespace configuration. For the others (informers not following the namespaces in the primary controller config), I guess we should basically stop the operator too if they fail.

Do you agree that in the case when a permission is revoked, so that from that point on one of the informers won't work and therefore the caches won't function properly, we should restart the operator?
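
A hedged sketch of the upfront permission check mentioned above, using a fabric8 `SelfSubjectAccessReview`; the namespace and resource values are illustrative, and the exact authorization DSL may differ between client versions:

```java
import io.fabric8.kubernetes.api.model.authorization.v1.SelfSubjectAccessReview;
import io.fabric8.kubernetes.api.model.authorization.v1.SelfSubjectAccessReviewBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class PermissionCheck {

  // Asks the API server whether the operator's service account may watch the given
  // resource in the given namespace, before an informer is registered for it.
  static boolean canWatch(KubernetesClient client, String namespace, String resource) {
    SelfSubjectAccessReview review = new SelfSubjectAccessReviewBuilder()
        .withNewSpec()
          .withNewResourceAttributes()
            .withNamespace(namespace)
            .withVerb("watch")
            .withResource(resource)
          .endResourceAttributes()
        .endSpec()
        .build();
    SelfSubjectAccessReview result =
        client.authorization().v1().selfSubjectAccessReview().create(review);
    return result.getStatus() != null && Boolean.TRUE.equals(result.getStatus().getAllowed());
  }

  public static void main(String[] args) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // Illustrative check before adding "userNs1" to the watched namespaces.
      if (!canWatch(client, "userNs1", "configmaps")) {
        System.err.println("Missing watch permission on configmaps in userNs1, rejecting namespace");
      }
    }
  }
}
```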

@csviri
Collaborator Author

csviri commented Aug 25, 2022

However, validating the namespace list and the related permissions, and filtering the namespace list, could be handled outside of the operator.

@gyfora

gyfora commented Aug 25, 2022

No, I am not really proposing anything about newly added namespaces.

What I am proposing is that if at any time any watched namespace becomes inaccessible, we should be able to remove it from the list, clean up the caches, and continue with the rest.

I think the best would be to make this at least a configurable option (the default could be fail-operator/restart, as you proposed).

Basically we need to address the following situation robustly:

  1. watchedNamespaces: userNs1, userNs2
  2. User 1 decides to revoke permissions for userNs1.
  3. User 2, using userNs2, should not be affected by the action of User 1, as they are not aware of each other.
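
Since the framework already supports dynamically changing watched namespaces (as noted in the issue description), the remediation for this scenario could look roughly like the sketch below; the `changeNamespaces` call follows JOSDK's dynamic namespace support, while the surrounding wiring and the decision of when to call it are assumed:

```java
import java.util.Set;

import io.javaoperatorsdk.operator.RegisteredController;

public class NamespaceEviction {

  // Hypothetical reaction to userNs1 becoming inaccessible: shrink the watched set so
  // userNs2 keeps working instead of the whole operator going into a restart loop.
  static void evictBrokenNamespace(RegisteredController<?> controller) {
    controller.changeNamespaces(Set.of("userNs2"));
  }
}
```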

@gyfora

gyfora commented Aug 25, 2022

Unless this problem is addressed, there is a big risk for any operator deployment that watches multiple independent user namespaces. One user can basically send the whole operator into a restart loop; this is not something that many production platforms can afford.

@gyfora

gyfora commented Aug 25, 2022

In some cases it is better to not watch a namespace anymore than to affect other namespaces being watched.

@gyfora

gyfora commented Aug 25, 2022

However, validating the namespace list and the related permissions, and filtering the namespace list, could be handled outside of the operator.

Yes, I guess if there is a liveness probe to restart, then we as users of the framework could validate the permissions ourselves. That would definitely be a step forward.

@csviri
Collaborator Author

csviri commented Aug 25, 2022

ok I understand what you mean. Good point.

However, consider a situation: somebody accidentally misconfigures the RBAC and revokes the permission of just a single resource in a namespace, while the other resources and related informers still work in that namespace. Should we remove all the informers? If not, then we have the problem mentioned above with the stale caches.

But it can also happen that that particular namespace is a special one that holds some configs for the operator (not a team namespace).

So what I mean is that there is currently no such thing in the model of the operator as informers that are bound together, meaning that if one of them is stopped, the other ones should be stopped too.

There is this flag which means that informers should follow the namespace changes of the controller. Theoretically we could say:
stop all the informers that follow the controller namespaces if one of them fails (on a namespace). (Or just stop propagating events from them, and try to reconnect the one that is failing. Start all of them again once all are up and running.)

This would make sense to me for the multi-namespace mode. (Not for the whole-cluster or single-namespace mode; there I think we still need to stop.)

@csviri
Collaborator Author

csviri commented Aug 25, 2022

The other alternative, to solve both issues: just retry starting the informer with some exponential backoff, and log a warning. The assumption is that the resources won't be reconciled anyway, since there won't be new events from that namespace.

And admins will eventually remove the revoked namespace from the configuration.
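
A minimal sketch of the retry-with-backoff idea; `startInformer` is a hypothetical hook standing in for whatever actually (re)starts the failing informer:

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class InformerRestartWithBackoff {

  // Retries the start action with exponential backoff, logging a warning on each failure.
  // Other informers keep running while this one keeps retrying.
  static void retryWithBackoff(BooleanSupplier startInformer) throws InterruptedException {
    Duration delay = Duration.ofSeconds(1);
    Duration maxDelay = Duration.ofMinutes(2);
    while (!startInformer.getAsBoolean()) {
      System.err.println("WARN informer failed to start, retrying in " + delay.toSeconds() + "s");
      Thread.sleep(delay.toMillis());
      delay = delay.multipliedBy(2);
      if (delay.compareTo(maxDelay) > 0) {
        delay = maxDelay;
      }
    }
  }
}
```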

@gyfora

gyfora commented Aug 25, 2022

Yes I think the exponential backoff + warning is a good approach. In that case other informers/namespaces would not be affected.

@csviri
Collaborator Author

csviri commented Aug 25, 2022

Yes, that will cover this case in a relatively simple way.

But I would put a feature flag there; by default, the operator will exit if one of the informers is stopped.

And later we can improve this with metrics and readiness probes.

cc @andreaTP

@gyfora

gyfora commented Aug 25, 2022

sounds great :)

@everettraven

As discussed during the community meeting, here is all the PoC caching stuff I have been working on for how Go operators can handle dynamically changing permissions via RBAC:

@andreaTP
Collaborator

andreaTP commented Sep 5, 2022

Another edge case that is relevant:
fabric8io/kubernetes-client#4388

@csviri
Collaborator Author

csviri commented Oct 25, 2022

see related issue: fabric8io/kubernetes-client#4527
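
For reference, a heavily hedged sketch of what wiring an informer exception hook from the fabric8 client could look like; the `exceptionHandler` name, its signature, and the surrounding calls are assumptions based on the linked discussions, so the actual API may differ:

```java
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class InformerWithExceptionHandler {

  // Assumed hook: the handler decides whether the informer should keep retrying
  // instead of stopping after a (permission) failure.
  static SharedIndexInformer<ConfigMap> startConfigMapInformer(KubernetesClient client) {
    SharedIndexInformer<ConfigMap> informer = client.configMaps()
        .inNamespace("userNs1") // illustrative namespace
        .runnableInformer(0)
        .exceptionHandler((isStarted, throwable) -> {
          System.err.println("WARN informer error (started=" + isStarted + "): " + throwable);
          return true; // keep retrying rather than letting the informer stop
        });
    informer.run();
    return informer;
  }
}
```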

@andreaTP
Collaborator

🎉 great work!
