
feat: add possibility to add an exception handler to Watchers #4365

Merged
merged 10 commits into master from add-exception-handler on Sep 8, 2022

Conversation

metacosm
Collaborator

@metacosm metacosm commented Aug 30, 2022

Description

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Chore (non-breaking change which doesn't affect codebase;
    test, version modification, documentation, etc.)

Checklist

  • Code contributed by me aligns with current project license: Apache 2.0
  • I added a CHANGELOG entry regarding this change
  • I have implemented unit tests to cover my changes
  • I have added/updated the javadocs and other documentation accordingly
  • No new bugs, code smells, etc. in SonarCloud report
  • I tested my code in Kubernetes
  • I tested my code in OpenShift

@metacosm metacosm added this to the 6.1.0 milestone Aug 30, 2022
@metacosm metacosm self-assigned this Aug 30, 2022
@shawkins
Contributor

@metacosm I think it makes sense, but it may not match the go watch behavior. It looks like for all error scenarios in a watch they will create an error event - https://github.com/kubernetes/client-go/blob/master/tools/watch/retrywatcher.go#L160 - and possibly terminate the watch. That isn't a great match for us, as we are using KubernetesResource rather than a raw type for the object.

Does it make sense to try to reuse the error WatchEvent for this error handling?

The error event handling in the go reflector for an unknown error just seems to be to log and retry: https://sourcegraph.com/github.com/kubernetes/client-go/-/blob/tools/cache/reflector.go?L347 - so it doesn't currently handle this scenario very well either.

@manusa
Member

manusa commented Aug 31, 2022

We discussed this internally in yesterday's call.

The main drawback with the currently proposed approach is that we already have:

  • Watcher.Action.ERROR event that might be consumed by the Watcher.eventReceived callback method
  • Watcher.onClose(WatcherException cause) that is called whenever the watcher can't be started or an HTTP_GONE status is received

The new Exception Handler would add a third option, making it hard for users to understand what each of these callbacks or points of entry might entail (e.g. would onException also be called when onClose(Exception) is called? would eventReceived(ERROR) trigger an onException too? and so on).

IMHO this is bad for the overall Informer + Watcher UX.
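For reference, a minimal sketch of those two existing entry points on the public Watcher interface (the resource type and class name are just for illustration):

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

// Illustrative sketch of the two existing error-related entry points.
class ExistingEntryPoints implements Watcher<Pod> {
  @Override
  public void eventReceived(Action action, Pod resource) {
    if (action == Action.ERROR) {
      // (1) ERROR events consumed through the eventReceived callback
    }
  }

  @Override
  public void onClose(WatcherException cause) {
    // (2) called whenever the watcher can't be started or HTTP_GONE is received
  }
}
```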

Does it make sense to try to reuse the error WatchEvent for this error handling?

I was considering two other options besides the one proposed by Chris (none of which makes me happy TBH)

  1. Add a new Action.CLIENT_ERROR so that we can call Watcher.eventReceived with this action whenever there is a client-side problem when processing the received events. This seems to align with client-go, from what you said in your comment.
  2. Add a configuration flag (disabled by default) so that when enabled, the watcher is closed (onClose(Exception)) in case one of these Exceptions is produced. I think Chris and Andrea didn't like this one very much.

As you know, I'm inclined to provide whatever client-go is doing for UX consistency purposes.

@manusa manusa modified the milestones: 6.1.0, 6.2.0 Aug 31, 2022
@metacosm
Collaborator Author

Does it make sense to try to reuse the error WatchEvent for this error handling?

I was considering two other options besides the one proposed by Chris (none of which makes me happy TBH)

  1. Add a new Action.CLIENT_ERROR so that we can call Watcher.eventReceived with this action whenever there is a client-side problem when processing the received events. This seems to align with client-go, from what you said in your comment.

This might be the best option… How should we proceed?

  2. Add a configuration flag (disabled by default) so that when enabled, the watcher is closed (onClose(Exception)) in case one of these Exceptions is produced. I think Chris and Andrea didn't like this one very much.

The problem with this option is that it doesn't provide any flexibility: the calling code cannot decide what to do apart from crashing or not, which isn't terribly useful most of the time, especially in user code where people might not have direct access to the informers/watchers.

@shawkins
Contributor

@manusa @metacosm

This might be the best option… How should we proceed?

Looking things over some more: in all of the cases in which the go client creates its own Error event, it also terminates the watch. So I think it would make sense for us to just terminate the watch and call onClose with the relevant exception instead.

When the informer sees a non-HTTP_GONE onClose exception, it will stop.

If we add an additional handler for informers, that would be an alternative to monitoring the isWatching method, so that you get a callback if there is an abnormal termination. That notification would need to happen here:

log.warn("Watch closing with exception for {}", Reflector.this, exception);

The resource too old exceptions are recoverable, so we likely don't need to notify the handler.

Also, the consistency with the go client seems cumbersome: the go client doesn't introduce another enum type, so you'd probably just reuse Watcher.Action.ERROR. The event object would then be a Status, which matches the docs at https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#watchevent-v1-meta and what the go client is doing.
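As a rough sketch of that idea from the consumer's side (hypothetical: delivering a Status as the event object for ERROR actions is the proposal being discussed, not necessarily current behavior):

```java
import io.fabric8.kubernetes.api.model.KubernetesResource;
import io.fabric8.kubernetes.api.model.Status;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

// Hypothetical sketch of the proposal: client-side failures surfaced as ERROR
// watch events whose object is a Status, per the WatchEvent-v1-meta docs.
class ErrorEventWatcher implements Watcher<KubernetesResource> {
  @Override
  public void eventReceived(Action action, KubernetesResource resource) {
    if (action == Action.ERROR && resource instanceof Status) {
      Status status = (Status) resource;
      // inspect status.getReason() / status.getMessage() and decide how to react
    }
  }

  @Override
  public void onClose(WatcherException cause) {
    // terminal failures still end up here
  }
}
```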

@metacosm
Collaborator Author

@manusa @metacosm

This might be the best option… How should we proceed?

Looking things over some more: in all of the cases in which the go client creates its own Error event, it also terminates the watch. So I think it would make sense for us to just terminate the watch and call onClose with the relevant exception instead.

What are the implications, though? The scenario we're trying to solve is an edge case where a serialisation issue might occur with one CR instance. If we stop the watcher, what does it mean for the informer? Will it need to get restarted (and resynched) knowing that the same CR might crash it again? This particular case should only really happen during development (though we all know that issues that are supposed to never happen in prod have a tendency to do so anyway 😅), so this is more about the developer experience than anything else. The point is to make it easier to detect and, presumably, fix such issues without having to sift through very long logs, where it's easy to miss the problem when you don't know what you're looking for.

When the informer sees a non-HTTP_GONE onClose exception, it will stop.

If we add an additional handler for informers, that would be an alternative to monitoring the isWatching method, so that you get a callback if there is an abnormal termination. That notification would need to happen here.

The problem with isWatching is that you'd need (from a calling code perspective) to constantly monitor it to know that something wrong happened. Or am I missing something?
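To illustrate the concern: without a callback, the calling code has to keep polling, roughly like this (a sketch only, using the isWatching method mentioned above on a started informer):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

// Sketch of the polling approach being questioned: the caller repeatedly
// checks isWatching() to notice that the watch has stopped abnormally.
class WatchMonitor {
  static void monitor(SharedIndexInformer<?> informer) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      if (!informer.isWatching()) {
        // something went wrong; log, alert, or restart as appropriate
      }
    }, 0, 10, TimeUnit.SECONDS);
  }
}
```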

Also, the consistency with the go client seems cumbersome: the go client doesn't introduce another enum type, so you'd probably just reuse Watcher.Action.ERROR. The event object would then be a Status, which matches the docs at https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.19/#watchevent-v1-meta and what the go client is doing.

I personally don't think that we should care all that much about what client-go is doing… Trying to somewhat follow what it does, but not quite, actually seems worse to me than doing something completely different, because then you get surprised when the behavior differs even though it looks like it should be the same…

@shawkins
Contributor

shawkins commented Aug 31, 2022

The scenario we're trying to solve is an edge case where a serialisation issue might occur with one CR instance. If we stop the watcher, what does it mean for the informer? Will it need to get restarted (and resynched) knowing that the same CR might crash it again?

No, when the informer sees a non-HTTP_GONE onClose exception, it will stop.

The problem with isWatching is that you'd need (from a calling code perspective) to constantly monitor it to know that something wrong happened. Or am I missing something?

I'm not saying you have to rely on isWatching, just pointing out that if we add something new on top of the change to close the watch with a non-HTTP_GONE exception, it's an alternative callback to isWatching.

@metacosm
Collaborator Author

The scenario we're trying to solve is an edge case where a serialisation issue might occur with one CR instance. If we stop the watcher, what does it mean for the informer? Will it need to get restarted (and resynched) knowing that the same CR might crash it again?

No, when the informer sees a non-HTTP_GONE onClose exception, it will stop.

If the informer is not processing the other events then what is the point? The solution we're aiming for is to not disable the processing of CRs of a given type just because one instance is somehow causing an issue… If the informer attempts to restart just to get stopped again because of that one CR, that's a nice way to cause denial of service.

The problem with isWatching is that you'd need (from a calling code perspective) to constantly monitor it to know that something wrong happened. Or am I missing something?

I'm not saying you have to rely on isWatching, just pointing out that if we add something new on top of the change to close the watch with a non-HTTP_GONE exception, it's an alternative callback to isWatching.

Not sure I understand what you're trying to say. Maybe we should explore options using mock code so that we can look at how things would look from a user's perspective?

@csviri
Contributor

csviri commented Aug 31, 2022

This is probably related:
#4369

@shawkins
Contributor

If the informer is not processing the other events then what is the point? The solution we're aiming for is to not disable the processing of CRs of a given type just because one instance is somehow causing an issue… If the informer attempts to restart just to get stopped again because of that one CR, that's a nice way to cause denial of service.

Again the informer will not restart with what I'm describing.

Not sure I understand what you're trying to say. Maybe we should explore options using mock code so that we can look at how things would look from a user's perspective?

It should be clearer as add-exception-handler...shawkins:kubernetes-client:add-exception-handler

This is probably related: #4369
Have more configurable retry options (with a possibility to indefinitely retry) for this use case.

That could look a little different than the commit I just showed. Instead of void onWatchNonrecoverable(WatcherException e), it would be boolean onWatchNonrecoverable(WatcherException e), so that the handler can choose whether to shut down.
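A minimal sketch of what that could look like (the interface name, the example wiring, and the meaning of the boolean are assumptions; only the onWatchNonrecoverable signature comes from the comment above):

```java
import io.fabric8.kubernetes.client.WatcherException;

// Hypothetical handler sketched from the discussion above; the boolean return
// value is assumed to mean "keep retrying" (false would stop the informer).
@FunctionalInterface
interface NonrecoverableWatchHandler {
  boolean onWatchNonrecoverable(WatcherException e);
}

// Example: log the failure and keep the informer running.
class LoggingWatchHandler implements NonrecoverableWatchHandler {
  @Override
  public boolean onWatchNonrecoverable(WatcherException e) {
    System.err.println("watch failed: " + e.getMessage());
    return true; // assumed semantics in this sketch: true = keep retrying
  }
}
```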

@metacosm
Collaborator Author

If the informer is not processing the other events then what is the point? The solution we're aiming for is to not disable the processing of CRs of a given type just because one instance is somehow causing an issue… If the informer attempts to restart just to get stopped again because of that one CR, that's a nice way to cause denial of service.

Again the informer will not restart with what I'm describing.

Not sure I understand what you're trying to say. Maybe we should explore options using mock code so that we can look at how things would look from a user's perspective?

It should be clearer as add-exception-handler...shawkins:kubernetes-client:add-exception-handler

Thanks! 👀

This is probably related: #4369
Have more configurable retry options (with a possibility to indefinitely retry) for this use case.

That could look a little different than the commit I just showed. Instead of void onWatchNonrecoverable(WatcherException e), it would be boolean onWatchNonrecoverable(WatcherException e), so that the handler can choose whether to shut down.

Yep, the handler should have the opportunity to ask for a graceful shutdown of the watcher, indeed.

@shawkins
Contributor

shawkins commented Sep 1, 2022

The error event handling in the go reflector for an unknown error just seems to be to log and retry: https://sourcegraph.com/github.com/kubernetes/client-go/-/blob/tools/cache/reflector.go?L347 - so it doesn't currently handle this scenario very well either.

Looked again, there actually is one: https://sourcegraph.com/github.com/kubernetes/client-go@7ccf7b05af286664a658af15c8502a6066ae3288/-/blob/tools/cache/shared_informer.go?L183

But it's informational only - the informer will continue to retry using a backoff (which is not implemented for ours yet).

Another go implementation change looks to be that the initial informer start is no longer handled as a special case - that is, an exception during the initial list/watch won't prevent the informer from starting; rather, it will just proceed to looping with the backoff.

Also, I think switching to all async calls has introduced a regression for us - previously, if we failed during listSyncWatch, it was the WatchManager that would retry after catching the exception. Now, after a resource too old exception, the reflector will retry calling listSyncWatch only once - that means if we see a resource too old (which is now rarer due to bookmarks), then fail to list after the watch interval, and fail again, the informer / reflector will stop trying. So that needs to be addressed as well.
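For context, the backoff being referred to is essentially a growing delay between list/watch retries, capped at some maximum; a generic sketch (not the actual client-go or fabric8 implementation):

```java
import java.util.concurrent.TimeUnit;

// Generic sketch of retry-with-exponential-backoff, as referenced above.
// Not the actual client-go or fabric8 implementation.
class BackoffRetry {
  static void retryWithBackoff(Runnable listAndWatch) throws InterruptedException {
    long delayMillis = 1_000;            // initial delay
    final long maxDelayMillis = 30_000;  // cap
    while (true) {
      try {
        listAndWatch.run();              // list/watch until it fails
      } catch (RuntimeException e) {
        // log and fall through to the backoff sleep
      }
      TimeUnit.MILLISECONDS.sleep(delayMillis);
      delayMillis = Math.min(delayMillis * 2, maxDelayMillis);
    }
  }
}
```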

@csviri
Contributor

csviri commented Sep 2, 2022

But it's informational only - the informer will continue to retry using a backoff (which is not implemented for ours yet).

Another go implementation change looks to be that the initial informer start is no longer handled as a special case - that is, an exception during the initial list/watch won't prevent the informer from starting; rather, it will just proceed to looping with the backoff.

Note that this is exactly what we need in the other issue: #4369
So make it configurable when to retry (or not).

@csviri
Contributor

csviri commented Sep 2, 2022

@metacosm I don't see how this PR addresses the core problem of being able to process the raw resource when it cannot be deserialized.
The issue in JOSDK: operator-framework/java-operator-sdk#1170

@csviri
Contributor

csviri commented Sep 2, 2022

In summary we have two goals and related issues in JOSDK:

  1. Operator should trigger the error state of the CR when deserialization fails (operator-framework/java-operator-sdk#1170) - to handle the situation when a custom resource cannot be deserialized, and to have a callback to handle that resource.
  2. Umbrella issue for handling Informer Permission Errors and Probes (operator-framework/java-operator-sdk#1422) - when a permission for a resource is removed (or not present on startup), we would like to have the possibility to periodically try to reconnect the watch.

This might be covered by one callback method at the informer level.
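A speculative sketch of what such a single informer-level callback might look like (every name here is hypothetical):

```java
import io.fabric8.kubernetes.client.WatcherException;

// Speculative sketch only: one informer-level callback that could cover both
// deserialization failures (with access to the raw payload) and
// permission/connection errors. All names are hypothetical.
@FunctionalInterface
interface InformerErrorHandler {
  /**
   * @param rawEvent the raw watch event payload if available (e.g. on a
   *                 deserialization failure), otherwise null
   * @param cause    the underlying failure
   * @return true to keep retrying the watch, false to stop the informer
   */
  boolean onError(String rawEvent, WatcherException cause);
}
```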

@shawkins shawkins force-pushed the add-exception-handler branch 2 times, most recently from 285acb3 to c596971 on September 2, 2022 12:02
@shawkins
Contributor

shawkins commented Sep 2, 2022

@metacosm @csviri I've now hijacked this PR with something that should address #4369 as well. This PR does not introduce a backoff - we're still using the watch retry delay. Nor does it change the startup - that is a potentially breaking change, so we likely want to handle it separately. Let me know if you have any issues with the draft, and then I'll clean this up and add some tests.

Collaborator Author

@metacosm metacosm left a comment


The approach sounds good to me, though I guess ideally we'd also pass the raw object that caused the deserialisation error, if at all possible…

@shawkins shawkins marked this pull request as ready for review September 7, 2022 11:22
@shawkins
Contributor

shawkins commented Sep 7, 2022

Marking as ready for review. Added a couple of tests and further cleaned up the reconnect logic. Also brought our abstract watch manager into closer conformity with the go retry watcher logic - there were several cases where it handled the retry on its own rather than relying on downstream logic. The raw message has also been added to the watch exception in more cases.

also stopping messages when a watch closes
@sonarcloud

sonarcloud bot commented Sep 8, 2022

SonarCloud Quality Gate failed.

  • 3 Bugs (rating B)
  • 0 Vulnerabilities (rating A)
  • 0 Security Hotspots (rating A)
  • 3 Code Smells (rating A)
  • 41.5% Coverage
  • 0.0% Duplication

@shawkins
Contributor

shawkins commented Sep 8, 2022

@manusa I would opt for squashing this given how many intermediate commits no longer make sense.

@manusa
Member

manusa commented Sep 8, 2022

If there are no merge commits, I'll leave just two commits: the initial approach (Chris's) and your final take and refactor.

Ok, I just saw those merge master commits now. Squashing.

@metacosm
Collaborator Author

metacosm commented Sep 8, 2022

Actually wondering if we could release a new version and get that version in Quarkus before 2.13 is cut… 😨

@manusa manusa merged commit 951a7f5 into master Sep 8, 2022
@manusa manusa deleted the add-exception-handler branch September 8, 2022 16:59
@andreaTP
Member

andreaTP commented Sep 8, 2022

🎉 Amazing job, thanks to everyone involved!

shawkins added a commit to shawkins/kubernetes-client that referenced this pull request Sep 14, 2022
shawkins added a commit to shawkins/kubernetes-client that referenced this pull request Sep 14, 2022
shawkins added a commit to shawkins/kubernetes-client that referenced this pull request Oct 2, 2022
also adding stacktraces for blocking request exceptions
@shawkins shawkins mentioned this pull request Oct 2, 2022
shawkins added a commit to shawkins/kubernetes-client that referenced this pull request Oct 5, 2022
also adding stacktraces for blocking request exceptions
manusa pushed a commit that referenced this pull request Oct 7, 2022
also adding stacktraces for blocking request exceptions