ambient: fix nil pointer when pod cache is stale #50878

Open
wants to merge 2 commits into
base: master

howardjohn
Member

I ran into this in an extremely bespoke and unsupported environment,
but I think it could occur in the real world. We loop outside of
GetPodIfAmbient waiting for the pod to show up, but if the lookup still
fails we panic. We want to get an error instead.

@howardjohn howardjohn requested a review from a team as a code owner May 7, 2024 00:26
@howardjohn howardjohn added release-notes-none Indicates a PR that does not require release notes. cherrypick/release-1.22 Set this label on a PR to auto-merge it to the release-1.22 branch labels May 7, 2024
@istio-testing istio-testing added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 7, 2024
@istio-testing istio-testing added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 7, 2024
@howardjohn howardjohn added the do-not-merge/hold Block automatic merging of a PR. label May 7, 2024
@howardjohn
Member Author

I think this logic isn't quite right. I will fix it up tomorrow.

@istio-testing istio-testing removed the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 7, 2024
@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 7, 2024
@howardjohn howardjohn added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed do-not-merge/hold Block automatic merging of a PR. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 7, 2024
@istio-policy-bot

🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test?

Courtesy of your friendly test nag.

@bleggett
Contributor

bleggett commented May 7, 2024

We are looping outside of GetPodIfAmbient for the pod to show up, but if it fails we panic. We want to instead get an error.

I assume you mean PodRedirectionEnabled panics if pod is nil, since we were already returning/getting errors in the other spots?

@howardjohn
Member Author

We are looping outside of GetPodIfAmbient for the pod to show up, but if it fails we panic. We want to instead get an error.

I assume you mean PodRedirectionEnabled panics if pod is nil, since we were already returning/getting errors in the other spots?

yes
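
To make the failure mode concrete, here is a minimal sketch, not the actual Istio code. The `Pod` type, the annotation key, and both function names are hypothetical stand-ins; the only assumption is that the check reads a field through the pod pointer, which panics in Go when that pointer is nil.

```go
package main

import "errors"

// Pod is a hypothetical stand-in for corev1.Pod; only annotations matter here.
type Pod struct {
	Annotations map[string]string
}

// redirectionEnabled mimics a helper that dereferences the pod without a nil
// check. Calling it with a nil pod panics with a nil pointer dereference.
// The annotation key is made up for this sketch.
func redirectionEnabled(pod *Pod) bool {
	return pod.Annotations["example/redirection"] == "enabled"
}

// checkedRedirectionEnabled returns an error instead of panicking when the
// earlier lookup produced no pod.
func checkedRedirectionEnabled(pod *Pod) (bool, error) {
	if pod == nil {
		return false, errors.New("pod not found in cache")
	}
	return redirectionEnabled(pod), nil
}
```

With the guard in place, a stale cache surfaces as an ordinary error that the caller can log or retry on, rather than crashing the process.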

@@ -200,3 +186,32 @@ func (s *CniPluginServer) ReconcileCNIAddEvent(ctx context.Context, addCmd CNIPl

return nil
}

func (s *CniPluginServer) getPodWithRetry(log *istiolog.Scope, name, namespace string) (*corev1.Pod, error) {
Contributor

I think this was mentioned in other PRs, but if this is now its own function, it would be good to cover behavior with unit tests in cni-watcher_test.go (the lack of this is probably why we had this bug in the first place)

We didn't do that before because it would be a slow test due to the timeouts, but if this is a private func, can we just pass the timeouts in as args and test quick iterations in a unit test to codify correct behavior here?
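
A rough sketch of what passing the timeouts in as arguments could look like, using only the standard library. This is not the PR's implementation: `Pod`, `errNotFound`, `lookupFunc`, and the extra parameters are hypothetical, and the real function takes a log scope and reads from the server's pod cache.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Pod is a hypothetical stand-in for corev1.Pod.
type Pod struct{ Name, Namespace string }

// errNotFound is a hypothetical error that a lookup returns on a cache miss.
var errNotFound = errors.New("pod not found")

// lookupFunc abstracts the cache lookup so a test can supply a fake.
type lookupFunc func(name, namespace string) (*Pod, error)

// getPodWithRetry calls lookup up to maxRetries times, sleeping retryDelay
// between failed attempts. The first nil error ends the loop; if every
// attempt fails, the last error is wrapped and returned.
func getPodWithRetry(lookup lookupFunc, name, namespace string, maxRetries int, retryDelay time.Duration) (*Pod, error) {
	if maxRetries < 1 {
		maxRetries = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		pod, err := lookup(name, namespace)
		if err == nil {
			return pod, nil
		}
		lastErr = err
		if attempt < maxRetries {
			time.Sleep(retryDelay)
		}
	}
	return nil, fmt.Errorf("failed to get pod %s/%s after %d attempts: %w", namespace, name, maxRetries, lastErr)
}
```

A unit test can then pass a tiny or zero delay and a fake lookup that fails a fixed number of times, which keeps the test fast while still exercising the retry path.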

Member Author

The way we have things abstracted still makes this pretty hard, since we don't call this anywhere except in cni-watcher, where all the mocking makes it tricky to flow through.

Contributor

@bleggett bleggett May 7, 2024

Looks like it could be a copypaste of one of the existing tests (e.g. TestCNIPluginServer), but we simply call getPodWithRetry directly without adding an underlying pod to the fake k8s client, right?

That looks like it would simulate querying for a pod that the server's client lacks.

@@ -82,12 +82,19 @@ func setupHandlers(ctx context.Context, kubeClient kube.Client, dataplane MeshDa
return s
}

// GetPodIfAmbient looks up a pod. It returns:
// * An error if the pod cannot be found
// * nil if the pod is found, but does not have ambient enabled
Contributor

Is this just using nil, nil as a sentinel value? Should we just return a different error value for when the pod is found but isn't configured as we expect?
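
One way the alternative raised here might look, as a sketch only. `Pod`, the map-backed cache, and the error names `ErrPodNotFound` and `ErrPodNotAmbient` are all hypothetical and do not come from the PR; the point is just that callers can tell the two outcomes apart with `errors.Is` instead of interpreting `nil, nil`.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a hypothetical stand-in for corev1.Pod with an ambient flag.
type Pod struct {
	Name, Namespace string
	Ambient         bool
}

// Hypothetical sentinel errors, one per distinct outcome.
var (
	ErrPodNotFound   = errors.New("pod not found")
	ErrPodNotAmbient = errors.New("pod is not enabled for ambient")
)

// getPodIfAmbient returns the pod only when it exists and has ambient
// enabled; otherwise it returns a wrapped sentinel error and a nil pod.
func getPodIfAmbient(cache map[string]*Pod, name, namespace string) (*Pod, error) {
	pod, ok := cache[namespace+"/"+name]
	if !ok || pod == nil {
		return nil, fmt.Errorf("%s/%s: %w", namespace, name, ErrPodNotFound)
	}
	if !pod.Ambient {
		return nil, fmt.Errorf("%s/%s: %w", namespace, name, ErrPodNotAmbient)
	}
	return pod, nil
}

// shouldRetry reports whether the error is worth retrying: only a missing
// pod can resolve itself once the cache catches up.
func shouldRetry(err error) bool {
	return errors.Is(err, ErrPodNotFound)
}
```

The trade-off is that "found but not ambient" then travels on the error path even though it is an expected result, so callers have to check for it explicitly before logging it as a failure.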


func (s *CniPluginServer) getPodWithRetry(log *istiolog.Scope, name, namespace string) (*corev1.Pod, error) {
log.Debugf("Checking pod: %s in ns: %s is enabled for ambient", name, namespace)
maxStaleRetries := 10
Contributor

I know this wasn't the case in the previous code, but does it make sense to declare these as constants?

@@ -200,3 +186,32 @@ func (s *CniPluginServer) ReconcileCNIAddEvent(ctx context.Context, addCmd CNIPl

return nil
}

func (s *CniPluginServer) getPodWithRetry(log *istiolog.Scope, name, namespace string) (*corev1.Pod, error) {
log.Debugf("Checking pod: %s in ns: %s is enabled for ambient", name, namespace)
Contributor

Suggested change
- log.Debugf("Checking pod: %s in ns: %s is enabled for ambient", name, namespace)
+ log.Debugf("Checking if pod %s/%s is enabled for ambient", name, namespace)

nit suggestion to better match the non-colon strings we use elsewhere in this PR.

Current:

Checking pod: foo in ns: bar is enabled for ambient

Suggestion:

Checking if pod foo/bar is enabled for ambient

Contributor

double nit, can we do namespace/name if we're going to make a change?

Labels
cherrypick/release-1.22 Set this label on a PR to auto-merge it to the release-1.22 branch release-notes-none Indicates a PR that does not require release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

6 participants