ambient: fix nil pointer when pod cache is stale #50878

howardjohn · 2024-05-07T00:26:45Z


I ran into this in an *extremely* bespoke and unsupported environment, but I think it could occur in the real world. We loop outside of GetPodIfAmbient waiting for the pod to show up, but if the lookup fails we panic. We want to get an error instead.

howardjohn · 2024-05-07T00:37:11Z

I think this logic isn't quite right. Will fix it up tomorrow.

istio-policy-bot · 2024-05-07T18:33:45Z

🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test?

Courtesy of your friendly test nag.

bleggett · 2024-05-07T18:40:31Z

> We are looping outside of GetPodIfAmbient for the pod to show up, but if it fails we panic. We want to instead get an error.

I assume you mean PodRedirectionEnabled panics if pod is nil, since we were already returning/getting errors in the other spots?

howardjohn · 2024-05-07T18:43:07Z

> > We are looping outside of GetPodIfAmbient for the pod to show up, but if it fails we panic. We want to instead get an error.
>
> I assume you mean PodRedirectionEnabled panics if pod is nil, since we were already returning/getting errors in the other spots?

yes
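For readers following along, here is a minimal sketch of the failure mode being confirmed above: a check that dereferences the pod unconditionally panics on nil, while a guarded variant returns an error. The `Pod` type, the annotation key, and both function names are hypothetical stand-ins, not the actual Istio code.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a hypothetical stand-in for corev1.Pod, reduced to what this sketch needs.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// redirectionAnnotation is an illustrative key, not necessarily the one Istio uses.
const redirectionAnnotation = "example.io/redirection"

// redirectionEnabled dereferences pod unconditionally, so a nil pod panics
// with a nil pointer dereference.
func redirectionEnabled(pod *Pod) bool {
	return pod.Annotations[redirectionAnnotation] == "enabled"
}

// checkedRedirectionEnabled reports a nil pod as an error instead of panicking.
func checkedRedirectionEnabled(pod *Pod) (bool, error) {
	if pod == nil {
		return false, errors.New("pod is nil; the cache may be stale")
	}
	return redirectionEnabled(pod), nil
}

func main() {
	var stale *Pod // what a lookup against a stale cache might hand back
	if _, err := checkedRedirectionEnabled(stale); err != nil {
		fmt.Println("error:", err)
	}
}
```

The guard could equally live in the caller; the point is only that the nil case surfaces as an error the retry loop can act on.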

bleggett · 2024-05-07T18:44:30Z

cni/pkg/nodeagent/cni-watcher.go

@@ -200,3 +186,32 @@ func (s *CniPluginServer) ReconcileCNIAddEvent(ctx context.Context, addCmd CNIPl

 	return nil
 }
+
+func (s *CniPluginServer) getPodWithRetry(log *istiolog.Scope, name, namespace string) (*corev1.Pod, error) {


I think this was mentioned in other PRs, but if this is now its own function, it would be good to cover behavior with unit tests in cni-watcher_test.go (the lack of this is probably why we had this bug in the first place)

We didn't do that before because it would be a slow test due to the timeouts, but if this is a private func, can we just pass the timeouts in as args and test quick iterations in a unit test to codify correct behavior here?

The way we have things abstracted actually still makes this pretty hard, since we don't call this anywhere except in cni-watcher, where all the mocking makes it tricky to flow through.

Looks like it could be a copypaste of one of the existing tests (e.g. TestCNIPluginServer), but we simply call getPodWithRetry directly without adding an underlying pod to the fake k8s client, right?

That looks like that would simulate querying for a pod the server client lacks.
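A rough sketch of the "pass the timeouts in as args" idea from this thread, under stated assumptions: `lookupFunc`, `getWithRetry`, and `errNotFound` are hypothetical names, and the lookup is a plain function rather than the real informer-backed call. With the attempt count and delay injected, a test can exercise the exhausted-retries path in microseconds.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// lookupFunc is a hypothetical stand-in for the pod lookup; it returns the
// pod name on success.
type lookupFunc func(name, namespace string) (string, error)

var errNotFound = errors.New("pod not found")

// getWithRetry calls lookup until it succeeds or maxAttempts is reached,
// sleeping delay between failed attempts. It always tries at least once.
// Attempt count and delay are parameters so tests can pass tiny values.
func getWithRetry(lookup lookupFunc, name, namespace string, maxAttempts int, delay time.Duration) (string, error) {
	for attempt := 1; ; attempt++ {
		pod, err := lookup(name, namespace)
		if err == nil {
			return pod, nil
		}
		if attempt >= maxAttempts {
			return "", fmt.Errorf("pod %s/%s not available after %d attempts: %w", namespace, name, attempt, err)
		}
		time.Sleep(delay)
	}
}

func main() {
	calls := 0
	// Simulates a cache that only sees the pod on the third query.
	flaky := func(name, namespace string) (string, error) {
		calls++
		if calls < 3 {
			return "", errNotFound
		}
		return name, nil
	}
	pod, err := getWithRetry(flaky, "foo", "bar", 5, time.Millisecond)
	fmt.Println(pod, err, calls)
}
```

A test for the "pod never appears" case would pass a lookup that always fails and assert on both the returned error and the number of calls.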

ilrudie · 2024-05-07T20:25:07Z

cni/pkg/nodeagent/informers.go

@@ -82,12 +82,19 @@ func setupHandlers(ctx context.Context, kubeClient kube.Client, dataplane MeshDa
 	return s
 }

+// GetPodIfAmbient looks up a pod. It returns:
+// * An error if the pod cannot be found
+// * nil if the pod is found, but does not have ambient enabled


Is this just using nil, nil as a sentinel value? Should we just return a different error value for when the pod is found but isn't configured as we expect?
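To make the alternative in this comment concrete, here is a sketch that replaces the `nil, nil` sentinel with distinct error values that callers match using `errors.Is`. Everything here is an assumption for illustration: the `Pod` type, the map-backed cache, and the `ErrPodNotFound` / `ErrPodNotAmbient` names do not come from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a hypothetical stand-in for corev1.Pod.
type Pod struct {
	Name    string
	Ambient bool
}

// Hypothetical sentinel errors; these names are not from the PR.
var (
	ErrPodNotFound   = errors.New("pod not found")
	ErrPodNotAmbient = errors.New("pod is not enabled for ambient")
)

// getPodIfAmbient never returns (nil, nil): every non-success outcome
// carries its own error value.
func getPodIfAmbient(cache map[string]*Pod, key string) (*Pod, error) {
	pod, ok := cache[key]
	if !ok || pod == nil {
		return nil, fmt.Errorf("%w: %s", ErrPodNotFound, key)
	}
	if !pod.Ambient {
		return nil, fmt.Errorf("%w: %s", ErrPodNotAmbient, key)
	}
	return pod, nil
}

// classify maps a lookup error to the action a caller might take.
func classify(err error) string {
	switch {
	case err == nil:
		return "proceed"
	case errors.Is(err, ErrPodNotAmbient):
		return "skip"
	case errors.Is(err, ErrPodNotFound):
		return "retry"
	default:
		return "fail"
	}
}

func main() {
	cache := map[string]*Pod{"bar/foo": {Name: "foo"}}
	_, err := getPodIfAmbient(cache, "bar/foo")
	fmt.Println(classify(err))
	_, err = getPodIfAmbient(cache, "bar/missing")
	fmt.Println(classify(err))
}
```

The tradeoff is that "found but not ambient" becomes an error path callers must handle explicitly, in exchange for a returned pod that is never nil when the error is nil.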

MorrisLaw · 2024-05-07T20:30:03Z

cni/pkg/nodeagent/cni-watcher.go

+
+func (s *CniPluginServer) getPodWithRetry(log *istiolog.Scope, name, namespace string) (*corev1.Pod, error) {
+	log.Debugf("Checking pod: %s in ns: %s is enabled for ambient", name, namespace)
+	maxStaleRetries := 10


I know this wasn't the case in the previous code, but does it make sense to declare these as constants?

MorrisLaw · 2024-05-07T20:38:56Z

cni/pkg/nodeagent/cni-watcher.go

@@ -200,3 +186,32 @@ func (s *CniPluginServer) ReconcileCNIAddEvent(ctx context.Context, addCmd CNIPl

 	return nil
 }
+
+func (s *CniPluginServer) getPodWithRetry(log *istiolog.Scope, name, namespace string) (*corev1.Pod, error) {
+	log.Debugf("Checking pod: %s in ns: %s is enabled for ambient", name, namespace)


Suggested change

-	log.Debugf("Checking pod: %s in ns: %s is enabled for ambient", name, namespace)
+	log.Debugf("Checking if pod %s/%s is enabled for ambient", name, namespace)

nit suggestion to better match the non-colon strings we use elsewhere in this PR.

Current:

Checking pod: foo in ns: bar is enabled for ambient

Suggestion:

Checking if pod foo/bar is enabled for ambient

double nit, can we do namespace/name if we're going to make a change?
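A small sketch of the namespace/name ordering requested here, which also matches the key format Kubernetes tooling commonly uses. The `podKey` helper is hypothetical and only illustrates the argument order; the real change would simply swap the arguments in the `log.Debugf` call.

```go
package main

import "fmt"

// podKey formats a pod reference as namespace/name.
// Hypothetical helper for illustration; not part of the PR.
func podKey(namespace, name string) string {
	return fmt.Sprintf("%s/%s", namespace, name)
}

func main() {
	// Pod "foo" in namespace "bar", as in the example strings above.
	fmt.Printf("Checking if pod %s is enabled for ambient\n", podKey("bar", "foo"))
}
```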

howardjohn requested a review from a team as a code owner May 7, 2024 00:26

howardjohn added the release-notes-none and cherrypick/release-1.22 labels May 7, 2024

istio-testing added the size/XS label May 7, 2024

ambient: fix nil pointer when pod cache is stale

db9c0c9


howardjohn force-pushed the ambient/fix-nil branch from 8da0621 to db9c0c9 Compare May 7, 2024 00:32

istio-testing added size/S and removed size/XS labels May 7, 2024

howardjohn added the do-not-merge/hold label May 7, 2024

cleanup

88bbbc1

istio-testing removed the size/S label May 7, 2024

howardjohn assigned bleggett May 7, 2024

istio-testing added the size/M label May 7, 2024

howardjohn added size/S and removed do-not-merge/hold, size/M labels May 7, 2024

bleggett reviewed May 7, 2024


ilrudie reviewed May 7, 2024


MorrisLaw reviewed May 7, 2024
