
where resources already exist but requests are generated by CREATE #2290

Closed · 10000-ki opened this issue Mar 15, 2024 · 43 comments · Fixed by #2309

Comments

10000-ki commented Mar 15, 2024

Bug Report

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at: https://172.23.0.1:443/apis/policy/v1/namespaces/n3r-platform-opensearch/poddisruptionbudgets/os-osfarm-cluster-manager-pdb?fieldManager=opensearchreconciler&force=true. Message: Operation cannot be fulfilled on poddisruptionbudgets.policy "os-osfarm-cluster-manager-pdb": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=policy, kind=poddisruptionbudgets, name=os-osfarm-cluster-manager-pdb, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on poddisruptionbudgets.policy "os-osfarm-cluster-manager-pdb": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:507)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handlePatch(OperationSupport.java:419)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handlePatch(OperationSupport.java:397)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handlePatch(BaseOperation.java:713)
	at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$patch$2(HasMetadataOperation.java:232)
	at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:237)
	at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:252)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:1132)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:92)
	at io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource.create(KubernetesDependentResource.java:140)
	at io.javaoperatorsdk.operator.processing.dependent.kubernetes.CRUDKubernetesDependentResource.create(CRUDKubernetesDependentResource.java:16)
	at io.javaoperatorsdk.operator.processing.dependent.AbstractDependentResource.handleCreate(AbstractDependentResource.java:114)
	at io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource.handleCreate(KubernetesDependentResource.java:111)
	at io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource.handleCreate(KubernetesDependentResource.java:32)
	at io.javaoperatorsdk.operator.processing.dependent.AbstractDependentResource.reconcile(AbstractDependentResource.java:62)
	at io.javaoperatorsdk.operator.processing.dependent.SingleDependentResourceReconciler.reconcile(SingleDependentResourceReconciler.java:19)
	at io.javaoperatorsdk.operator.processing.dependent.AbstractDependentResource.reconcile(AbstractDependentResource.java:52)
	at io.javaoperatorsdk.operator.processing.dependent.workflow.WorkflowReconcileExecutor$NodeReconcileExecutor.doRun(WorkflowReconcileExecutor.java:115)
	at io.javaoperatorsdk.operator.processing.dependent.workflow.NodeExecutor.run(NodeExecutor.java:22)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

The create path is entered from AbstractDependentResource.handleCreate (AbstractDependentResource.java:114).

[Screenshot: 2024-03-15 5:41 PM]

What did you do?

Creating a PDB (PodDisruptionBudget) with a DependentResource.
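For context, a minimal dependent resource for such a PDB might look roughly like the sketch below. This is a hypothetical illustration only: the primary type OpenSearchCluster, the class name, and the label/selector values are invented for the example and are not taken from the reporter's actual code.

```java
import io.fabric8.kubernetes.api.model.IntOrString;
import io.fabric8.kubernetes.api.model.policy.v1.PodDisruptionBudget;
import io.fabric8.kubernetes.api.model.policy.v1.PodDisruptionBudgetBuilder;
import io.javaoperatorsdk.operator.api.reconciler.Context;
import io.javaoperatorsdk.operator.processing.dependent.kubernetes.CRUDKubernetesDependentResource;
import io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependent;

// OpenSearchCluster stands in for the reporter's custom resource class (hypothetical).
@KubernetesDependent
public class ClusterManagerPdbDependent
    extends CRUDKubernetesDependentResource<PodDisruptionBudget, OpenSearchCluster> {

  public ClusterManagerPdbDependent() {
    super(PodDisruptionBudget.class);
  }

  @Override
  protected PodDisruptionBudget desired(OpenSearchCluster primary, Context<OpenSearchCluster> context) {
    // Build the desired state from scratch, setting only the fields this operator manages.
    return new PodDisruptionBudgetBuilder()
        .withNewMetadata()
          .withName(primary.getMetadata().getName() + "-cluster-manager-pdb")
          .withNamespace(primary.getMetadata().getNamespace())
        .endMetadata()
        .withNewSpec()
          .withMaxUnavailable(new IntOrString(1))
          .withNewSelector()
            .addToMatchLabels("role", "cluster-manager") // illustrative label only
          .endSelector()
        .endSpec()
        .build();
  }
}
```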

However, even though the resource is already created:

k get PodDisruptionBudget
NAME                            MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
os-osfarm-cluster-manager-pdb   N/A             1                 1                     88m

When an additional reconcile loop runs, the request is sent as a Create rather than an Update.

If the secondary resource already exists, shouldn't the request be an update?

Why is it issued as a create?

409 errors keep occurring, which is a problem.

What did you expect to see?

What did you see instead? Under which circumstances?

no

Environment

Kubernetes cluster type:

Kubernetes

java-operator-sdk version (from pom.xml):

v4.4.4

$ java -version

jdk21

$ kubectl version

v1.23.15

Possible Solution

no

Additional context

no

csviri (Collaborator) commented Mar 15, 2024

Hi @10000-ki, there have been improvements regarding this in newer versions; please upgrade to the latest version.

10000-ki (Author):

@csviri which version? I upgraded to 4.6.0, but the same issue occurred.

csviri (Collaborator) commented Mar 15, 2024

Please try the newest (4.8.1); if there are still issues, we will take a look. Note that k8s 1.23 is also a bit old: there were some issues with event sending in that version (which should be fixed now on the fabric8 client side by restarting watches). But if the problem persists we can take a look and try to make a reproducer. The issue is probably rather in the fabric8 client, since we read the latest resource from the informer cache.

10000-ki (Author):

> Note that k8s 1.23 is also a bit old: there were some issues with event sending in that version (which should be fixed now on the fabric8 client side by restarting watches).

I'm using an in-house k8s cluster at my company, and v1.24 is the maximum available version :(
Will that be a problem?

10000-ki (Author):

First, I'll upgrade the JOSDK version to the latest, v4.8.1.

csviri (Collaborator) commented Mar 15, 2024

AFAIK it should not be. There were known issues where k8s stopped sending events at some point (if I remember correctly, still present in 1.23), so the watch had to be restarted periodically in the informers. But that was handled in fabric8, so it should be fine now.

10000-ki (Author):

k get pdb --watch -n=opensearchoperatore2e-createopensearchcluster-6361d -o wide
NAME                                             MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m6s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m6s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m6s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m6s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m6s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m6s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m7s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m7s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m7s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m7s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m7s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m7s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m7s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m8s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m8s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m8s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m8s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m8s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m8s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m8s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m9s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m9s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m9s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m9s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m9s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m9s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m9s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m10s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m10s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m10s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m10s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m10s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m10s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m10s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m11s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m11s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m11s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m11s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m11s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m11s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m11s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m12s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m12s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m12s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m12s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m12s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m12s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m12s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m13s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m13s
os-test-opensearch-cluster-cluster-manager-pdb   N/A             1                 0                     2m13s

Is there anything else I need to check?
After upgrading to v4.8.1, there are a lot more events than before,
and the Create issue is still occurring.

csviri (Collaborator) commented Mar 15, 2024

OK, could you please create a minimal open-source reproducer project that we can run on top of minikube and look into? Based on the logs you provided, it is impossible to figure out.

csviri (Collaborator) commented Mar 15, 2024

Also, you could check whether this is still a problem on newer versions of k8s.

10000-ki (Author):

> OK, could you please create a minimal open-source reproducer project that we can run on top of minikube and look into? Based on the logs you provided, it is impossible to figure out.

Okay, I will try.

10000-ki (Author):

@csviri
On k8s 1.29 it seems to run smoothly (I tested it locally).

By the way, can you tell me exactly which version had the issue related to the k8s events, and up to which version it has an impact?

csviri (Collaborator) commented Mar 18, 2024

> @csviri On k8s 1.29 it seems to run smoothly (I tested it locally).
>
> By the way, can you tell me exactly which version had the issue related to the k8s events, and up to which version it has an impact?

Please take a look at these:
fabric8io/kubernetes-client#4781
kubernetes/kubernetes#102464

but there might be some newer ones. Maybe @shawkins knows more.

shawkins (Collaborator):

I'm not aware of newer definitive underlying kube issues. There were several later client changes to limit any potential issues with the api server not sending events - we'll restart the watches at most every 10 minutes.

10000-ki (Author) commented Mar 19, 2024

@csviri @shawkins

I don't know if this is related to the problem I raised.

The problems I observed are as follows:

  1. Even if the secondary resource already exists, a Create request is sent to the API server instead of an update.
  2. A retry occurs with a 409 error.
  3. Infinite retries occur because issue 1 keeps happening.

[Screenshot: 2024-03-19 3:06 PM]

The mayBeActual field is null, even though the resource exists on the k8s server.

Is it because the operator did not receive the resource creation event from k8s?

csviri (Collaborator) commented Mar 19, 2024

> Is it because the operator did not receive the resource creation event from k8s?

Yes, that means it is not in the informer's cache, so no event was received regarding that resource.
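To make the cause and effect concrete, the following is a simplified illustration (not the actual JOSDK source) of why an empty informer cache leads to a create attempt against an object that already exists on the cluster:

```java
import java.util.Optional;

// Simplified illustration only; the real logic lives in AbstractDependentResource /
// KubernetesDependentResource and is more involved.
class CreateOrUpdateDecision<R> {

  Optional<R> maybeActual; // what the informer cache returned for the secondary resource

  void reconcile(R desired) {
    if (maybeActual.isEmpty()) {
      // The cache never received an event for the resource, so the dependent resource
      // assumes it does not exist and takes the create path. The request then conflicts
      // (HTTP 409) with the object that is already present on the API server.
      create(desired);
    } else if (!matches(maybeActual.get(), desired)) {
      update(maybeActual.get(), desired);
    }
    // otherwise: actual matches desired, nothing to do
  }

  void create(R desired) { /* issue the create / apply request */ }

  void update(R actual, R desired) { /* issue the update / apply request */ }

  boolean matches(R actual, R desired) { return false; /* placeholder matcher */ }
}
```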

10000-ki (Author):

> Yes, that means it is not in the informer's cache, so no event was received regarding that resource.

I see.

> Please take a look at these:
> fabric8io/kubernetes-client#4781
> kubernetes/kubernetes#102464

From the above, it seems to have been resolved in k8s v1.21 and fabric8 v6.5.0; let me check a little more.

10000-ki (Author) commented Mar 19, 2024

@csviri

Also, when I raised the JOSDK version from 4.4.4 to 4.5.0, I saw a problem where the resourceVersion keeps going up.

kubernetes/kubernetes#106388

This seems to have been fixed in k8s v1.25.

May I know why the problem occurs in v4.5.0 when there was no problem in v4.4.4?

6b9248f

Is this modification relevant?

csviri (Collaborator) commented Mar 19, 2024

In exactly what situation does it go up?

My bet is that it is rather related to this change: f8137f2

10000-ki (Author):

@csviri

https://github.com/operator-framework/java-operator-sdk/pull/2012/files#diff-aa20588ab4b1ff4f171a897d4042d1b055c02737b067d09aaeb9cfd770adf3a0R179

This change seems to have an impact.
I think JOSDK deletes and adds values in the annotation to manage the previous-state values, and at that point the resourceVersion is updated.

csviri (Collaborator) commented Mar 19, 2024

> This change seems to have an impact.
> I think JOSDK deletes and adds values in the annotation to manage the previous-state values, and at that point the resourceVersion is updated.

Yes, but this should not be a problem, since the resource is ideally updated only when the desired state changes, i.e. when there needs to be an update (a patch with SSA in this case) anyway. So there would be a resourceVersion change anyway.

Do you have another controller making changes on the same resources?

see also: #2249

10000-ki (Author):

@csviri

> Yes, but this should not be a problem, since the resource is ideally updated only when the desired state changes, i.e. when there needs to be an update (a patch with SSA in this case) anyway. So there would be a resourceVersion change anyway.

A change occurs in the dependent resource's resourceVersion.
It seems the Informer detects that the data has changed and an event keeps coming in.

The desired method of the DependentResource continues to be called.

10000-ki (Author):

> Do you have another controller making changes on the same resources?

Nope.

csviri (Collaborator) commented Mar 19, 2024

> A change occurs in the dependent resource's resourceVersion.
> It seems the Informer detects that the data has changed and an event keeps coming in.
> The desired method of the DependentResource continues to be called.

Yes, but in the next reconciliation the update should be skipped. The matcher would detect that the resources are the same, but there are also other mechanisms to detect that.

Can you create a reproducer, please? This is not happening in our test scenarios.

10000-ki (Author):

> see also: #2249

@csviri thank you for sharing.

> Yes, but in the next reconciliation the update should be skipped. The matcher would detect that the resources are the same, but there are also other mechanisms to detect that.

In my case it falls into an infinite loop.

https://github.com/10000-ki/test-operator/tree/main/src/main/kotlin/opensearch/operator/api/v1/cr/dependent

Can you look at this code? I gave you permission.

The actual code is on the GitHub Enterprise instance at my company; I just extracted the operator part of the actual project.

csviri (Collaborator) commented Mar 19, 2024

@10000-ki this seems to be a relatively complex project; could you please create a simplistic reproducer, ideally with one dependent resource that causes the issue (and in Java + Maven if possible)? That would speed this up very much, thanks.

10000-ki (Author) commented Mar 19, 2024

> this seems to be a relatively complex project; could you please create a simplistic reproducer, ideally with one dependent resource that causes the issue (and in Java + Maven if possible)? That would speed this up very much, thanks.

Yes, I will try.

10000-ki (Author):

@csviri
On a random note, thank you for always following up on the issues so well; I'm a big fan of JOSDK.

csviri (Collaborator) commented Mar 19, 2024

thank you @10000-ki !!

10000-ki (Author) commented Mar 20, 2024

@csviri

> Yes, but in the next reconciliation the update should be skipped. The matcher would detect that the resources are the same, but there are also other mechanisms to detect that.

Where can I see the logic of this mechanism? I couldn't find that code.

[Screenshot: 2024-03-20 1:51 PM]

Shouldn't we add the data to metadata only if the actual and desired states differ, like above?

Sorry :( I found it.

[Screenshot: 2024-03-20 5:26 PM]

10000-ki (Author):

2024-03-20 17:37:45.250 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L232 - Using PrimaryToSecondaryIndex to find secondary resources for primary: ResourceID{name='osfarm', namespace='n3r-platform-opensearch'}. Found secondary ids: [ResourceID{name='os-osfarm-data-hot', namespace='n3r-platform-opensearch'}, ResourceID{name='os-osfarm-cluster-manager', namespace='n3r-platform-opensearch'}] 
2024-03-20 17:37:45.250 [pool-5-thread-8][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L225 - Updating target resource with type: class io.fabric8.kubernetes.api.model.apps.Deployment, with id: ResourceID{name='os-osfarm-dashboards', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.250 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L115 - Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='os-osfarm-data-hot', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.251 [pool-5-thread-5][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.AbstractDependentResource#L110 - Updating 'os-osfarm-coordinator' Deployment for primary ResourceID{name='osfarm', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.251 [pool-5-thread-5][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L126 - Updating actual resource: ResourceID{name='os-osfarm-coordinator', namespace='n3r-platform-opensearch'} version: 4787803795
2024-03-20 17:37:45.251 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L118 - Resource found in cache: true for id: ResourceID{name='os-osfarm-data-hot', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.251 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L115 - Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='os-osfarm-cluster-manager', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.251 [pool-5-thread-5][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L225 - Updating target resource with type: class io.fabric8.kubernetes.api.model.apps.Deployment, with id: ResourceID{name='os-osfarm-coordinator', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.251 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L118 - Resource found in cache: true for id: ResourceID{name='os-osfarm-cluster-manager', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.251 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.AbstractDependentResource#L110 - Updating 'os-osfarm-cluster-manager' StatefulSet for primary ResourceID{name='osfarm', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.252 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L126 - Updating actual resource: ResourceID{name='os-osfarm-cluster-manager', namespace='n3r-platform-opensearch'} version: 4787803798
2024-03-20 17:37:45.252 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L225 - Updating target resource with type: class io.fabric8.kubernetes.api.model.apps.StatefulSet, with id: ResourceID{name='os-osfarm-cluster-manager', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.260 [pool-5-thread-7][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L139 - Resource version after update: 4787803803
2024-03-20 17:37:45.260 [pool-5-thread-7][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L115 - Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='os-osfarm-certificates-secret', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.260 [pool-5-thread-7][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L118 - Resource found in cache: true for id: ResourceID{name='os-osfarm-certificates-secret', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.260 [pool-5-thread-7][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.TemporaryResourceCache#L91 - Temporarily moving ahead to target version 4787803803 for resource id: ResourceID{name='os-osfarm-certificates-secret', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.260 [pool-5-thread-7][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.workflow.WorkflowReconcileExecutor#L148 - Setting already reconciled for: DependentResourceNode{com.navercorp.opensearch.operator.api.v1.cr.dependent.secret.OpenSearchCertificatesSecret@516da3a7} primaryID: ResourceID{name='osfarm', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.261 [-56859396-pool-2-thread-4][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L123 - On update event received for resource id: ResourceID{name='os-osfarm-certificates-secret', namespace='n3r-platform-opensearch'} type: Secret version: 4787803803 old version: 4787803794 
2024-03-20 17:37:45.261 [-56859396-pool-2-thread-4][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L183 - Resource found in temporal cache for id: ResourceID{name='os-osfarm-certificates-secret', namespace='n3r-platform-opensearch'} resource versions equal: true
2024-03-20 17:37:45.262 [-56859396-pool-2-thread-4][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L154 - Skipping event propagation for UPDATE, since was a result of a reconcile action. Resource ID: ResourceID{name='os-osfarm-certificates-secret', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.275 [pool-5-thread-8][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L139 - Resource version after update: 4787803806
2024-03-20 17:37:45.275 [pool-5-thread-8][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L115 - Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='os-osfarm-dashboards', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.275 [-56859396-pool-2-thread-2][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L123 - On update event received for resource id: ResourceID{name='os-osfarm-dashboards', namespace='n3r-platform-opensearch'} type: Deployment version: 4787803806 old version: 4787803799 
2024-03-20 17:37:45.275 [pool-5-thread-8][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L118 - Resource found in cache: true for id: ResourceID{name='os-osfarm-dashboards', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.276 [pool-5-thread-8][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.workflow.WorkflowReconcileExecutor#L148 - Setting already reconciled for: DependentResourceNode{com.navercorp.opensearch.operator.api.v1.cr.dependent.deployment.OpenSearchDashboardsDeployment@5a8d22e5} primaryID: ResourceID{name='osfarm', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.276 [-56859396-pool-2-thread-2][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L154 - Skipping event propagation for UPDATE, since was a result of a reconcile action. Resource ID: ResourceID{name='os-osfarm-dashboards', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.277 [pool-5-thread-5][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L139 - Resource version after update: 4787803807
2024-03-20 17:37:45.277 [pool-5-thread-5][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L115 - Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='os-osfarm-coordinator', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.277 [-56859396-pool-2-thread-4][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L123 - On update event received for resource id: ResourceID{name='os-osfarm-coordinator', namespace='n3r-platform-opensearch'} type: Deployment version: 4787803807 old version: 4787803795 
2024-03-20 17:37:45.277 [pool-5-thread-5][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L118 - Resource found in cache: true for id: ResourceID{name='os-osfarm-coordinator', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.277 [pool-5-thread-5][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.TemporaryResourceCache#L91 - Temporarily moving ahead to target version 4787803807 for resource id: ResourceID{name='os-osfarm-coordinator', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.277 [pool-5-thread-5][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.workflow.WorkflowReconcileExecutor#L148 - Setting already reconciled for: DependentResourceNode{com.navercorp.opensearch.operator.api.v1.cr.dependent.deployment.OpenSearchNodeDeployment@78f577c6} primaryID: ResourceID{name='osfarm', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.277 [-56859396-pool-2-thread-4][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L183 - Resource found in temporal cache for id: ResourceID{name='os-osfarm-coordinator', namespace='n3r-platform-opensearch'} resource versions equal: true
2024-03-20 17:37:45.277 [-56859396-pool-2-thread-4][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L154 - Skipping event propagation for UPDATE, since was a result of a reconcile action. Resource ID: ResourceID{name='os-osfarm-coordinator', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.294 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L139 - Resource version after update: 4787803813
2024-03-20 17:37:45.294 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L115 - Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='os-osfarm-cluster-manager', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.294 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L118 - Resource found in cache: true for id: ResourceID{name='os-osfarm-cluster-manager', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.294 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.TemporaryResourceCache#L91 - Temporarily moving ahead to target version 4787803813 for resource id: ResourceID{name='os-osfarm-cluster-manager', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.294 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.AbstractDependentResource#L110 - Updating 'os-osfarm-data-hot' StatefulSet for primary ResourceID{name='osfarm', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.294 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L126 - Updating actual resource: ResourceID{name='os-osfarm-data-hot', namespace='n3r-platform-opensearch'} version: 4787803801
2024-03-20 17:37:45.295 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L225 - Updating target resource with type: class io.fabric8.kubernetes.api.model.apps.StatefulSet, with id: ResourceID{name='os-osfarm-data-hot', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.295 [-56859396-pool-2-thread-2][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L123 - On update event received for resource id: ResourceID{name='os-osfarm-cluster-manager', namespace='n3r-platform-opensearch'} type: StatefulSet version: 4787803813 old version: 4787803798 
2024-03-20 17:37:45.295 [-56859396-pool-2-thread-2][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L183 - Resource found in temporal cache for id: ResourceID{name='os-osfarm-cluster-manager', namespace='n3r-platform-opensearch'} resource versions equal: true
2024-03-20 17:37:45.295 [-56859396-pool-2-thread-2][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L154 - Skipping event propagation for UPDATE, since was a result of a reconcile action. Resource ID: ResourceID{name='os-osfarm-cluster-manager', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.318 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.kubernetes.KubernetesDependentResource#L139 - Resource version after update: 4787803819
2024-03-20 17:37:45.318 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L115 - Resource not found in temporary cache reading it from informer cache, for Resource ID: ResourceID{name='os-osfarm-data-hot', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.318 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.ManagedInformerEventSource#L118 - Resource found in cache: true for id: ResourceID{name='os-osfarm-data-hot', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.318 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.TemporaryResourceCache#L91 - Temporarily moving ahead to target version 4787803819 for resource id: ResourceID{name='os-osfarm-data-hot', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.319 [pool-5-thread-9][][] DEBUG io.javaoperatorsdk.operator.processing.dependent.workflow.WorkflowReconcileExecutor#L148 - Setting already reconciled for: DependentResourceNode{com.navercorp.opensearch.operator.api.v1.cr.dependent.sts.OpenSearchNodeStatefulSet@412cc1cd} primaryID: ResourceID{name='osfarm', namespace='n3r-platform-opensearch'}
2024-03-20 17:37:45.319 [-56859396-pool-2-thread-4][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L123 - On update event received for resource id: ResourceID{name='os-osfarm-data-hot', namespace='n3r-platform-opensearch'} type: StatefulSet version: 4787803819 old version: 4787803801 
2024-03-20 17:37:45.319 [-56859396-pool-2-thread-4][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L183 - Resource found in temporal cache for id: ResourceID{name='os-osfarm-data-hot', namespace='n3r-platform-opensearch'} resource versions equal: true
2024-03-20 17:37:45.319 [-56859396-pool-2-thread-4][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L154 - Skipping event propagation for UPDATE, since was a result of a reconcile action. Resource ID: ResourceID{name='os-osfarm-data-hot', namespace='n3r-platform-opensearch'}

Anyway, the resourceVersion keeps going up and there is an infinite reconciliation problem.
I'll look into this a bit more.

shawkins (Collaborator):

This log shows 5 updates of 5 different resources by the operator SDK. The operator then skips propagating the update event for all of them, so this appears to be expected behavior. Can you find a situation in your logs where "Propagating event for ..." appears? The surrounding logs should help determine whether that resourceVersion of that resource was the result of the operator, in which case there would be an event cycle.

10000-ki (Author):

2024-03-22 15:05:37.065 [-1777591638-pool-2-thread-1][][] DEBUG io.javaoperatorsdk.operator.processing.event.source.informer.InformerEventSource#L164 - Propagating event for UPDATE, resource with same version not result of a reconciliation. Resource ID: ResourceID{name='os-opensearch-jihea-coordinator', namespace='n3r-project-jihea-g'}

@shawkins yes, I found one (above).

10000-ki (Author):

[Screenshot: 2024-03-22 3:36 PM]

@shawkins
I think the spec is the same, but only the resourceVersion keeps changing, and the event seems to keep coming in without being skipped.

10000-ki (Author) commented Mar 22, 2024

[Screenshot: 2024-03-22 3:41 PM]

[Screenshot: 2024-03-22 3:51 PM]

New Object

Deployment(apiVersion=apps/v1, kind=Deployment, metadata=ObjectMeta(annotations={deployment.kubernetes.io/revision=1, javaoperatorsdk.io/previous=6df7cf61-ce12-46eb-9fc7-b4d1daec9ab9,4809814242, n3r.navercorp.com/created-by=system:serviceaccount:n3r-platform-opensearch:opensearch-operator, n3r.navercorp.com/last-modification-timestamp=2024-03-22T15:30:21+09:00, n3r.navercorp.com/last-modified-by=system:serviceaccount:n3r-platform-opensearch:opensearch-operator}, creationTimestamp=2024-03-20T08:42:10Z, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[], generateName=null, generation=346, labels={app.kubernetes.io/managed-by=opensearch-operator-v1, app.kubernetes.io/part-of=young-instance-1st, n3r.navercorp.com/instance=young-instance-1st, n3r.navercorp.com/platform=opensearch-farm, os.n3r.navercorp.com/kind=opensearch-dashboards}, managedFields=[ManagedFieldsEntry(apiVersion=apps/v1, fieldsType=FieldsV1, fieldsV1=FieldsV1(additionalProperties={f:metadata={f:annotations={f:javaoperatorsdk.io/previous={}}, f:labels={f:app.kubernetes.io/managed-by={}, f:app.kubernetes.io/part-of={}, f:n3r.navercorp.com/instance={}, f:n3r.navercorp.com/platform={}, f:os.n3r.navercorp.com/kind={}}, f:ownerReferences={k:{"uid":"b1eac04b-cf10-4a41-a9c7-f1d5bcc08b65"}={}}}, f:spec={f:replicas={}, f:selector={}, f:template={f:metadata={f:annotations={f:version={}}, f:labels={f:n3r.navercorp.com/instance={}, f:n3r.navercorp.com/platform={}, f:os.n3r.navercorp.com/kind={}}}, f:spec={f:containers={k:{"name":"opensearch-dashboards"}={.={}, f:env={k:{"name":"DISABLE_SECURITY_DASHBOARDS_PLUGIN"}={.={}, f:name={}, f:value={}}, k:{"name":"OPENSEARCH_HOSTS"}={.={}, f:name={}, f:value={}}}, f:image={}, f:imagePullPolicy={}, f:livenessProbe={f:failureThreshold={}, f:initialDelaySeconds={}, f:periodSeconds={}, f:successThreshold={}, f:tcpSocket={f:port={}}, f:timeoutSeconds={}}, f:name={}, f:ports={k:{"containerPort":5601,"protocol":"TCP"}={.={}, f:containerPort={}}}, f:resources={f:limits={f:cpu={}, f:memory={}}, f:requests={f:cpu={}, f:memory={}}}, f:startupProbe={f:failureThreshold={}, f:initialDelaySeconds={}, f:periodSeconds={}, f:successThreshold={}, f:tcpSocket={f:port={}}, f:timeoutSeconds={}}, f:volumeMounts={k:{"mountPath":"/home1/irteam/apps/opensearch-dashboards/config/opensearch_dashboards.yml"}={.={}, f:mountPath={}, f:name={}, f:subPath={}}}}}, f:restartPolicy={}, f:volumes={k:{"name":"config"}={.={}, f:configMap={f:name={}}, f:name={}}}}}}}), manager=opensearchreconciler, operation=Apply, subresource=null, time=2024-03-22T06:39:50Z, additionalProperties={}), ManagedFieldsEntry(apiVersion=apps/v1, fieldsType=FieldsV1, fieldsV1=FieldsV1(additionalProperties={f:metadata={f:annotations={f:deployment.kubernetes.io/revision={}}}, f:status={f:availableReplicas={}, f:conditions={.={}, k:{"type":"Available"}={.={}, f:lastTransitionTime={}, f:lastUpdateTime={}, f:message={}, f:reason={}, f:status={}, f:type={}}, k:{"type":"Progressing"}={.={}, f:lastTransitionTime={}, f:lastUpdateTime={}, f:message={}, f:reason={}, f:status={}, f:type={}}}, f:observedGeneration={}, f:readyReplicas={}, f:replicas={}, f:updatedReplicas={}}}), manager=kube-controller-manager, operation=Update, subresource=status, time=2024-03-21T07:40:54Z, additionalProperties={})], name=os-young-instance-1st-dashboards, namespace=n3r-project-young-test, ownerReferences=[OwnerReference(apiVersion=n3r.navercorp.com/v1, blockOwnerDeletion=null, controller=null, kind=Opensearch, name=young-instance-1st, 
uid=b1eac04b-cf10-4a41-a9c7-f1d5bcc08b65, additionalProperties={})], resourceVersion=4809862625, selfLink=null, uid=534cfaca-e62f-4fe0-966c-17ec8d0d2dbd, additionalProperties={}), spec=DeploymentSpec(minReadySeconds=null, paused=null, progressDeadlineSeconds=600, replicas=1, revisionHistoryLimit=10, selector=LabelSelector(matchExpressions=[], matchLabels={n3r.navercorp.com/instance=young-instance-1st, n3r.navercorp.com/platform=opensearch-farm, os.n3r.navercorp.com/kind=opensearch-dashboards}, additionalProperties={}), strategy=DeploymentStrategy(rollingUpdate=RollingUpdateDeployment(maxSurge=AnyType(value=25%), maxUnavailable=AnyType(value=25%), additionalProperties={}), type=RollingUpdate, additionalProperties={}), template=PodTemplateSpec(metadata=ObjectMeta(annotations={version=2.11.1}, creationTimestamp=null, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[], generateName=null, generation=null, labels={n3r.navercorp.com/instance=young-instance-1st, n3r.navercorp.com/platform=opensearch-farm, os.n3r.navercorp.com/kind=opensearch-dashboards}, managedFields=[], name=null, namespace=null, ownerReferences=[], resourceVersion=null, selfLink=null, uid=null, additionalProperties={}), spec=PodSpec(activeDeadlineSeconds=null, affinity=null, automountServiceAccountToken=null, containers=[Container(args=[], command=[], env=[EnvVar(name=OPENSEARCH_HOSTS, value=http://os-young-instance-1st-discovery.n3r-project-young-test.svc:10200, valueFrom=null, additionalProperties={}), EnvVar(name=DISABLE_SECURITY_DASHBOARDS_PLUGIN, value=true, valueFrom=null, additionalProperties={}), EnvVar(name=N3R_CLUSTER_NAME, value=ai1, valueFrom=null, additionalProperties={}), EnvVar(name=N3R_CLUSTER_NETWORK, value=develop, valueFrom=null, additionalProperties={}), EnvVar(name=N3R_CLUSTER_PHASE, value=develop, valueFrom=null, additionalProperties={}), EnvVar(name=TZ, value=Asia/Seoul, valueFrom=null, additionalProperties={})], envFrom=[], image=reg.navercorp.com/opensearch/opensearch-dashboards:2.11.1, imagePullPolicy=Always, lifecycle=null, livenessProbe=Probe(exec=null, failureThreshold=3, grpc=null, httpGet=null, initialDelaySeconds=5, periodSeconds=10, successThreshold=1, tcpSocket=TCPSocketAction(host=null, port=AnyType(value=5601), additionalProperties={}), terminationGracePeriodSeconds=null, timeoutSeconds=3, additionalProperties={}), name=opensearch-dashboards, ports=[ContainerPort(containerPort=5601, hostIP=null, hostPort=null, name=null, protocol=TCP, additionalProperties={})], readinessProbe=null, resizePolicy=[], resources=ResourceRequirements(claims=[], limits={cpu=4, memory=4Gi, n3r/type.any=1}, requests={cpu=4, memory=4Gi, n3r/type.any=1}, additionalProperties={}), restartPolicy=null, securityContext=null, startupProbe=Probe(exec=null, failureThreshold=10, grpc=null, httpGet=null, initialDelaySeconds=10, periodSeconds=10, successThreshold=1, tcpSocket=TCPSocketAction(host=null, port=AnyType(value=5601), additionalProperties={}), terminationGracePeriodSeconds=null, timeoutSeconds=3, additionalProperties={}), stdin=null, stdinOnce=null, terminationMessagePath=/dev/termination-log, terminationMessagePolicy=File, tty=null, volumeDevices=[], volumeMounts=[VolumeMount(mountPath=/home1/irteam/apps/opensearch-dashboards/config/opensearch_dashboards.yml, mountPropagation=null, name=config, readOnly=null, subPath=opensearch_dashboards.yml, subPathExpr=null, additionalProperties={})], workingDir=null, additionalProperties={})], dnsConfig=null, dnsPolicy=ClusterFirst, enableServiceLinks=null, 
ephemeralContainers=[], hostAliases=[], hostIPC=null, hostNetwork=null, hostPID=null, hostUsers=null, hostname=null, imagePullSecrets=[], initContainers=[], nodeName=null, nodeSelector={}, os=null, overhead={}, preemptionPolicy=null, priority=null, priorityClassName=null, readinessGates=[], resourceClaims=[], restartPolicy=Always, runtimeClassName=null, schedulerName=default-scheduler, schedulingGates=[], securityContext=PodSecurityContext(fsGroup=null, fsGroupChangePolicy=null, runAsGroup=null, runAsNonRoot=null, runAsUser=null, seLinuxOptions=null, seccompProfile=null, supplementalGroups=[], sysctls=[], windowsOptions=null, additionalProperties={}), serviceAccount=null, serviceAccountName=null, setHostnameAsFQDN=null, shareProcessNamespace=null, subdomain=null, terminationGracePeriodSeconds=30, tolerations=[], topologySpreadConstraints=[], volumes=[Volume(awsElasticBlockStore=null, azureDisk=null, azureFile=null, cephfs=null, cinder=null, configMap=ConfigMapVolumeSource(defaultMode=420, items=[], name=os-young-instance-1st-dashboards-configmap, optional=null, additionalProperties={}), csi=null, downwardAPI=null, emptyDir=null, ephemeral=null, fc=null, flexVolume=null, flocker=null, gcePersistentDisk=null, gitRepo=null, glusterfs=null, hostPath=null, iscsi=null, name=config, nfs=null, persistentVolumeClaim=null, photonPersistentDisk=null, portworxVolume=null, projected=null, quobyte=null, rbd=null, scaleIO=null, secret=null, storageos=null, vsphereVolume=null, additionalProperties={})], additionalProperties={}), additionalProperties={}), additionalProperties={}), status=DeploymentStatus(availableReplicas=1, collisionCount=null, conditions=[DeploymentCondition(lastTransitionTime=2024-03-20T08:42:10Z, lastUpdateTime=2024-03-20T08:42:41Z, message=ReplicaSet "os-young-instance-1st-dashboards-68c98979d6" has successfully progressed., reason=NewReplicaSetAvailable, status=True, type=Progressing, additionalProperties={}), DeploymentCondition(lastTransitionTime=2024-03-21T07:40:54Z, lastUpdateTime=2024-03-21T07:40:54Z, message=Deployment has minimum availability., reason=MinimumReplicasAvailable, status=True, type=Available, additionalProperties={})], observedGeneration=346, readyReplicas=1, replicas=1, unavailableReplicas=null, updatedReplicas=1, additionalProperties={}), additionalProperties={})

Old Object

Deployment(apiVersion=apps/v1, kind=Deployment, metadata=ObjectMeta(annotations={deployment.kubernetes.io/revision=1, javaoperatorsdk.io/previous=6df7cf61-ce12-46eb-9fc7-b4d1daec9ab9,4809814242, n3r.navercorp.com/created-by=system:serviceaccount:n3r-platform-opensearch:opensearch-operator, n3r.navercorp.com/last-modification-timestamp=2024-03-22T15:30:21+09:00, n3r.navercorp.com/last-modified-by=system:serviceaccount:n3r-platform-opensearch:opensearch-operator}, creationTimestamp=2024-03-20T08:42:10Z, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[], generateName=null, generation=346, labels={app.kubernetes.io/managed-by=opensearch-operator-v1, app.kubernetes.io/part-of=young-instance-1st, n3r.navercorp.com/instance=young-instance-1st, n3r.navercorp.com/platform=opensearch-farm, os.n3r.navercorp.com/kind=opensearch-dashboards}, managedFields=[ManagedFieldsEntry(apiVersion=apps/v1, fieldsType=FieldsV1, fieldsV1=FieldsV1(additionalProperties={f:metadata={f:annotations={f:javaoperatorsdk.io/previous={}}, f:labels={f:app.kubernetes.io/managed-by={}, f:app.kubernetes.io/part-of={}, f:n3r.navercorp.com/instance={}, f:n3r.navercorp.com/platform={}, f:os.n3r.navercorp.com/kind={}}, f:ownerReferences={k:{"uid":"b1eac04b-cf10-4a41-a9c7-f1d5bcc08b65"}={}}}, f:spec={f:replicas={}, f:selector={}, f:template={f:metadata={f:annotations={f:version={}}, f:labels={f:n3r.navercorp.com/instance={}, f:n3r.navercorp.com/platform={}, f:os.n3r.navercorp.com/kind={}}}, f:spec={f:containers={k:{"name":"opensearch-dashboards"}={.={}, f:env={k:{"name":"DISABLE_SECURITY_DASHBOARDS_PLUGIN"}={.={}, f:name={}, f:value={}}, k:{"name":"OPENSEARCH_HOSTS"}={.={}, f:name={}, f:value={}}}, f:image={}, f:imagePullPolicy={}, f:livenessProbe={f:failureThreshold={}, f:initialDelaySeconds={}, f:periodSeconds={}, f:successThreshold={}, f:tcpSocket={f:port={}}, f:timeoutSeconds={}}, f:name={}, f:ports={k:{"containerPort":5601,"protocol":"TCP"}={.={}, f:containerPort={}}}, f:resources={f:limits={f:cpu={}, f:memory={}}, f:requests={f:cpu={}, f:memory={}}}, f:startupProbe={f:failureThreshold={}, f:initialDelaySeconds={}, f:periodSeconds={}, f:successThreshold={}, f:tcpSocket={f:port={}}, f:timeoutSeconds={}}, f:volumeMounts={k:{"mountPath":"/home1/irteam/apps/opensearch-dashboards/config/opensearch_dashboards.yml"}={.={}, f:mountPath={}, f:name={}, f:subPath={}}}}}, f:restartPolicy={}, f:volumes={k:{"name":"config"}={.={}, f:configMap={f:name={}}, f:name={}}}}}}}), manager=opensearchreconciler, operation=Apply, subresource=null, time=2024-03-22T06:39:50Z, additionalProperties={}), ManagedFieldsEntry(apiVersion=apps/v1, fieldsType=FieldsV1, fieldsV1=FieldsV1(additionalProperties={f:metadata={f:annotations={f:deployment.kubernetes.io/revision={}}}, f:status={f:availableReplicas={}, f:conditions={.={}, k:{"type":"Available"}={.={}, f:lastTransitionTime={}, f:lastUpdateTime={}, f:message={}, f:reason={}, f:status={}, f:type={}}, k:{"type":"Progressing"}={.={}, f:lastTransitionTime={}, f:lastUpdateTime={}, f:message={}, f:reason={}, f:status={}, f:type={}}}, f:observedGeneration={}, f:readyReplicas={}, f:replicas={}, f:updatedReplicas={}}}), manager=kube-controller-manager, operation=Update, subresource=status, time=2024-03-21T07:40:54Z, additionalProperties={})], name=os-young-instance-1st-dashboards, namespace=n3r-project-young-test, ownerReferences=[OwnerReference(apiVersion=n3r.navercorp.com/v1, blockOwnerDeletion=null, controller=null, kind=Opensearch, name=young-instance-1st, 
uid=b1eac04b-cf10-4a41-a9c7-f1d5bcc08b65, additionalProperties={})], resourceVersion=4809862620, selfLink=null, uid=534cfaca-e62f-4fe0-966c-17ec8d0d2dbd, additionalProperties={}), spec=DeploymentSpec(minReadySeconds=null, paused=null, progressDeadlineSeconds=600, replicas=1, revisionHistoryLimit=10, selector=LabelSelector(matchExpressions=[], matchLabels={n3r.navercorp.com/instance=young-instance-1st, n3r.navercorp.com/platform=opensearch-farm, os.n3r.navercorp.com/kind=opensearch-dashboards}, additionalProperties={}), strategy=DeploymentStrategy(rollingUpdate=RollingUpdateDeployment(maxSurge=AnyType(value=25%), maxUnavailable=AnyType(value=25%), additionalProperties={}), type=RollingUpdate, additionalProperties={}), template=PodTemplateSpec(metadata=ObjectMeta(annotations={version=2.11.1}, creationTimestamp=null, deletionGracePeriodSeconds=null, deletionTimestamp=null, finalizers=[], generateName=null, generation=null, labels={n3r.navercorp.com/instance=young-instance-1st, n3r.navercorp.com/platform=opensearch-farm, os.n3r.navercorp.com/kind=opensearch-dashboards}, managedFields=[], name=null, namespace=null, ownerReferences=[], resourceVersion=null, selfLink=null, uid=null, additionalProperties={}), spec=PodSpec(activeDeadlineSeconds=null, affinity=null, automountServiceAccountToken=null, containers=[Container(args=[], command=[], env=[EnvVar(name=OPENSEARCH_HOSTS, value=http://os-young-instance-1st-discovery.n3r-project-young-test.svc:10200, valueFrom=null, additionalProperties={}), EnvVar(name=DISABLE_SECURITY_DASHBOARDS_PLUGIN, value=true, valueFrom=null, additionalProperties={}), EnvVar(name=N3R_CLUSTER_NAME, value=ai1, valueFrom=null, additionalProperties={}), EnvVar(name=N3R_CLUSTER_NETWORK, value=develop, valueFrom=null, additionalProperties={}), EnvVar(name=N3R_CLUSTER_PHASE, value=develop, valueFrom=null, additionalProperties={}), EnvVar(name=TZ, value=Asia/Seoul, valueFrom=null, additionalProperties={})], envFrom=[], image=reg.navercorp.com/opensearch/opensearch-dashboards:2.11.1, imagePullPolicy=Always, lifecycle=null, livenessProbe=Probe(exec=null, failureThreshold=3, grpc=null, httpGet=null, initialDelaySeconds=5, periodSeconds=10, successThreshold=1, tcpSocket=TCPSocketAction(host=null, port=AnyType(value=5601), additionalProperties={}), terminationGracePeriodSeconds=null, timeoutSeconds=3, additionalProperties={}), name=opensearch-dashboards, ports=[ContainerPort(containerPort=5601, hostIP=null, hostPort=null, name=null, protocol=TCP, additionalProperties={})], readinessProbe=null, resizePolicy=[], resources=ResourceRequirements(claims=[], limits={cpu=4, memory=4Gi, n3r/type.any=1}, requests={cpu=4, memory=4Gi, n3r/type.any=1}, additionalProperties={}), restartPolicy=null, securityContext=null, startupProbe=Probe(exec=null, failureThreshold=10, grpc=null, httpGet=null, initialDelaySeconds=10, periodSeconds=10, successThreshold=1, tcpSocket=TCPSocketAction(host=null, port=AnyType(value=5601), additionalProperties={}), terminationGracePeriodSeconds=null, timeoutSeconds=3, additionalProperties={}), stdin=null, stdinOnce=null, terminationMessagePath=/dev/termination-log, terminationMessagePolicy=File, tty=null, volumeDevices=[], volumeMounts=[VolumeMount(mountPath=/home1/irteam/apps/opensearch-dashboards/config/opensearch_dashboards.yml, mountPropagation=null, name=config, readOnly=null, subPath=opensearch_dashboards.yml, subPathExpr=null, additionalProperties={})], workingDir=null, additionalProperties={})], dnsConfig=null, dnsPolicy=ClusterFirst, enableServiceLinks=null, 
ephemeralContainers=[], hostAliases=[], hostIPC=null, hostNetwork=null, hostPID=null, hostUsers=null, hostname=null, imagePullSecrets=[], initContainers=[], nodeName=null, nodeSelector={}, os=null, overhead={}, preemptionPolicy=null, priority=null, priorityClassName=null, readinessGates=[], resourceClaims=[], restartPolicy=Always, runtimeClassName=null, schedulerName=default-scheduler, schedulingGates=[], securityContext=PodSecurityContext(fsGroup=null, fsGroupChangePolicy=null, runAsGroup=null, runAsNonRoot=null, runAsUser=null, seLinuxOptions=null, seccompProfile=null, supplementalGroups=[], sysctls=[], windowsOptions=null, additionalProperties={}), serviceAccount=null, serviceAccountName=null, setHostnameAsFQDN=null, shareProcessNamespace=null, subdomain=null, terminationGracePeriodSeconds=30, tolerations=[], topologySpreadConstraints=[], volumes=[Volume(awsElasticBlockStore=null, azureDisk=null, azureFile=null, cephfs=null, cinder=null, configMap=ConfigMapVolumeSource(defaultMode=420, items=[], name=os-young-instance-1st-dashboards-configmap, optional=null, additionalProperties={}), csi=null, downwardAPI=null, emptyDir=null, ephemeral=null, fc=null, flexVolume=null, flocker=null, gcePersistentDisk=null, gitRepo=null, glusterfs=null, hostPath=null, iscsi=null, name=config, nfs=null, persistentVolumeClaim=null, photonPersistentDisk=null, portworxVolume=null, projected=null, quobyte=null, rbd=null, scaleIO=null, secret=null, storageos=null, vsphereVolume=null, additionalProperties={})], additionalProperties={}), additionalProperties={}), additionalProperties={}), status=DeploymentStatus(availableReplicas=1, collisionCount=null, conditions=[DeploymentCondition(lastTransitionTime=2024-03-20T08:42:10Z, lastUpdateTime=2024-03-20T08:42:41Z, message=ReplicaSet "os-young-instance-1st-dashboards-68c98979d6" has successfully progressed., reason=NewReplicaSetAvailable, status=True, type=Progressing, additionalProperties={}), DeploymentCondition(lastTransitionTime=2024-03-21T07:40:54Z, lastUpdateTime=2024-03-21T07:40:54Z, message=Deployment has minimum availability., reason=MinimumReplicasAvailable, status=True, type=Available, additionalProperties={})], observedGeneration=345, readyReplicas=1, replicas=1, unavailableReplicas=null, updatedReplicas=1, additionalProperties={}), additionalProperties={})

I don't know why the resourceVersion is going up; the spec is exactly the same.

I changed it to CSA (client-side apply), but there was no difference.

shawkins added a commit to shawkins/java-operator-sdk that referenced this issue Mar 22, 2024
closes: operator-framework#2290

Signed-off-by: Steven Hawkins <shawkins@redhat.com>
shawkins (Collaborator):

> I don't know why the resourceVersion is going up; the spec is exactly the same.

resourceVersions change for any modification, not just the spec. It's the generation that typically increments with spec changes.

The primary problem is the one captured here: #2249 - if the matcher fails to recognize that the resource is the same, then it adds the new annotation value, which due to a kube bug will increment the Deployment generation, and the cycle repeats.

The SSA logic is not trying to be generation aware in this case, but sees the generation change as something you are requesting in the desired state - which is not correct.

#2309 will address this specific case by ignoring all server managed fields in the comparison - this is similar to what the kubernetes client does in the PatchUtils class when generating a patch.
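For readers following along, the rough idea of stripping server-populated metadata before comparing desired and actual state is sketched below. This is only a loose illustration of the concept; the actual #2309 change works with SSA managed fields and is more involved, and this is not the fabric8 PatchUtils code either.

```java
import io.fabric8.kubernetes.api.model.HasMetadata;

final class ServerManagedFieldPruner {

  // Clear metadata fields that only the API server manages, so that a desired-vs-actual
  // comparison is not tripped up by values the operator never set.
  static <T extends HasMetadata> T pruneServerManagedMetadata(T resource) {
    resource.getMetadata().setManagedFields(null);
    resource.getMetadata().setResourceVersion(null);
    resource.getMetadata().setGeneration(null);
    resource.getMetadata().setUid(null);
    resource.getMetadata().setCreationTimestamp(null);
    resource.getMetadata().setSelfLink(null);
    return resource;
  }

  private ServerManagedFieldPruner() {}
}
```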

@csviri not sure if that is sufficient to fully close #2249 as the doc changes would be good too.

> I changed it to CSA (client-side apply), but there was no difference.

What do you mean?

10000-ki (Author) commented Mar 25, 2024

> The primary problem is the one captured here: #2249 - if the matcher fails to recognize that the resource is the same, then it adds the new annotation value, which due to a kube bug will increment the Deployment generation, and the cycle repeats.
> The SSA logic is not trying to be generation aware in this case, but sees the generation change as something you are requesting in the desired state - which is not correct.
> #2309 will address this specific case by ignoring all server managed fields in the comparison - this is similar to what the kubernetes client does in the PatchUtils class when generating a patch.

@shawkins oh, I see. Thank you for explaining.

> I changed it to CSA (client-side apply), but there was no difference.

https://javaoperatorsdk.io/docs/dependent-resources#comparing-desired-and-actual-state-matching

WARNING: Older versions of Kubernetes before 1.25 would create an additional resource version for every SSA update performed with certain resources - even though there were no actual changes in the stored resource - leading to infinite reconciliations. This behavior was seen with Secrets using stringData, Ingresses using empty string fields, and StatefulSets using volume claim templates. The operator framework has added built-in handling for the StatefulSet issue. If you encounter this issue on an older Kubernetes version, consider changing your desired state, turning off SSA for that resource, or even upgrading your Kubernetes version. If you encounter it on a newer Kubernetes version, please log an issue with the JOSDK and with upstream Kubernetes.

I found this guide and turned SSA off, but it didn't work well.

csviri (Collaborator) commented Mar 25, 2024

> @csviri not sure if that is sufficient to fully close #2249 as the doc changes would be good too.

I'm still thinking of revisiting the other algorithm with event recording, thus providing both, but I will get back to that a bit later.

10000-ki (Author) commented Mar 28, 2024

@shawkins @csviri hello

After I upgraded to v4.8.2:

[Screenshot: 2024-03-29 12:18 AM]

The javaoperatorsdk.io/previous annotation value keeps changing.

[Screenshot: 2024-03-29 12:20 AM]

Is the k8s version bug the reason the resourceVersion keeps increasing?

What's interesting is that this issue does not occur with JOSDK v4.4.4 (below v4.5.0).

csviri (Collaborator) commented Mar 28, 2024

@10000-ki can you create a simple reproducer, please?

(Ideally with one Deployment dependent resource.)

shawkins (Collaborator):

@csviri @10000-ki with the full YAML it's clear that there's another matching issue in play. The SSA matching logic sees the n3r.navercorp.com annotations as part of the desired state, but does not think that they are owned by the operator, so it tries the operation again.

shawkins (Collaborator):

@10000-ki I should have realized from the previous incarnation of this issue that you were creating the desired state on top of the existing state. Are you doing something like new StatefulSetBuilder(existing)... ?

That is why the desired state had the generation set. Here, with the state from the other field managers populated, you see a similar issue.

The SSA matching logic expects you to provide only your managed desired state when creating the entry, and nothing else.

@csviri is it worth guarding against this more?
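To illustrate the point, the hedged sketch below contrasts the anti-pattern (deriving the desired state from the object read back from the cluster) with building it from scratch; class and field values are illustrative only, not taken from the reporter's project.

```java
import io.fabric8.kubernetes.api.model.apps.StatefulSet;
import io.fabric8.kubernetes.api.model.apps.StatefulSetBuilder;

class DesiredStateExamples {

  // Anti-pattern: copying the live object pulls in server-managed metadata (generation,
  // resourceVersion, managedFields) and fields owned by other field managers, so the SSA
  // matcher keeps seeing a difference and re-applies on every reconciliation.
  StatefulSet desiredFromExisting(StatefulSet existing) {
    return new StatefulSetBuilder(existing)
        .editSpec().withReplicas(3).endSpec()
        .build();
  }

  // Preferred: build the desired state from a fresh object containing only the fields
  // this operator manages; the server works out ownership of everything else.
  StatefulSet desiredFromScratch(String name, String namespace) {
    return new StatefulSetBuilder()
        .withNewMetadata().withName(name).withNamespace(namespace).endMetadata()
        .withNewSpec().withReplicas(3).endSpec()
        .build();
  }
}
```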

10000-ki (Author) commented Mar 29, 2024

@shawkins

> with the full YAML it's clear that there's another matching issue in play. The SSA matching logic sees the n3r.navercorp.com annotations as part of the desired state, but does not think that they are owned by the operator, so it tries the operation again.

The problem is the same even if we remove the n3r.navercorp.com annotations.

[Screenshot: 2024-03-29 3:11 PM]

When I change the previousAnnotationForDependentResources setting to false, the problem is finally solved.

shawkins (Collaborator):

> The problem is the same even if we remove the n3r.navercorp.com annotations.

Please don't just remove the annotations; construct your desired state starting with a fresh object and don't copy anything from an object obtained from the server. The point of SSA is that the server will figure out which manager is managing which fields.

If the problem still persists after that, it would be good to determine where the matching is failing. At the debug level, you should see logs like:

2024-03-29 06:59:21,095 1 i.j.o.p.d.k.SSABasedGenericKubernetesResourceMatcher [DEBUG] Pruned actual:
...

That will show what the matcher thinks your desired state is vs. what is on the API server. Any discrepancy between the two will cause another reconciliation. We can determine from there whether there is an SSA / Kube issue in play (that requires a workaround in the operator SDK) or whether it's a problem with the matching logic.

> When I change the previousAnnotationForDependentResources setting to false

That is expected. This optimization is based on the matching logic working as intended. As mentioned on #2249, it is probably fine to just default this to false and allow users to opt in if they want greater performance.
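A small aside on the debugging approach above: if the "Pruned actual:" output is hard to read, one option is to serialize both objects to YAML and compare them with a regular diff tool. This is just a hedged helper sketch using fabric8's Serialization utility; it is not part of JOSDK.

```java
import io.fabric8.kubernetes.api.model.HasMetadata;
import io.fabric8.kubernetes.client.utils.Serialization;

final class DesiredVsActualDump {

  // Print both states as YAML so they can be compared with a diff tool; any field
  // present in one but not the other is what keeps the matcher retrying.
  static void dump(HasMetadata desired, HasMetadata actual) {
    System.out.println("--- desired ---");
    System.out.println(Serialization.asYaml(desired));
    System.out.println("--- actual (as returned by the informer cache) ---");
    System.out.println(Serialization.asYaml(actual));
  }

  private DesiredVsActualDump() {}
}
```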
