
LeaderElection related bug on version 4.0.3 #1585

Closed
hoony9x opened this issue Oct 31, 2022 · 11 comments
hoony9x commented Oct 31, 2022

Bug Report

What did you do?

I just tried to stop a running operator.

  • It is integrated with Spring Boot 2.7.3
  • Version of java-operator-sdk is 4.0.3.
  • It has a leader election configuration like below.
val operator = Operator(kubernetesClient) {
    it.withLeaderElectionConfiguration(
        LeaderElectionConfiguration(
            LEASE_NAME,      // name of the coordination.k8s.io Lease used as the lock
            POD_NAMESPACE,   // namespace the Lease lives in
            POD_NAME,        // identity of this operator instance
        ),
    )
}
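
For context, the constants above are not shown in the report; a typical way to wire them would be something like the following (a sketch - the environment variable names and defaults are assumptions, e.g. injected via the Downward API, not taken from the reporter's code):

// Hypothetical wiring for the constants used above; env var names are assumptions.
val LEASE_NAME = "sample-operator-leader-election"              // Lease name (matches the log)
val POD_NAMESPACE = System.getenv("POD_NAMESPACE") ?: "default" // namespace the Lease is created in
val POD_NAME = System.getenv("HOSTNAME") ?: "local"             // identity of this operator instance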

What did you expect to see?

It should shut down without any error message, like the logs below. (The logs below were generated with Operator SDK 3.2.3.)

2022-10-31 17:06:51.409  INFO 32006 --- [ionShutdownHook] io.javaoperatorsdk.operator.Operator     : Operator SDK 3.2.3 is shutting down...
2022-10-31 17:06:51.409  INFO 32006 --- [ionShutdownHook] i.j.o.p.e.s.informer.InformerManager     : Stopping informer io.javaoperatorsdk.operator.processing.event.source.informer.InformerManager@3878a961 -> io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper@718e157f
2022-10-31 17:06:51.410  INFO 32006 --- [ionShutdownHook] i.j.o.p.e.s.informer.InformerManager     : Stopping informer io.javaoperatorsdk.operator.processing.event.source.informer.InformerManager@6891ce7c -> io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper@5381736d
2022-10-31 17:06:51.410  INFO 32006 --- [ionShutdownHook] i.j.o.p.e.s.informer.InformerManager     : Stopping informer io.javaoperatorsdk.operator.processing.event.source.informer.InformerManager@1fc3b1b9 -> io.javaoperatorsdk.operator.processing.event.source.informer.InformerWrapper@768877d5

What did you see instead? Under which circumstances?

The operator framework throws an "Unable to update LeaseLock" error, as shown below.

2022-10-31 16:51:55.476  INFO 30444 --- [ionShutdownHook] io.javaoperatorsdk.operator.Operator     : Operator SDK 4.0.3 is shutting down...
2022-10-31 16:51:55.476  INFO 30444 --- [ionShutdownHook] i.j.o.p.e.s.informer.InformerManager     : Stopping InformerManager [samples.sample.navercorp.com/v1alpha1] watching: [JOSDK_ALL_NAMESPACES] selector: 
2022-10-31 16:51:55.478  INFO 30444 --- [ionShutdownHook] i.j.o.p.e.s.informer.InformerManager     : Stopping InformerManager [configmaps/v1] watching: [JOSDK_ALL_NAMESPACES] selector: app.kubernetes.io/managed-by=sample-operator-20220929-181845
2022-10-31 16:51:55.478  INFO 30444 --- [ionShutdownHook] i.j.o.p.e.s.informer.InformerManager     : Stopping InformerManager [statefulsets.apps/v1] watching: [JOSDK_ALL_NAMESPACES] selector: app.kubernetes.io/managed-by=sample-operator-20220929-181845
2022-10-31 16:51:55.758 ERROR 30444 --- [ionShutdownHook] i.f.k.c.e.leaderelection.LeaderElector   : Exception occurred while releasing lock 'LeaseLock: sample-operator - sample-operator-leader-election (local)'

io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: Unable to update LeaseLock
	at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock.update(LeaseLock.java:102) ~[kubernetes-client-api-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.release(LeaderElector.java:139) ~[kubernetes-client-api-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.stopLeading(LeaderElector.java:120) ~[kubernetes-client-api-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$null$1(LeaderElector.java:94) ~[kubernetes-client-api-6.2.0.jar:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2480) ~[na:na]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$null$0(LeaderElector.java:92) ~[kubernetes-client-api-6.2.0.jar:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2480) ~[na:na]
	at io.javaoperatorsdk.operator.LeaderElectionManager.stop(LeaderElectionManager.java:98) ~[operator-framework-core-4.0.3.jar:na]
	at io.javaoperatorsdk.operator.Operator.stop(Operator.java:118) ~[operator-framework-core-4.0.3.jar:na]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) ~[na:na]
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
	at java.base/java.lang.reflect.Method.invoke(Method.java:568) ~[na:na]
	at org.springframework.beans.factory.support.DisposableBeanAdapter.invokeCustomDestroyMethod(DisposableBeanAdapter.java:319) ~[spring-beans-5.3.21.jar:5.3.21]
	at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:253) ~[spring-beans-5.3.21.jar:5.3.21]
	at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:587) ~[spring-beans-5.3.21.jar:5.3.21]
	at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:559) ~[spring-beans-5.3.21.jar:5.3.21]
	at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(DefaultListableBeanFactory.java:1163) ~[spring-beans-5.3.21.jar:5.3.21]
	at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:520) ~[spring-beans-5.3.21.jar:5.3.21]
	at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons(DefaultListableBeanFactory.java:1156) ~[spring-beans-5.3.21.jar:5.3.21]
	at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1106) ~[spring-context-5.3.21.jar:5.3.21]
	at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1075) ~[spring-context-5.3.21.jar:5.3.21]
	at org.springframework.boot.web.reactive.context.ReactiveWebServerApplicationContext.doClose(ReactiveWebServerApplicationContext.java:147) ~[spring-boot-2.6.9.jar:2.6.9]
	at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:1021) ~[spring-context-5.3.21.jar:5.3.21]
	at org.springframework.boot.SpringApplicationShutdownHook.closeAndWait(SpringApplicationShutdownHook.java:145) ~[spring-boot-2.6.9.jar:2.6.9]
	at java.base/java.lang.Iterable.forEach(Iterable.java:75) ~[na:na]
	at org.springframework.boot.SpringApplicationShutdownHook.run(SpringApplicationShutdownHook.java:114) ~[spring-boot-2.6.9.jar:2.6.9]
	at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.105.203.43:6443/apis/coordination.k8s.io/v1/namespaces/sample-operator/leases/sample-operator-leader-election. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "sample-operator-leader-election": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=sample-operator-leader-election, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "sample-operator-leader-election": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238) ~[kubernetes-client-api-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:517) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:551) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleUpdate(OperationSupport.java:347) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleUpdate(BaseOperation.java:680) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:167) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:172) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:113) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:41) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.replace(BaseOperation.java:1043) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.replace(BaseOperation.java:88) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock.update(LeaseLock.java:100) ~[kubernetes-client-api-6.2.0.jar:na]
	... 33 common frames omitted
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.105.203.43:6443/apis/coordination.k8s.io/v1/namespaces/sample-operator/leases/sample-operator-leader-election. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "sample-operator-leader-election": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=sample-operator-leader-election, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "sample-operator-leader-election": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:709) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:689) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:640) ~[kubernetes-client-6.2.0.jar:na]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:576) ~[kubernetes-client-6.2.0.jar:na]
	at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[na:na]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$retryWithExponentialBackoff$2(OperationSupport.java:618) ~[kubernetes-client-6.2.0.jar:na]
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[na:na]
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[na:na]
	at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$4.onResponse(OkHttpClientImpl.java:277) ~[kubernetes-httpclient-okhttp-6.2.0.jar:na]
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174) ~[okhttp-3.14.9.jar:na]
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) ~[okhttp-3.14.9.jar:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
	... 1 common frames omitted

2022-10-31 16:51:55.758  INFO 30444 --- [ionShutdownHook] i.j.operator.LeaderElectionManager       : Stopped leading for identity: local. Exiting.

In the above state, the operator is still in the STOPPING state.

  • It is not completely stopped.
  • Therefore, the operator Pod can't be restarted immediately.

Environment

Kubernetes cluster type:

java-operator-sdk version (from pom.xml):

  • 4.0.3

$ java -version

openjdk version "17.0.5" 2022-10-18
OpenJDK Runtime Environment Temurin-17.0.5+8 (build 17.0.5+8)
OpenJDK 64-Bit Server VM Temurin-17.0.5+8 (build 17.0.5+8, mixed mode)

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.1", GitCommit:"e4d4e1ab7cf1bf15273ef97303551b279f0920a9", GitTreeState:"clean", BuildDate:"2022-09-14T19:49:27Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.1", GitCommit:"e4d4e1ab7cf1bf15273ef97303551b279f0920a9", GitTreeState:"clean", BuildDate:"2022-09-14T19:42:30Z", GoVersion:"go1.19.1", Compiler:"gc", Platform:"linux/amd64"}

Possible Solution

Additional context

csviri (Collaborator) commented Oct 31, 2022

cc @shawkins - this looks more like an issue with the leader election in the client; are you aware of such an issue?
It looks like a simple optimistic locking error from the client.

@csviri csviri added this to the 4.2 milestone Oct 31, 2022
shawkins (Collaborator) commented:

@csviri is there just a single elector? Unfortunately it seems possible for this to occur in that case as there's no explicit locking - just a volatile variable to track the latest record. So a cancel could occur in between an update operation and setting the latest record.
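
For illustration, here is a minimal, self-contained sketch of that race; the types are stand-ins for the client's record/lock classes, not the actual fabric8 LeaderElector API. The renewal updates the lock on the server first and only then stores the new record in the volatile field, so a cancel that triggers release in between still builds its update from the stale record:

// Simplified illustration of the race described above; Record/Elector are stand-ins.
data class Record(val holderIdentity: String?, val resourceVersion: Long)

class Elector(private val putToServer: (Record) -> Record) {
    @Volatile
    private var latestRecord: Record = Record(null, 0)

    fun renew(identity: String) {
        // The PUT succeeds on the server and bumps resourceVersion...
        val updated = putToServer(latestRecord.copy(holderIdentity = identity))
        // <-- a cancel/release running at this point still sees the stale latestRecord
        latestRecord = updated
    }

    fun release() {
        // Built from a possibly stale record -> the API server answers 409 Conflict
        putToServer(latestRecord.copy(holderIdentity = null))
    }
}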

csviri (Collaborator) commented Oct 31, 2022

@shawkins thx, I see. So what is happening is that the record is not read again on cancel, so there is an optimistic locking issue if someone changed the resource in the meantime. I have some questions:

  1. @hoony9x what do you mean by "it is not completely stopped"? If I understand correctly, based on the client:
    https://github.com/fabric8io/kubernetes-client/blob/26b8ce7eec3311d9a4a296ebaf8b3421347856fc/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L122-L125
    the stop callback should be made, and on the JOSDK side:
    https://github.com/java-operator-sdk/java-operator-sdk/blob/fb9dc1aeef2a5b0c9faf69286d39f16409f332f7/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/LeaderElectionManager.java#L74-L80
    it should stop the process / pod (see the sketch of this shutdown path after this list). Do you mean the operator is still running? Is the pod still running, or is the lock not removed?

  2. @shawkins would it make sense to refresh the Lock before cancel?

  3. @shawkins if there are two instances, should the other one not do updates on the lock, or is it expected that the other - non-leader - instance does some updates on the lock?
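
The shutdown path referenced in question 1 can be read off the stack trace in the report; paraphrased as a sketch (not the actual JOSDK/fabric8 sources):

import java.util.concurrent.CompletableFuture

// Paraphrased from the stack trace in the report, not the actual library sources.
fun stopLeaderElection(leaderFuture: CompletableFuture<Void>) {
    // LeaderElectionManager.stop cancels the elector's future...
    leaderFuture.cancel(true)
    // ...cancellation triggers LeaderElector.stopLeading -> release(), which PUTs
    // the Lease using the last cached record; a stale resourceVersion produces the
    // 409 Conflict / LockException in the log. The JOSDK stop callback then logs
    // "Stopped leading for identity: ... Exiting." and calls System.exit(1).
}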

hoony9x (Author) commented Oct 31, 2022

@csviri I mean, the operator "looks like" it has stopped, but the process is not terminated.

shawkins (Collaborator) commented:

@shawkins would it make sense to refresh the Lock before cancel?

No, the leader election logic just needs some synchronization.

@shawkins if there are two instances, should the other one not do updates on the lock, or is it expected that the other - non-leader - instance does some updates on the lock?

It is possible that someone else assumed leadership by the time this call is being made. The exception would be the same, but it wouldn't matter as the leadership is already lost.
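
To make that concrete, an application-side wrapper could treat a 409 during release as "leadership already gone" rather than a failure - a hypothetical sketch, not something the SDK or client does out of the box:

import io.fabric8.kubernetes.client.KubernetesClientException

// Hypothetical helper: swallow only the optimistic-locking conflict during release.
fun releaseIgnoringConflict(release: () -> Unit) {
    try {
        release()
    } catch (e: KubernetesClientException) {
        if (e.code != 409) throw e
        // 409 Conflict: the Lease was modified (or taken over) in the meantime,
        // so there is effectively nothing left for this instance to release.
    }
}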

shawkins (Collaborator) commented:

I mean, the operator "looks like" it has stopped, but the process is not terminated.

@hoony9x are you able to take a thread dump and see what non-daemon threads are still running?
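
If attaching jstack output is not convenient, a quick in-process alternative (a generic JVM snippet, not JOSDK-specific) is to list the live non-daemon threads directly:

// Generic JDK snippet: print every live non-daemon thread with its current stack.
fun dumpNonDaemonThreads() {
    Thread.getAllStackTraces()
        .filterKeys { it.isAlive && !it.isDaemon }
        .forEach { (thread, stack) ->
            println("${thread.name} (state=${thread.state})")
            stack.forEach { frame -> println("    at $frame") }
        }
}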

csviri (Collaborator) commented Oct 31, 2022

@csviri I mean, the operator "looks like" it has stopped, but the process is not terminated.

That is strange, as you can see here:
https://github.com/java-operator-sdk/java-operator-sdk/blob/fb9dc1aeef2a5b0c9faf69286d39f16409f332f7/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/LeaderElectionManager.java#L74-L80

System.exit(1) is called right after this log message.
@hoony9x could you please inspect the container status of the pod and see what is happening there on the pod level?

csviri (Collaborator) commented Oct 31, 2022

No, the leader election logic just needs some synchronization.

By refresh I meant read the latest version.
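
For what that could look like: re-reading the Lease just before the release update would make the PUT carry the current resourceVersion. A minimal sketch with the fabric8 client (a hypothetical helper, not the client's built-in behaviour):

import io.fabric8.kubernetes.api.model.coordination.v1.Lease
import io.fabric8.kubernetes.client.KubernetesClient

// Hypothetical "refresh then update" helper: fetch the latest Lease so the
// subsequent replace carries the current resourceVersion instead of a cached one.
fun refreshAndUpdate(
    client: KubernetesClient,
    namespace: String,
    name: String,
    mutate: (Lease) -> Lease,
) {
    val latest = client.resources(Lease::class.java)
        .inNamespace(namespace)
        .withName(name)
        .get() ?: return                 // nothing to update if the Lease is gone
    client.resource(mutate(latest)).replace()
}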

hoony9x (Author) commented Nov 1, 2022

@shawkins @csviri

I tested it again on my private Kubernetes cluster, and the Pod is terminated without delay.

  • The previous result was my mistake.

The error message I already described above still shows up when I try to stop the operator.

csviri (Collaborator) commented Nov 1, 2022

The error message I already described above still shows up when I try to stop the operator.

As @shawkins explained, that actually might be expected; also see an improvement on this:

fabric8io/kubernetes-client#4548

csviri (Collaborator) commented Nov 30, 2022

Since there has already been an improvement on this, and the mentioned error is expected in some situations, I will close this issue and reopen it in case there are still some issues.

@csviri csviri closed this as completed Nov 30, 2022
@csviri csviri self-assigned this Dec 1, 2022