
Fix for a failing Kafka operator job #595

Closed
wants to merge 1 commit from the fix/fabric8 branch

Conversation

fedinskiy (Contributor)

Summary

Please check the relevant options

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Dependency update
  • Refactoring
  • Breaking change (fix or feature that would cause existing functionality to change)
  • This change requires a documentation update
  • This change requires execution against OCP (use run tests phrase in comment)

Checklist:

  • Example scenarios have been updated / added
  • Methods and classes used in PR scenarios are meaningful
  • Commits are well encapsulated and follow the best practices

pjgg (Contributor) commented Oct 13, 2022

run tests

pjgg (Contributor) left a comment

Minor changes or clarifications

@@ -41,6 +42,9 @@
private final Map<String, String> properties = new HashMap<>();
private final List<Runnable> futureProperties = new LinkedList<>();

//todo workaround for https://github.com/fabric8io/kubernetes-client/issues/4491
private final AtomicBoolean isStopped = new AtomicBoolean(false);
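
For context, a minimal sketch of how such a flag can make stop() idempotent. The class and method body below are illustrative stand-ins; only the field itself appears in the diff.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Simplified stand-in for BaseService from the test framework.
    class SimplifiedService {

        // todo workaround for https://github.com/fabric8io/kubernetes-client/issues/4491
        private final AtomicBoolean isStopped = new AtomicBoolean(false);

        public void stop() {
            // compareAndSet flips the flag exactly once; any later call
            // (e.g. the JUnit afterAll hook after the service already closed itself) is a no-op
            if (!isStopped.compareAndSet(false, true)) {
                return;
            }
            // ... actual teardown would happen here ...
        }
    }
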
pjgg (Contributor):

Did you see any concurrency issue that justifies the use of an atomic variable?

fedinskiy (Contributor, Author) commented Oct 13, 2022:

Not at the moment, but given the asynchronous nature of Quarkus and the black-box nature of JUnit, it is better to be safe than sorry.

pjgg (Contributor):

I prefer not to add code that is not 100% required. Please make sure that we are actually dealing with concurrency issues before proceeding with this request.

fedinskiy (Contributor, Author):

We definitely need to mark the job as already stopped, so this code is 100% required. I would rather guarantee that this protects us from race conditions than investigate it later.

fedinskiy (Contributor, Author):

On second thought, in this case a volatile field and a synchronized method are what we should use.
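
A rough sketch of that alternative, for comparison (hypothetical, not code from this PR): a volatile flag for lock-free reads, with the state transition guarded by a synchronized stop():

    // Hypothetical variant: volatile field plus synchronized writer.
    class ServiceWithVolatileFlag {

        // volatile makes the latest value of the flag visible to all threads without locking
        private volatile boolean stopped = false;

        public synchronized void stop() {
            if (stopped) {
                return;
            }
            stopped = true;
            // ... teardown ...
        }

        public boolean isRunning() {
            // plain read of a volatile field, no lock needed
            return !stopped;
        }
    }
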

pjgg (Contributor) commented Oct 13, 2022:

We definitely need to mark the job as already stopped, so this code is 100% required

  • agree!

I would rather guarantee that this protects us from race conditions than investigate it later.

  • I am not 100% sure about the "what if" programming, to be honest. I see your point, but I would prefer to land this PR without the synchronization stuff and, in a separate PR, find a real use case where we can actually see a race condition and then fix it. Maybe this problem exists in other methods too, like start, close, etc., and maybe we should also double-check the child classes that currently extend BaseService. In my opinion, the synchronization part is not part of this PR.
    WDYT? @fedinskiy

pjgg added the triage/backport-1.2? (Backport to 1.2 branch may be required) label on Oct 13, 2022
fedinskiy requested a review from pjgg on October 13, 2022 12:19
fedinskiy force-pushed the fix/fabric8 branch 2 times, most recently from 9782ec4 to 7971990, on October 13, 2022 13:59
rsvoboda requested a review from mjurc on October 14, 2022 06:51
mjurc (Member) left a comment:

I agree with @pjgg about being careful with changing the behavior of a base class that 22 classes extend, and about the concurrency issues.

The PR proposes to change BaseService, which is inherited by 22 classes, because the operator-managed Kafka service fails with an io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Kafka] with name: [kafka-instance] in namespace: [ts-qgrutdkgws] failed exception. We can see where the issue comes from in the stack trace:

	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:159)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.getMandatory(BaseOperation.java:177)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:139)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:88)
	at io.quarkus.test.bootstrap.inject.OpenShiftClient.isCustomResourceReady(OpenShiftClient.java:503)
	at io.quarkus.test.services.operator.OperatorManagedResource.lambda$customResourcesAreReady$0(OperatorManagedResource.java:87)

This points us to OpenShiftClient.isCustomResourceReady. That method is what should be changed for now to be more robust, and with the changes in fabric8 client 6, I presume we are now getting an exception from the get call rather than the null value returned by fabric8 client 5, which the isCustomResourceReady method handles.

So, we should change the linked failure spots.
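
A hedged sketch of that kind of change (the real method is io.quarkus.test.bootstrap.inject.OpenShiftClient.isCustomResourceReady; the helper class and its lookup parameter below are illustrative assumptions): treat the KubernetesClientException as "not ready yet" instead of letting it propagate.

    import io.fabric8.kubernetes.client.KubernetesClientException;
    import java.util.function.Supplier;

    final class CustomResourceReadiness {

        // lookup stands in for the fabric8 call that fetches the custom resource
        static boolean isReady(Supplier<Object> lookup) {
            try {
                Object resource = lookup.get();
                // the real method would also inspect the resource's status conditions here
                return resource != null;
            } catch (KubernetesClientException e) {
                // fabric8 client 6 throws here where client 5 returned null:
                // report "not ready" and let the caller's retry loop try again
                return false;
            }
        }
    }
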

The other big issue with this proposal is semantics. We cannot have both isRunning and isStopped booleans without atomically ensuring they are in a consistent state with each other, because that is going to lead to some really ugly expectations and some seriously weird state machines for those two states.

fedinskiy (Contributor, Author):

@mjurc the issue points us there, but the issue is a symptom, not a cause. There is a linked issue [1] in the comments, and according to the discussion in that thread, this issue is caused by an attempt to stop the service twice (once as part of the JUnit afterAll routine and once as part of the closing routine). I guess you can agree that closing a service twice is not normal behaviour for any of the 22 different types of service we have. Also, I would like to point out that this change does not add concurrency issues, but prevents them (even if we do not have them right now).

I agree with the point about the state machines. What do you think about moving the isStopped check into the isRunning method (and synchronizing the isRunning method as well)?

[1] fabric8io/kubernetes-client#4491

mjurc (Member) commented Oct 17, 2022:

@fedinskiy that still makes isStopped redundant, doesn't it? I mean, it has to be the negation of isRunning at all times.

isRunning for OperatorManagedResource checks whether the CRD exists in the namespace. This is where the exception is coming from.

This does not happen in any other resource for a very specific reason: OperatorManagedResource doesn't do any cleanup by itself; it relies on the cleanup of ephemeral namespaces. Deleting operators is fairly complicated and slow:

  1. delete all resources that the CSV holds
  2. delete the subscription
  3. delete the CSV
  4. delete the CRDs

Those steps are slow, and discovery of the resources used by CSVs is not trivial. All in all, this is why OperatorManagedResource is only enabled with ephemeral namespaces.

fedinskiy (Contributor, Author):

@mjurc we discussed this in person and decided to let you think about it. The current version was updated to have the isStopped check inside the isRunning method.
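
Roughly, the updated shape would look like the following simplified sketch (assumed; the real isRunning in BaseService does more than the placeholder check shown here):

    import java.util.concurrent.atomic.AtomicBoolean;

    class ServiceWithStoppedCheck {

        private final AtomicBoolean isStopped = new AtomicBoolean(false);

        public void stop() {
            isStopped.set(true);
            // ... teardown ...
        }

        public boolean isRunning() {
            // a stopped service is never reported as running, so callers
            // do not query the cluster for resources that are already gone
            if (isStopped.get()) {
                return false;
            }
            // ... the real readiness/liveness check would go here ...
            return true;
        }
    }
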

michalvavrik (Contributor):

I can't understand why #isRunning, which doesn't make any write operation, should be synchronized. If you are afraid that there is an expensive remote call, then synchronized won't help there, and you already have #stop synchronized. Please elaborate.

mjurc closed this on Oct 30, 2022
mjurc (Member) commented Oct 30, 2022:

Superseded by #608

pjgg removed the triage/backport-1.2? (Backport to 1.2 branch may be required) label on Jan 30, 2023