
Fix for a failing Kafka operator job #595

Closed
wants to merge 1 commit from the fix/fabric8 branch

Conversation

fedinskiy (Contributor)

Summary

Please check the relevant options

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Dependency update
  • Refactoring
  • Breaking change (fix or feature that would cause existing functionality to change)
  • This change requires a documentation update
  • This change requires execution against OCP (use run tests phrase in comment)

Checklist:

  • Example scenarios have been updated / added
  • Methods and classes used in PR scenarios are meaningful
  • Commits are well encapsulated and follow the best practices

pjgg (Contributor) commented Oct 13, 2022

run tests

pjgg (Contributor) left a comment

Minor changes or clarifications

@@ -41,6 +42,9 @@
private final Map<String, String> properties = new HashMap<>();
private final List<Runnable> futureProperties = new LinkedList<>();

//todo workaround for https://github.com/fabric8io/kubernetes-client/issues/4491
private final AtomicBoolean isStopped = new AtomicBoolean(false);
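
For context, a minimal sketch of how such a flag can make stop() idempotent. The class and method body below are illustrative stand-ins; only the field itself appears in the diff.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Simplified stand-in for BaseService from the test framework.
    class SimplifiedService {

        // todo workaround for https://github.com/fabric8io/kubernetes-client/issues/4491
        private final AtomicBoolean isStopped = new AtomicBoolean(false);

        public void stop() {
            // compareAndSet flips the flag exactly once; any later call
            // (e.g. the JUnit afterAll hook after the service already closed itself) is a no-op
            if (!isStopped.compareAndSet(false, true)) {
                return;
            }
            // ... actual teardown would happen here ...
        }
    }
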
pjgg (Contributor):

Did you see any concurrency issue that justifies the use of an atomic variable?

fedinskiy (Contributor, Author) commented Oct 13, 2022:

Not at the moment, but given the asynchronous nature of Quarkus and the black-box nature of JUnit, it is better to be safe than sorry.

pjgg (Contributor):

I prefer not to add code that is not 100% required. Please make sure that we are actually dealing with concurrency issues before proceeding with this request.

fedinskiy (Contributor, Author):

We definitely need to mark the job as already stopped, so this code is 100% required. I would rather guarantee that this protects us from race conditions than investigate it later.

fedinskiy (Contributor, Author):

On second thought, in this case a volatile field and a synchronized method are what we should use.
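
A rough sketch of that alternative, for comparison (hypothetical, not code from this PR): a volatile flag for lock-free reads, with the state transition guarded by a synchronized stop():

    // Hypothetical variant: volatile field plus synchronized writer.
    class ServiceWithVolatileFlag {

        // volatile makes the latest value of the flag visible to all threads without locking
        private volatile boolean stopped = false;

        public synchronized void stop() {
            if (stopped) {
                return;
            }
            stopped = true;
            // ... teardown ...
        }

        public boolean isRunning() {
            // plain read of a volatile field, no lock needed
            return !stopped;
        }
    }
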

pjgg (Contributor) commented Oct 13, 2022:

We definitely need to mark the job as already stopped, so this code is 100% required

  • agree!

I would rather guarantee that this protects us from race conditions than investigate it later.

  • I am not 100% sure about the "what if" programming, to be honest. I see your point, but I would prefer to land this PR without the synchronization stuff and, in a separate PR, find a real use case where we can actually see a race condition and then fix it. Maybe this problem exists in other methods too, like start, close, etc., and maybe we should also double-check the child classes that currently extend BaseService. In my opinion, the synchronization part is not part of this PR.
    WDYT? @fedinskiy

pjgg added the triage/backport-1.2? (Backport to 1.2 branch may be required) label on Oct 13, 2022
fedinskiy requested a review from pjgg on October 13, 2022 12:19
fedinskiy force-pushed the fix/fabric8 branch 2 times, most recently from 9782ec4 to 7971990, on October 13, 2022 13:59
rsvoboda requested a review from mjurc on October 14, 2022 06:51
mjurc (Member) left a comment:

I agree with @pjgg about being careful with changing the behavior of a base class that 22 classes extend, and about the concurrency issues.

The PR proposes to change BaseService, which is inherited by 22 classes, because the operator-managed Kafka service fails with an io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Kafka] with name: [kafka-instance] in namespace: [ts-qgrutdkgws] failed exception. We can see where the issue comes from in the stack trace:

	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:159)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.getMandatory(BaseOperation.java:177)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:139)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:88)
	at io.quarkus.test.bootstrap.inject.OpenShiftClient.isCustomResourceReady(OpenShiftClient.java:503)
	at io.quarkus.test.services.operator.OperatorManagedResource.lambda$customResourcesAreReady$0(OperatorManagedResource.java:87)

This points us to OpenShiftClient.isCustomResourceReady. That method is what should be changed for now to be more robust, and with the changes in fabric8 client 6, I presume we are now getting an exception from the get call rather than the null value returned by fabric8 client 5, which the isCustomResourceReady method handles.

So, we should change the linked failure spots.
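
A hedged sketch of that kind of change (the real method is io.quarkus.test.bootstrap.inject.OpenShiftClient.isCustomResourceReady; the helper class and its lookup parameter below are illustrative assumptions): treat the KubernetesClientException as "not ready yet" instead of letting it propagate.

    import io.fabric8.kubernetes.client.KubernetesClientException;
    import java.util.function.Supplier;

    final class CustomResourceReadiness {

        // lookup stands in for the fabric8 call that fetches the custom resource
        static boolean isReady(Supplier<Object> lookup) {
            try {
                Object resource = lookup.get();
                // the real method would also inspect the resource's status conditions here
                return resource != null;
            } catch (KubernetesClientException e) {
                // fabric8 client 6 throws here where client 5 returned null:
                // report "not ready" and let the caller's retry loop try again
                return false;
            }
        }
    }
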

The other big issue with this proposal is semantics. We cannot have both isRunning and isStopped booleans without atomically ensuring they are in a consistent state with each other, because that is going to lead to some really ugly expectations and some seriously weird state machines for those two states.

fedinskiy (Contributor, Author):

@mjurc the issue points us there, but the issue is a symptom, not a cause. There is a linked issue [1] in the comments, and according to the discussion in that thread, this issue is caused by an attempt to stop the service twice (once as part of the JUnit afterAll routine and once as part of the closing routine). I guess you can agree that closing a service twice is not normal behaviour for any of the 22 different types of service we have. Also, I would like to point out that this change does not add concurrency issues, but prevents them (even if we do not have them right now).

I agree with the point about the state machines. What do you think about moving the isStopped check into the isRunning method (and synchronizing the isRunning method as well)?

[1] fabric8io/kubernetes-client#4491

mjurc (Member) commented Oct 17, 2022:

@fedinskiy that still makes isStopped redundant, doesn't it? I mean, it has to be the negation of isRunning at all times.

isRunning for OperatorManagedResource checks whether the CRD exists in the namespace. This is where the exception is coming from.

This does not happen in any other resource for a very specific reason: OperatorManagedResource doesn't do any cleanup by itself; it relies on the cleanup of ephemeral namespaces. Deleting operators is fairly complicated and slow:

  1. delete all resources that the CSV holds
  2. delete the subscription
  3. delete the CSV
  4. delete the CRDs

Those steps are slow, and discovery of the resources used by CSVs is not trivial. All in all, this is why OperatorManagedResource is only enabled with ephemeral namespaces.

fedinskiy (Contributor, Author):

@mjurc we discussed this in person and decided to let you think about it. The current version was updated to have the isStopped check inside the isRunning method.
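
Roughly, the updated shape would look like the following simplified sketch (assumed; the real isRunning in BaseService does more than the placeholder check shown here):

    import java.util.concurrent.atomic.AtomicBoolean;

    class ServiceWithStoppedCheck {

        private final AtomicBoolean isStopped = new AtomicBoolean(false);

        public void stop() {
            isStopped.set(true);
            // ... teardown ...
        }

        public boolean isRunning() {
            // a stopped service is never reported as running, so callers
            // do not query the cluster for resources that are already gone
            if (isStopped.get()) {
                return false;
            }
            // ... the real readiness/liveness check would go here ...
            return true;
        }
    }
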

michalvavrik (Contributor):

I can't understand why #isRunning, which doesn't make any write operation, should be synchronized. If you are afraid that there is an expensive remote call, then synchronized won't help there, and you already have #stop synchronized. Please elaborate.

mjurc closed this on Oct 30, 2022
mjurc (Member) commented Oct 30, 2022:

Superseded by #608

pjgg removed the triage/backport-1.2? (Backport to 1.2 branch may be required) label on Jan 30, 2023