Fix for a failing Kafka operator job #595
4f863a8 to 4f536cc
run tests
Minor changes or clarifications
quarkus-test-core/src/main/java/io/quarkus/test/bootstrap/BaseService.java
@@ -41,6 +42,9 @@
    private final Map<String, String> properties = new HashMap<>();
    private final List<Runnable> futureProperties = new LinkedList<>();

    //todo workaround for https://github.com/fabric8io/kubernetes-client/issues/4491
    private final AtomicBoolean isStopped = new AtomicBoolean(false);
Did you see any concurrency issue that justifies the use of an atomic variable?
Not at the moment, but given the asynchronous nature of Quarkus and the black box that is JUnit, it is better to be safe than sorry.
I prefer not to add code that is not 100% required. Please make sure that we are actually dealing with concurrency issues before proceeding with this request.
We definitely need to mark the job as already stopped, so this code is 100% required. I would rather guarantee that this protects us from race conditions than investigate them later.
On second thought, in this case a volatile field and a synchronized method are what we should use.
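For readers following the thread, the volatile-field-plus-synchronized-method idea could look roughly like the sketch below. This is only an illustration under assumptions: `StoppableService` and `releaseCount` are hypothetical names, and the counter merely stands in for whatever cleanup the real `BaseService.stop()` performs.

```java
// Hypothetical sketch; names are illustrative and not taken from BaseService.
class StoppableService {
    // volatile: readers of isStopped() see the latest value without locking
    private volatile boolean stopped = false;
    // counts how many times the (stand-in) cleanup actually ran
    private int releaseCount = 0;

    // synchronized: the check of `stopped` and the update happen as one step
    public synchronized void stop() {
        if (stopped) {
            return; // a second call is a no-op
        }
        stopped = true;
        releaseCount++; // stands in for releasing the underlying resource
    }

    public boolean isStopped() {
        return stopped;
    }

    public synchronized int releaseCount() {
        return releaseCount;
    }
}
```

With this shape, calling `stop()` twice runs the cleanup only once, which is the behaviour the workaround is after.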
We definitely need to mark the job as already stopped, so this code is 100% required
- agree!
I prefer to guarantee, that this protects us from race conditions, than investigate it later.
- To be honest, I am not 100% sure about "what if" programming. I see your point, but I would prefer to make this PR without the synchronization stuff and, in a different PR, try to find a real use case where we can see a race condition and then fix it. Maybe this problem exists in other methods too (start, close, etc.), and maybe we should also double-check the child classes that currently extend BaseService. In my opinion the synchronization part does not belong in this PR.
WDYT? @fedinskiy
4f536cc to e8d2c16
9782ec4 to 7971990
I agree with @pjgg about being careful with changing behavior of base class that 22 classes extend, and about concurrency issues.
The PR proposes to change BaseService, which is inherited by 22 classes, because the operator-managed Kafka service is getting an io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Kafka] with name: [kafka-instance] in namespace: [ts-qgrutdkgws] failed exception. We can read where the issue comes from in the stack trace:
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:159)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.getMandatory(BaseOperation.java:177)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:139)
at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.get(BaseOperation.java:88)
at io.quarkus.test.bootstrap.inject.OpenShiftClient.isCustomResourceReady(OpenShiftClient.java:503)
at io.quarkus.test.services.operator.OperatorManagedResource.lambda$customResourcesAreReady$0(OperatorManagedResource.java:87)
This points us here. This method is what should be changed for now to be more robust. With the changes in fabric8 client 6, I presume we now get an exception from here rather than the null value that fabric8 client 5 returned, which is the case the previously linked isCustomResourceReady method handles.
So, we should change the linked failure spots.
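As a rough illustration of making such a readiness check tolerate both outcomes (a null result and a thrown exception), consider the sketch below. It deliberately does not use the fabric8 API: `ReadinessCheck`, `lookup`, and `ready` are hypothetical names, and it assumes the client's exception is unchecked.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper; not part of the test framework or of fabric8.
final class ReadinessCheck {
    private ReadinessCheck() {
    }

    // Treats "lookup returned null" and "lookup threw" the same way: not ready yet.
    static <T> boolean isReady(Supplier<T> lookup, Predicate<T> ready) {
        try {
            T resource = lookup.get();
            return resource != null && ready.test(resource);
        } catch (RuntimeException e) {
            // Assumption: the client signals a missing or deleted resource
            // with an unchecked exception.
            return false;
        }
    }
}
```

Catching every `RuntimeException` is broad and can hide genuine failures, so a real change would probably narrow the catch to the specific client exception.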
The other big issue with this proposal is semantics. We cannot have both isRunning and isStopped booleans without atomically ensuring they're in a consistent state relative to each other, because that's going to lead to some really ugly expectations and some seriously weird state machines for those two states.
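One way to avoid two independent flags drifting apart is to keep a single state value and derive both answers from it. The sketch below is only an illustration of that idea: `ServiceLifecycle` and its `State` enum are hypothetical and do not reflect how `BaseService` is actually structured.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a single-state lifecycle; not the real BaseService.
class ServiceLifecycle {
    enum State { NEW, RUNNING, STOPPED }

    private final AtomicReference<State> state = new AtomicReference<>(State.NEW);

    // Succeeds only for the NEW -> RUNNING transition.
    boolean start() {
        return state.compareAndSet(State.NEW, State.RUNNING);
    }

    // Succeeds only for the RUNNING -> STOPPED transition,
    // so a second stop() returns false and does nothing.
    boolean stop() {
        return state.compareAndSet(State.RUNNING, State.STOPPED);
    }

    boolean isRunning() {
        return state.get() == State.RUNNING;
    }

    boolean isStopped() {
        return state.get() == State.STOPPED;
    }
}
```

Because both queries read the same field, `isRunning()` and `isStopped()` can never both be true, and the return value of `stop()` tells the caller whether this call was the one that performed the transition.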
@mjurc the issue points us here, but the issue is a symptom, not a cause. There is a linked issue[1] in the comments, and according to the discussion in that thread, the problem is caused by an attempt to stop the service twice (once as part of the JUnit afterAll routine and once as part of the closing routine). I guess you can agree that closing a service twice is not normal behaviour for any of the 22 different types of service we have. Also, I would like to point out that this change does not add concurrency issues but prevents them (even if we do not have them right now). I agree with the point about state machines. What do you think about moving the isStopped check into the isRunning method (and synchronizing the isRunning method as well)?
@fedinskiy that still makes
This does not happen in any other resource for a very specific reason -
Those steps are kinda slow and discovery of resources used by CSVs is not trivial - all in all, this is why |
7971990 to 08ef9cf
@mjurc we discussed this in person and decided to let you think about it. The current version was updated to have |
I can't understand why |
Superseded by #608