Fix TasksIT#testGetTaskWaitForCompletionWithoutStoringResult #108094

Open
wants to merge 10 commits into
base: main

arteam
Contributor

@arteam arteam commented Apr 30, 2024

It seems that the failure (the missing index) has always existed in the test scenario and is supposed to be handled by TransportGetTaskAction.java. We catch IndexNotFoundException here and convert it to ResourceNotFoundException. Then we catch ResourceNotFoundException here and return a snapshot of the task as the response.
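
For readers without the source open, the two-step handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in TransportGetTaskAction.java: the exception classes are local stand-ins with simplified constructors, and the `indexExists` and `snapshot` parameters are hypothetical. Only the method names getFinishedTaskFromIndex and waitedForCompletion are borrowed from this discussion.

```java
// Illustrative stand-ins only; the real Elasticsearch classes differ in signature and hierarchy.
final class TaskLookupSketch {

    static class IndexNotFoundException extends RuntimeException {
        IndexNotFoundException(String index) {
            super("no such index [" + index + "]");
        }
    }

    static class ResourceNotFoundException extends RuntimeException {
        ResourceNotFoundException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Step 1: a missing index is translated into "resource not found".
    static String getFinishedTaskFromIndex(String taskId, boolean indexExists) {
        try {
            if (!indexExists) {
                throw new IndexNotFoundException(".tasks");
            }
            return "stored result of " + taskId;
        } catch (IndexNotFoundException e) {
            throw new ResourceNotFoundException("no stored result for task [" + taskId + "]", e);
        }
    }

    // Step 2: the wait-for-completion path swallows "resource not found"
    // and falls back to the snapshot taken while the task was running.
    static String waitedForCompletion(String taskId, String snapshot, boolean indexExists) {
        try {
            return getFinishedTaskFromIndex(taskId, indexExists);
        } catch (ResourceNotFoundException e) {
            return snapshot;
        }
    }
}
```

In this sketch, calling getFinishedTaskFromIndex directly with a missing index surfaces ResourceNotFoundException to the caller, while going through waitedForCompletion returns the snapshot instead.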

In the stack trace, getFinishedTaskFromIndex was called from getRunningTaskFromNode, not from waitedForCompletion, due to a race between the get request and the unblocking request, which are sent asynchronously. I've changed the waitForCompletionTestCase test method to unblock the task only after the request has started waiting for the task completion by registering a removal listener. By doing so, we make sure we test the "wait for completion" branch while the task is running.

The part about the missing index seems to be irrelevant, since waitedForCompletion is able to suppress the error and return a snapshot of the running task, which is not possible if getFinishedTaskFromIndex gets called directly from getRunningTaskFromNode.
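
The ordering constraint behind the test change can be sketched with plain latches. This is a hypothetical miniature, not the actual test code: none of the class or method names below exist in Elasticsearch, and it assumes the waiting get request is what registers the removal listener.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical miniature of the scenario; all names are made up for illustration.
final class UnblockAfterListenerSketch {

    static final class BlockedTask {
        private final List<Runnable> removalListeners = new CopyOnWriteArrayList<>();
        private final CountDownLatch listenerRegistered = new CountDownLatch(1);
        private final CountDownLatch unblocked = new CountDownLatch(1);

        // Called by the "get task, wait for completion" request.
        void addRemovalListener(Runnable listener) {
            removalListeners.add(listener);
            listenerRegistered.countDown(); // tells the test the request is now waiting
        }

        void unblock() {
            unblocked.countDown();
        }

        // The task body: stay blocked until released, then notify listeners.
        void runUntilUnblocked() throws InterruptedException {
            unblocked.await();
            removalListeners.forEach(Runnable::run);
        }
    }

    // Returns true if the waiting request was notified of the task's completion.
    static boolean run() {
        BlockedTask task = new BlockedTask();
        CountDownLatch requestNotified = new CountDownLatch(1);
        Thread taskThread = new Thread(() -> {
            try {
                task.runUntilUnblocked();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        taskThread.setDaemon(true);
        Thread getRequest = new Thread(() -> task.addRemovalListener(requestNotified::countDown));
        taskThread.start();
        getRequest.start();
        try {
            // The ordering fix: do not unblock until the request has registered its listener.
            if (!task.listenerRegistered.await(10, TimeUnit.SECONDS)) {
                return false;
            }
            task.unblock();
            boolean notified = requestNotified.await(10, TimeUnit.SECONDS);
            taskThread.join();
            getRequest.join();
            return notified;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Without the wait on `listenerRegistered`, the task could finish before the request starts waiting, which corresponds to the path where the request no longer finds the task running.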

Resolves #107823

Make sure the `.tasks` index is created before we start testing task completion
without storing its result. To achieve that, we store a fake task before we start
`waitForCompletionTestCase`.

Resolves #107823
@arteam arteam added >test Issues or PRs that are addressing/adding tests :Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. labels Apr 30, 2024
@elasticsearchmachine elasticsearchmachine added Team:Distributed Meta label for distributed team v8.15.0 labels Apr 30, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@arteam
Contributor Author

arteam commented May 2, 2024

@elasticmachine update branch

@arteam arteam requested review from idegtiarenko, volodk85 and DaveCTurner and removed request for idegtiarenko and DaveCTurner May 2, 2024 07:26
@arteam
Contributor Author

arteam commented May 7, 2024

@elasticmachine update branch

@arteam arteam requested review from idegtiarenko, DaveCTurner, volodk85 and a team and removed request for idegtiarenko, volodk85 and DaveCTurner May 8, 2024 07:56
@henningandersen
Contributor

The linked issue says that the tasks index got deleted, but that does not seem to match the resolution here? Can we find out why the tasks index was deleted too soon instead?

@arteam
Contributor Author

arteam commented May 13, 2024

@henningandersen I believe the comment in the linked issue is wrong. The index was never deleted, because the test doesn't create the index. The test waits for the completion of a task, and the task only completes because we have special error handling for the case where the index doesn't exist. I guess in some cases the error handling can't figure out that the root cause was IndexNotFoundException, which should be converted to ResourceNotFoundException, which is silently ignored.

I believe we should just explicitly create the index, because testGetTaskWaitForCompletionWithoutStoringResult is supposed to test task completion, not the error handling for missing indices, which is done in testGetTaskNotFound and testTasksGetWaitForNoTask.

@henningandersen
Contributor

@arteam it still smells like we might be covering up for a bug here. AFAICS, we expect the logic to work regardless of whether the index exists or not. Can you elaborate on how the test differentiates between whether the task exists or not? If that happens within the actual tasks code, we may want to target that instead (as well as add a dedicated test for it).

@DaveCTurner
Contributor

DaveCTurner commented May 15, 2024 via email

@arteam
Contributor Author

arteam commented May 15, 2024

> I'm pretty sure #108052 had no effect here, it was a pure refactoring.

Sorry about that! I deleted my comment right after I realized that #108052 indeed just removed dead code. I was confused by the line numbers in the stack trace.

@arteam
Contributor Author

arteam commented May 15, 2024

Still, the only way I can see the test failing is ExceptionsHelper.unwrap(e, ResourceNotFoundException.class) returning null. In fact, if I replace it with if (false), the error stack trace looks exactly like the one in the issue. Not sure how that is possible, though.
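
To make the "unwrap returns null" case concrete, here is a simplified stand-in for a cause-chain unwrap. It is not the ExceptionsHelper implementation: the single-class signature and the depth limit are assumptions made for this sketch. It only shows the general rule that an exception is found when it sits in the cause chain and is missed when it is attached some other way, for example as a suppressed exception.

```java
// Simplified stand-in for a cause-chain unwrap; not the Elasticsearch implementation.
final class UnwrapSketch {

    // Walks the cause chain and returns the first throwable of the requested type, or null.
    // The depth limit guards against cyclic cause chains and is an assumption of this sketch.
    static <T extends Throwable> T unwrap(Throwable t, Class<T> type) {
        int depth = 0;
        for (Throwable current = t; current != null && depth < 10; current = current.getCause(), depth++) {
            if (type.isInstance(current)) {
                return type.cast(current);
            }
        }
        return null;
    }
}
```

For example, `unwrap(new RuntimeException("remote", new IllegalStateException("inner")), IllegalStateException.class)` returns the inner exception, while the same inner exception added with `addSuppressed` is not found and the call returns null.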

@arteam
Contributor Author

arteam commented May 18, 2024

@elasticmachine update branch

@henningandersen
Contributor

The main problem seems to be that the test case does not find the task running, see this part of the stack trace:

      at org.elasticsearch.action.admin.cluster.node.tasks.get.TransportGetTaskAction.getRunningTaskFromNode(TransportGetTaskAction.java:140)

which is this line.

This is where the focus should go, I think. The test ran in less than 50ms, so it is not timeout related; more likely it is some race. I did a bit of digging but did not find it.

I do notice that the test case is a suite case, and those are sometimes disturbed by a prior test. I did not find any such evidence though, so it might be a red herring.

I notice that the test writes "Test task finished" on the node, so the test task was not cancelled either, since I believe it would not output that if it had been.

@arteam
Contributor Author

arteam commented May 21, 2024

@elasticmachine update branch

@arteam
Contributor Author

arteam commented May 21, 2024

@henningandersen That was a very good catch! getFinishedTaskFromIndex was called from getRunningTaskFromNode, not from waitedForCompletion. There indeed seems to be a race between the get request and the unblocking request, which are sent asynchronously. I've changed waitForCompletionTestCase to unblock the task only after the request has started waiting for the task completion by registering a removal listener. By doing so, we make sure we test the "wait for completion" branch while the task is running.

The part about the missing index seems to be irrelevant, since waitedForCompletion is able to suppress the error and return a snapshot of the running task, which is not possible if getFinishedTaskFromIndex gets called directly from getRunningTaskFromNode.

Development

Successfully merging this pull request may close these issues.

[CI] TasksIT testGetTaskWaitForCompletionWithoutStoringResult failing
5 participants