
Fix a race condition which would allow a rescheduled task to be reported missing even though it is not #5160

Merged

fjetter merged 1 commit into dask:main on Oct 8, 2021

Conversation

fjetter
Member

@fjetter fjetter commented Aug 2, 2021

There are two race conditions in here; I could only reproduce one of them in a test.

If a key is transitioned to "flight", a gather_dep task for this key is scheduled. This race condition requires the task to be scheduled in a batch with another set of tasks. If one task of this batch is released before the gather_dep coroutine first executes, but after the transition in ensure_communicating, the gather_dep coroutine will filter out the released task and will only try to fetch results from the remote for the tasks still in state "flight".
Once the response is available, however, the worker tries to parse all keys of the original batch out of the response, not only the ones actually fetched. I marked this situation with a FIXME a while ago, but I didn't anticipate the impact, nor did I have an actual test backing this up.

Two things can now happen:

  1. The key is actually released and stays released. In this case we take the "already-executing" code path, which is wrong: we get a log entry for a released task claiming it was executed. No big deal; we'll survive a bad log message.
  2. However, if the forgotten key is rescheduled during the data collection, i.e. during get_data_from_worker, we take the "missing-data" path: the data for the key is not in the response, the task is not in state memory, the remote was not busy, and the task still has dependents since it was rescheduled. In this case the worker issues a missing-data signal to the scheduler, which deletes the possibly only replica of that task on the remote. This in turn might escalate pretty badly.
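The two-step race above can be condensed into a small, self-contained simulation. All names here (`Task`, `gather_dep`, `fetch_from_remote`, `main`) are hypothetical stand-ins, not the real `distributed.Worker` internals; the point is only to show how iterating over the original batch instead of the filtered one turns a released-then-rescheduled key into a "missing" key.

```python
import asyncio

class Task:
    def __init__(self, state):
        self.state = state

async def fetch_from_remote(keys):
    # Pretend the remote returns data for exactly the keys requested.
    await asyncio.sleep(0)
    return {k: f"data-{k}" for k in keys}

async def gather_dep(tasks, keys, missing):
    # Step 1: drop keys that were released before this coroutine ran.
    in_flight = [k for k in keys if tasks[k].state == "flight"]
    response = await fetch_from_remote(in_flight)
    # Step 2 (the bug): iterate over the ORIGINAL batch, not `in_flight`,
    # so a released-then-rescheduled key looks like data missing on the remote.
    for k in keys:
        if k not in response:
            missing.append(k)

async def main():
    tasks = {"x": Task("flight"), "y": Task("flight")}
    missing = []
    coro = gather_dep(tasks, ["x", "y"], missing)
    # Before gather_dep first runs, "y" is released and rescheduled.
    tasks["y"].state = "fetch"
    await coro
    return missing

print(asyncio.run(main()))  # prints ['y'] — the rescheduled key is reported missing
```

Changing the second loop to iterate over `in_flight` makes the simulation report nothing missing, which matches the fix described in this PR.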

I tried very hard to reproduce 2.) using multiple mocks and locks 🤯 but it is extremely difficult to trigger due to the rather monolithic design of that particular section of code. I settled for a test which covers only 1.), since 2.) is a special case of 1.). I intend to eliminate such race conditions by implementing a cancelled state as described in #4413 (comment), but this is way out of scope for this PR. The test will still be useful, though.

@fjetter fjetter force-pushed the missing_gather_dep_not_flight branch 2 times, most recently from fd3cbb0 to 9a90025 on September 8, 2021
@fjetter fjetter force-pushed the missing_gather_dep_not_flight branch from 9a90025 to 3da4d9e on October 4, 2021
@fjetter
Member Author

fjetter commented Oct 4, 2021

With #5046, this problem still persists. I updated the patch and wrote a new test for the case where all keys are filtered out, in which case gather_dep should be a no-op.
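The "all keys filtered out" case can be sketched as an early-return guard. This is a hypothetical simplification, not the actual `Worker.gather_dep` code: `Task`, `gather_dep`, and `fetch` are illustrative names, and the guard only shows the intended no-op behaviour — if nothing in the batch is still in flight, the remote is never contacted.

```python
import asyncio

class Task:
    def __init__(self, state):
        self.state = state

async def gather_dep(tasks, keys, fetch):
    # Filter to keys still in flight; if none remain, skip the remote call.
    in_flight = [k for k in keys if tasks[k].state == "flight"]
    if not in_flight:
        return {}  # no-op: nothing to fetch, remote never contacted
    return await fetch(in_flight)

async def main():
    calls = []

    async def fetch(keys):
        calls.append(keys)
        return {k: None for k in keys}

    # The only key in the batch was released before gather_dep ran.
    tasks = {"x": Task("released")}
    result = await gather_dep(tasks, ["x"], fetch)
    return result, calls

result, calls = asyncio.run(main())
print(result, calls)  # prints {} [] — empty result, and fetch was never called
```

A test for this shape only needs to assert that the fetch callback was never invoked when every key had already left the "flight" state.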

@fjetter fjetter force-pushed the missing_gather_dep_not_flight branch from 3da4d9e to 1e570a0 on October 7, 2021
@jrbourbeau jrbourbeau mentioned this pull request Oct 7, 2021
3 tasks
@fjetter
Member Author

fjetter commented Oct 8, 2021

I would like to merge the PRs around work stealing in the following order

  1. Fix a race condition which would allow a rescheduled task to be reported missing even though it is not #5160
  2. Fix regression where unknown tasks were allowed to be stolen #5392
  3. Resolve work stealing deadlock caused by race in move_task_confirm #5379
  4. Long running occupancy #5395
  5. Increase latency overhead in stealing cost calculation #5390

Since this is the first in line and relatively small, I'll go ahead and merge, since other PRs partially fail due to this race condition. If there are any comments, I'll happily work them in afterwards.

@fjetter fjetter merged commit c2b3add into dask:main Oct 8, 2021
```python
# Keep namespace clean since this func is long and has many
# dep*, *ts* variables

assert cause is not None
```
Collaborator:
This is required by mypy, as later in the code you're accessing a property of cause.
Reinstated in #2393.

@fjetter fjetter added the labels deadlock (The cluster appears to not make any progress) and stealing on Jun 20, 2022