
Refactoring gather_dep / Remove missing data message #6544

Closed
wants to merge 9 commits into from

Conversation

fjetter
Member

@fjetter fjetter commented Jun 9, 2022

This is building on top of #6388 with a few significant changes

All changes are in the latest commit on top of the PR for review

  • It does not reuse any "helper" code between GatherDep result event handlers other than _gather_dep_done_common, which basically just resets some state and sets the ts.done attribute. All other shared code for parsing gather_dep results has been a major source of inconsistencies and we should avoid it
  • It already incorporates / closes "What are the usecases for Scheduler.handle_missing_data?" #6445 since the missing-data message is not required. I realize this is significant scope creep relative to the actual refactor. We could likely incorporate this again into the event handler upon success-but-missing, but I would not add it to the network-failure result since that can cause otherwise severe race conditions, see also "Add back Worker.transition_fetch_missing" #6112 (comment)
  • It no longer transitions tasks directly to missing after a network failure. This breaks a few validation checks about fetching data but overall greatly simplifies the code. This relies on us picking up empty ts.who_has sets in various places in the code, which then transitions tasks to missing
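The third bullet can be sketched roughly as follows. This is a hypothetical, heavily simplified illustration of the pattern (the names `TaskState` and `recommendations_for` are stand-ins, not the actual distributed internals): instead of transitioning a task straight to "missing" on network failure, the task is left in "fetch" with an empty `who_has`, and the transition system notices the empty set on a later cycle and recommends "missing".

```python
# Hypothetical sketch: a task stays in "fetch" when it loses its last replica;
# the transition system later sees the empty who_has and recommends "missing".

class TaskState:
    def __init__(self, key):
        self.key = key
        self.state = "fetch"
        self.who_has = set()  # peers believed to hold the data


def recommendations_for(ts):
    """Return state recommendations for a task. An empty who_has on a
    task in "fetch" means no known peer holds the data, so recommend
    transitioning it to "missing"."""
    if ts.state == "fetch" and not ts.who_has:
        return {ts.key: "missing"}
    return {}


ts = TaskState("x")
ts.who_has = {"worker-a"}
assert recommendations_for(ts) == {}  # a replica is known: nothing to do

# A network failure removes the last known replica...
ts.who_has.discard("worker-a")
# ...and the next transition cycle picks it up:
assert recommendations_for(ts) == {"x": "missing"}
```

The point of the design is that failure handlers only have to clear `who_has`; they never need to emit a "missing" transition themselves.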

Comment on lines -104 to -114
@gen_cluster(client=True)
async def test_worker_find_missing(c, s, a, b):
    fut = c.submit(inc, 1, workers=[a.address])
    await fut
    # We do not want to use proper API since it would ensure that the cluster is
    # informed properly
    del a.data[fut.key]
    del a.tasks[fut.key]

    # Actually no worker has the data; the scheduler is supposed to reschedule
    assert await c.submit(inc, fut, workers=[b.address]) == 3
Member Author
This test is a bit nonsense. I don't think we should be allowed to simply mess with the state and expect the system to recover

Comment on lines -4186 to -4187
assert ts in self.data_needed
assert ts.who_has
Member Author
As motivated in the description, we now rely on tasks sitting in fetch with who_has empty; the transition system picks this up and transitions the task to missing. This allows us to write simpler recommendations

Collaborator

@crusaderky crusaderky Jun 9, 2022

assert ts in self.data_needed is still valid, as validate_task_state is only called at the end of a transitions cycle. It's only in transition_fetch_missing, which runs in the middle of a cycle, that it does not hold.

block_get_data.set()

await wait_for_state("x", "missing", a)
# await wait_for_state("y", "missing", a)
Member Author
y is not reliably transitioned to missing since it is stuck in fetch, depending on how fast the RefreshWhoHas comes in.
This could be restored if we transitioned to missing immediately, but I have already pointed out a couple of times that I don't think this is a requirement, and by not doing so we get much simpler recommendation code

@github-actions
Contributor

github-actions bot commented Jun 9, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0   15 suites ±0   6h 14m 12s ⏱️ −13m 20s
2 858 tests +3   2 774 ✔️ +1   81 💤 ±0   3 +2
21 172 runs +21   20 225 ✔️ +18   944 💤 +1   3 +2

For more details on these failures, see this check.

Results for commit 43fc518. ± Comparison against base commit 879fb89.

del ts

for ts in self._gather_dep_done_common(ev):
    ts.who_has.discard(worker)
Collaborator
This will never happen due to line 3444

else:
    ts.who_has.discard(ev.worker)
    self.has_what[ev.worker].discard(ts.key)
    recommendations[ts] = "fetch"
Collaborator
Suggested change:

-    recommendations[ts] = "fetch"
+    if self.validate:
+        assert ts.state != "fetch"
+        assert ts not in self.data_needed_per_worker[ev.worker]
+    recommendations[ts] = "fetch"

Otherwise _select_keys_for_gather will fail later
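The failure mode this suggestion guards against can be sketched as follows. This is a hypothetical, simplified model (the `State` class and `recommend_fetch` helper are stand-ins, not the actual distributed internals): if a task already queued for fetching is recommended "fetch" again, it gets enqueued into the bookkeeping structures twice, and a later key-selection pass could pick it up twice. The validation asserts catch the double recommendation at its source.

```python
# Hypothetical sketch: guarding against re-recommending "fetch" for a task
# that is already queued, which would double-enqueue it.
from collections import defaultdict


class State:
    def __init__(self):
        self.data_needed = []  # keys queued for fetching, in order
        self.data_needed_per_worker = defaultdict(list)
        self.validate = True

    def recommend_fetch(self, key, worker):
        if self.validate:
            # The suggested guards: the task must not already be queued,
            # neither globally nor for this particular worker.
            assert key not in self.data_needed
            assert key not in self.data_needed_per_worker[worker]
        self.data_needed.append(key)
        self.data_needed_per_worker[worker].append(key)


s = State()
s.recommend_fetch("x", "worker-a")

caught = False
try:
    s.recommend_fetch("x", "worker-a")  # second recommendation is rejected
except AssertionError:
    caught = True
assert caught
assert s.data_needed == ["x"]  # the key was not enqueued twice
```

Without the guard, the second call would silently append the key again, and the duplicate would only surface later when the gather-selection logic runs.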

@fjetter fjetter closed this Jun 10, 2022
Successfully merging this pull request may close these issues.

What are the usecases for Scheduler.handle_missing_data?
2 participants