
Tokenization meta-issue #10905

Closed · crusaderky opened this issue Feb 6, 2024 · 3 comments · Fixed by dask/distributed#8512

crusaderky (Collaborator) commented Feb 6, 2024

As the ongoing changes to tokenization are getting more complicated, I'm writing a meta-issue that maps them out.

High level goals

  • Ensure that tokenize() is idempotent: call it twice on the same object and get the same token.
  • Ensure that tokenize() is deterministic: call it twice on identical objects, or on the same object after a serialization round-trip, and get the same token. This is limited to the same interpreter; determinism is not guaranteed across interpreters. (A minimal sketch of both properties follows this list.)
  • Ensure that, when tokenize() can't return a deterministic result, there is a mechanism for notifying dask code (e.g. so that it doesn't raise after comparing two non-deterministic tokens).
  • Robustly detect when #9888 (Reuse of keys in blockwise fusion can cause spurious KeyErrors on distributed cluster) happens, in order to mitigate its impact.
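A minimal sketch of idempotency and determinism, using dask.base.tokenize on plain Python objects (the sample values are illustrative):

```python
from dask.base import tokenize

x = {"a": [1, 2, 3], "b": "hello"}
y = {"a": [1, 2, 3], "b": "hello"}  # identical to x, but a distinct object

assert tokenize(x) == tokenize(x)  # idempotent: same object, same token
assert tokenize(x) == tokenize(y)  # deterministic: identical objects, same token
```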

There are a handful of known objects that violate idempotency/determinism:

  • object() is idempotent, but not deterministic (by choice, as it's normally used as a singleton); see the sketch below.
  • Objects that can't be serialized with cloudpickle are neither idempotent nor deterministic. Expect them to break spectacularly in dask_expr, and probably in many other places going forward.

Notably, all callables (including lambdas) become deterministic.
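A quick sketch of the object() caveat described above, within a single interpreter:

```python
from dask.base import tokenize

sentinel = object()
assert tokenize(sentinel) == tokenize(sentinel)  # idempotent: same instance, same token
assert tokenize(object()) != tokenize(object())  # not deterministic: distinct instances differ
```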

PRs

  1. Make tokenization more deterministic #10876
  2. Tokenize SubgraphCallable #10898
  3. Tweak sequence tokenization #10904
  4. these two must go in together:
    4a. Deterministic hashing for almost everything #10883
    4b. Remove lambda tokenization hack dask-expr#822
  5. Test numba tokenization #10896
  6. Remove redundant normalize_token variants #10884
  7. Override tokenize.ensure-deterministic config flag #10913
  8. Config toggle to disable blockwise fusion #10909
  9. tokenize: Don't call str() on dict values #10919
  10. Tweaks to update_graph (backport from #8185) distributed#8498
  11. Tokenization-related test tweaks (backport from #8185) distributed#8499
  12. Warn if tasks are submitted with identical keys but different run_spec distributed#8185
  13. Keep old dependencies on run_spec collision distributed#8512

Closes

Superseded PRs

Other actions

✔️ A/B tests show no impact whatsoever from the additional tokenization labour on the end-to-end workflows in coiled/benchmarks
✔️ A/B tests on dask-expr optimization show a 50–150 ms slowdown for production-sized TPCH queries, which IMHO is negligible

crusaderky self-assigned this Feb 6, 2024
crusaderky (Collaborator) commented:

@fjetter, offline you expressed concern about dask-expr optimization performance. I'm observing a 50–150 ms slowdown on the full TPCH queries; IMHO it's negligible.

Runtime for graph definition + optimization. Note that it incorporates fetching the metadata of the input dataframe from S3, which I suspect accounts for the lion's share of both the mean time and the variance (this is just an intuition; I didn't collect numerical evidence).
[image: graph definition + optimization runtime]

End-to-end runtime on the Coiled cluster:
[image: end-to-end runtime on the Coiled cluster]

The other TPCH queries show similar behaviour.

crusaderky (Collaborator) commented:

All PRs are now waiting only for review.


crusaderky commented Feb 19, 2024

Summary of changes

  • tokenize() is now deterministic, within the same interpreter, in most cases
  • in the rare edge cases where it is not, you can trust tokenize(..., ensure_deterministic=True) to raise robustly (see the sketch after this list)
  • there should be no expectation of determinism across interpreter restarts, hosts, OSs, or dependency versions
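A hedged sketch of the opt-in failure mode; the Unpicklable class is illustrative, and the exact exception type may vary across dask versions:

```python
import dask
from dask.base import tokenize

class Unpicklable:
    """An object that cloudpickle cannot serialize, so its token falls back to a random value."""
    def __reduce__(self):
        raise TypeError("cannot pickle this object")

tokenize(Unpicklable())  # succeeds, but returns a non-deterministic token

try:
    tokenize(Unpicklable(), ensure_deterministic=True)
except Exception as exc:  # the exact exception type depends on the dask version
    print(f"non-deterministic tokenization detected: {exc!r}")

# The same behaviour can be enabled globally via the config flag from #10913:
with dask.config.set({"tokenize.ensure-deterministic": True}):
    pass  # tokenize() raises on non-deterministic objects inside this block
```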

The issue of key collision on the scheduler (same key, but different run_spec and possibly different dependencies), which is chiefly caused by #9888, has been mitigated:

Legend

  • ✔️ produces correct output
  • 📛 cluster crashes or hangs on AssertionError
  • 😕 task completes successfully, but output is wrong
  • ⚠️ emits warning in the scheduler log

| run_spec | Task output | Dependencies       | Old task status | 2024.2.0 | 2024.2.1 | Use case of #9888? |
|----------|-------------|--------------------|-----------------|----------|----------|--------------------|
| same     | same        | same               | *               | ✔️       | ✔️       | no                 |
| differs  | same        | same               | pending         | ✔️       | ⚠️ ✔️    | yes                |
| differs  | same        | new task has fewer | pending         | ✔️       | ⚠️ ✔️    | yes                |
| differs  | same        | new task has more  | pending         | 📛       | ⚠️ ✔️    | yes                |
| differs  | same        | same               | memory          | ✔️       | ✔️       | yes                |
| differs  | same        | new task has fewer | memory          | ✔️       | ⚠️ ✔️    | yes                |
| differs  | same        | new task has more  | memory          | ✔️       | ✔️       | yes                |
| differs  | same        | *                  | released        | ✔️       | ⚠️ ✔️    | yes                |
| differs  | differs     | same               | pending         | 😕       | ⚠️ 😕    | no                 |
| differs  | differs     | new task has fewer | pending         | 😕       | ⚠️ 😕    | no                 |
| differs  | differs     | new task has more  | pending         | 📛       | ⚠️ 😕    | no                 |
| differs  | differs     | same               | memory          | 😕       | 😕 [1]   | no                 |
| differs  | differs     | new task has fewer | memory          | 😕       | ⚠️ 😕    | no                 |
| differs  | differs     | new task has more  | memory          | 😕       | 😕 [1]   | no                 |
| differs  | differs     | *                  | released        | 😕       | ⚠️ 😕    | no                 |

[1] This is not great and may deserve a follow-up.
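For concreteness, a minimal sketch of the kind of collision the table classifies. The key name and values are illustrative, and which row you hit depends on timing (whether the first task is still pending or already in memory) and on the distributed version:

```python
from distributed import Client

client = Client(processes=False)

# Two graphs reuse the key "x" with different run_specs and different outputs
# (the "differs" / "differs" rows of the table above).
fut1 = client.get({"x": (sum, [1, 2])}, "x", sync=False)
fut2 = client.get({"x": (sum, [1, 2, 3])}, "x", sync=False)  # same key, different task

# The scheduler keeps the first definition, so fut2 is expected to resolve to 3
# rather than 6 (the 😕 outcome); since 2024.2.1 a warning is typically logged too.
print(fut1.result(), fut2.result())

client.close()
```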
