Tokenization meta-issue #10905
@fjetter offline you expressed concern about dask-expr optimization performance. I'm observing a 50~150ms slowdown in runtime for graph definition + optimization on the full TPCH queries. [plots omitted: runtime for graph definition + optimization; end-to-end runtime on the Coiled cluster] The other TPCH queries show similar behaviour.
All PRs are now only waiting for review.
Summary of changes
The issue of key collisions on the scheduler (same key, but different run_spec and possibly different dependencies), which is chiefly caused by #9888, has been mitigated. [table omitted; legend: ✔️ produces correct output]
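To make the collision concrete, here is a pure-Python toy (not Distributed's actual scheduler code; `spec_token` and `TinyScheduler` are hypothetical names) showing what "same key, different run_spec" means and why a scheduler must detect it:

```python
import hashlib
import pickle

def spec_token(run_spec):
    """Hypothetical helper: hash a run_spec deterministically."""
    return hashlib.sha256(pickle.dumps(run_spec)).hexdigest()

class TinyScheduler:
    """Toy sketch of run_spec collision detection on task submission."""

    def __init__(self):
        self._specs = {}  # key -> token of the run_spec first seen

    def submit(self, key, run_spec):
        token = spec_token(run_spec)
        seen = self._specs.setdefault(key, token)
        if seen != token:
            # Same key pointing at different work: the collision
            # that this issue mitigates.
            raise RuntimeError(f"run_spec collision for key {key!r}")
        return key

sched = TinyScheduler()
sched.submit("x-123", ("add", 1, 2))
sched.submit("x-123", ("add", 1, 2))   # identical resubmission is fine
try:
    sched.submit("x-123", ("mul", 1, 2))  # same key, different spec
except RuntimeError as exc:
    print(exc)
```

Real collisions are subtler because the keys come from `tokenize()` itself, which is why the determinism work below matters.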
[1] This is not great and could deserve a follow-up.
As the ongoing changes in tokenization are getting more complicated, I'm writing a meta-issue that maps them out.
High level goals
- tokenize() is idempotent: call it twice on the same object and get the same token.
- tokenize() is deterministic: call it twice on identical objects, or on the same object after a serialization round-trip, and get the same token. This is limited to the same interpreter; determinism is not guaranteed across interpreters.
- When tokenize() can't return a deterministic result, there is a system for notifying the dask code (e.g. so that you don't raise after comparing two non-deterministic tokens).

There are a handful of known objects that violate idempotency/determinism: object() is idempotent, but not deterministic (by choice, as it's normally used as a singleton). Notably, all callables (including lambdas) become deterministic.
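The two properties can be illustrated with a toy hasher (a pure-Python sketch; `toy_tokenize` is a hypothetical stand-in, and dask's real `tokenize()` in `dask.base` is far more elaborate):

```python
import hashlib
import pickle

def toy_tokenize(obj):
    """Hypothetical sketch: hash an object's pickled bytes."""
    try:
        payload = pickle.dumps(obj, protocol=5)
    except Exception:
        # Fallback for unpicklable objects: hash by identity. This is
        # idempotent within one interpreter, but NOT deterministic.
        payload = str(id(obj)).encode()
    return hashlib.sha256(payload).hexdigest()

a = {"x": 1, "y": [1, 2, 3]}
b = {"x": 1, "y": [1, 2, 3]}
assert toy_tokenize(a) == toy_tokenize(a)  # idempotent
assert toy_tokenize(a) == toy_tokenize(b)  # deterministic (same interpreter)

f = lambda x: x + 1
g = lambda x: x + 1
assert toy_tokenize(f) == toy_tokenize(f)  # idempotent...
# ...but two equal-looking lambdas get different tokens under this toy
# fallback. Making such callables deterministic is exactly what the
# tokenization PRs below address in the real tokenize().
assert toy_tokenize(f) != toy_tokenize(g)
```

The lambda case shows why determinism is the hard property: identity-based fallbacks satisfy idempotency trivially but break as soon as an equivalent object is reconstructed.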
PRs
4a. Deterministic hashing for almost everything #10883
4b. Remove lambda tokenization hack dask-expr#822
run_spec
distributed#8185

Closes
Superseded PRs
Other actions
✔️ A/B tests show no impact whatsoever from the additional tokenization labour on the end-to-end workflows in coiled/benchmarks
✔️ A/B tests on dask-expr optimization show 50~150ms slowdown for production-sized TPCH queries, which IMHO is negligible