
Tokenization meta-issue #10905

Closed · crusaderky opened this issue Feb 6, 2024 · 3 comments · Fixed by dask/distributed#8512

crusaderky (Collaborator) commented Feb 6, 2024

As the ongoing changes to tokenization are getting more complicated, I'm writing a meta-issue that maps them out.

High level goals

  • Ensure that tokenize() is idempotent: call it twice on the same object and get the same token.
  • Ensure that tokenize() is deterministic: call it twice on identical objects, or on the same object after a serialization round-trip, and get the same token. This is limited to the same interpreter; determinism is not guaranteed across interpreters. (A minimal sketch of both properties follows this list.)
  • Ensure that, when tokenize() can't return a deterministic result, there is a mechanism for notifying dask code (e.g. so that it doesn't raise after comparing two non-deterministic tokens).
  • Robustly detect when #9888 (Reuse of keys in blockwise fusion can cause spurious KeyErrors on distributed cluster) happens, in order to mitigate its impact.
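A minimal sketch of idempotency and determinism, using dask.base.tokenize on plain Python objects (the sample values are illustrative):

```python
from dask.base import tokenize

x = {"a": [1, 2, 3], "b": "hello"}
y = {"a": [1, 2, 3], "b": "hello"}  # identical to x, but a distinct object

assert tokenize(x) == tokenize(x)  # idempotent: same object, same token
assert tokenize(x) == tokenize(y)  # deterministic: identical objects, same token
```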

There are a handful of known objects that violate idempotency/determinism:

  • object() is idempotent, but not deterministic (by choice, as it's normally used as a singleton); see the sketch below.
  • Objects that can't be serialized with cloudpickle are neither idempotent nor deterministic. Expect them to break spectacularly in dask_expr, and probably in many other places going forward.

Notably, all callables (including lambdas) become deterministic.
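A quick sketch of the object() caveat described above, within a single interpreter:

```python
from dask.base import tokenize

sentinel = object()
assert tokenize(sentinel) == tokenize(sentinel)  # idempotent: same instance, same token
assert tokenize(object()) != tokenize(object())  # not deterministic: distinct instances differ
```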

PRs

  1. Make tokenization more deterministic #10876
  2. Tokenize SubgraphCallable #10898
  3. Tweak sequence tokenization #10904
  4. these two must go in together:
    4a. Deterministic hashing for almost everything #10883
    4b. Remove lambda tokenization hack dask-expr#822
  5. Test numba tokenization #10896
  6. Remove redundant normalize_token variants #10884
  7. Override tokenize.ensure-deterministic config flag #10913
  8. Config toggle to disable blockwise fusion #10909
  9. tokenize: Don't call str() on dict values #10919
  10. Tweaks to update_graph (backport from #8185) distributed#8498
  11. Tokenization-related test tweaks (backport from #8185) distributed#8499
  12. Warn if tasks are submitted with identical keys but different run_spec distributed#8185
  13. Keep old dependencies on run_spec collision distributed#8512

Closes

Superseded PRs

Other actions

✔️ A/B tests show no impact whatsoever from the additional tokenization labour on the end-to-end workflows in coiled/benchmarks
✔️ A/B tests on dask-expr optimization show a 50–150 ms slowdown for production-sized TPCH queries, which IMHO is negligible

crusaderky self-assigned this Feb 6, 2024
crusaderky (Collaborator) commented:

@fjetter, offline you expressed concern about dask-expr optimization performance. I'm observing a 50–150 ms slowdown on the full TPCH queries; IMHO it's negligible.

Runtime for graph definition + optimization. Note that it incorporates fetching the metadata of the input dataframe from S3, which I suspect accounts for the lion's share of both the mean time and the variance (this is just an intuition; I didn't collect numerical evidence).
[image: graph definition + optimization runtime]

End-to-end runtime on the Coiled cluster:
[image: end-to-end runtime on the Coiled cluster]

The other TPCH queries show similar behaviour.

crusaderky (Collaborator) commented:

All PRs are now waiting only for review.


crusaderky commented Feb 19, 2024

Summary of changes

  • tokenize() is now deterministic, within the same interpreter, in most cases
  • in the rare edge cases where it is not, you can trust tokenize(..., ensure_deterministic=True) to raise robustly (see the sketch after this list)
  • there should be no expectation of determinism across interpreter restarts, hosts, OSs, or dependency versions
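A hedged sketch of the opt-in failure mode; the Unpicklable class is illustrative, and the exact exception type may vary across dask versions:

```python
import dask
from dask.base import tokenize

class Unpicklable:
    """An object that cloudpickle cannot serialize, so its token falls back to a random value."""
    def __reduce__(self):
        raise TypeError("cannot pickle this object")

tokenize(Unpicklable())  # succeeds, but returns a non-deterministic token

try:
    tokenize(Unpicklable(), ensure_deterministic=True)
except Exception as exc:  # the exact exception type depends on the dask version
    print(f"non-deterministic tokenization detected: {exc!r}")

# The same behaviour can be enabled globally via the config flag from #10913:
with dask.config.set({"tokenize.ensure-deterministic": True}):
    pass  # tokenize() raises on non-deterministic objects inside this block
```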

The issue of key collision on the scheduler (same key, but different run_spec and possibly different dependencies), which is chiefly caused by #9888, has been mitigated:

Legend

  • ✔️ produces correct output
  • 📛 cluster crashes or hangs on AssertionError
  • 😕 task completes successfully, but output is wrong
  • ⚠️ emits warning in the scheduler log

| run_spec | Task output | Dependencies       | Old task status | 2024.2.0 | 2024.2.1 | Use case of #9888? |
|----------|-------------|--------------------|-----------------|----------|----------|--------------------|
| same     | same        | same               | *               | ✔️       | ✔️       | no                 |
| differs  | same        | same               | pending         | ✔️       | ⚠️ ✔️    | yes                |
| differs  | same        | new task has fewer | pending         | ✔️       | ⚠️ ✔️    | yes                |
| differs  | same        | new task has more  | pending         | 📛       | ⚠️ ✔️    | yes                |
| differs  | same        | same               | memory          | ✔️       | ✔️       | yes                |
| differs  | same        | new task has fewer | memory          | ✔️       | ⚠️ ✔️    | yes                |
| differs  | same        | new task has more  | memory          | ✔️       | ✔️       | yes                |
| differs  | same        | *                  | released        | ✔️       | ⚠️ ✔️    | yes                |
| differs  | differs     | same               | pending         | 😕       | ⚠️ 😕    | no                 |
| differs  | differs     | new task has fewer | pending         | 😕       | ⚠️ 😕    | no                 |
| differs  | differs     | new task has more  | pending         | 📛       | ⚠️ 😕    | no                 |
| differs  | differs     | same               | memory          | 😕       | 😕 [1]   | no                 |
| differs  | differs     | new task has fewer | memory          | 😕       | ⚠️ 😕    | no                 |
| differs  | differs     | new task has more  | memory          | 😕       | 😕 [1]   | no                 |
| differs  | differs     | *                  | released        | 😕       | ⚠️ 😕    | no                 |

[1] This is not great and may deserve a follow-up.
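For concreteness, a minimal sketch of the kind of collision the table classifies. The key name and values are illustrative, and which row you hit depends on timing (whether the first task is still pending or already in memory) and on the distributed version:

```python
from distributed import Client

client = Client(processes=False)

# Two graphs reuse the key "x" with different run_specs and different outputs
# (the "differs" / "differs" rows of the table above).
fut1 = client.get({"x": (sum, [1, 2])}, "x", sync=False)
fut2 = client.get({"x": (sum, [1, 2, 3])}, "x", sync=False)  # same key, different task

# The scheduler keeps the first definition, so fut2 is expected to resolve to 3
# rather than 6 (the 😕 outcome); since 2024.2.1 a warning is typically logged too.
print(fut1.result(), fut2.result())

client.close()
```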
