
Deterministic hashing for almost everything #10883

Merged
merged 8 commits into from Feb 13, 2024

Conversation

crusaderky
Collaborator

@crusaderky crusaderky commented Feb 1, 2024

@crusaderky crusaderky self-assigned this Feb 1, 2024
@crusaderky crusaderky force-pushed the deterministic_tokenize branch 2 times, most recently from c9e88c8 to 41cfcdb Compare February 1, 2024 18:30
Contributor

github-actions bot commented Feb 1, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 · 15 suites ±0 · 3h 27m 25s ⏱️ +9m 0s
13,021 tests +33 · 12,073 ✅ +14 · 930 💤 +1 · 18 ❌ +18
161,002 runs +495 · 144,478 ✅ +477 · 16,489 💤 −17 · 35 ❌ +35

For more details on these failures, see this check.

Results for commit 3c8a0b1. ± Comparison against base commit 8e10a14.

This pull request removes 4 tests and adds 37. Note that renamed tests count towards both.
dask.tests.test_tokenize ‑ test_normalize_function_limited_size
dask.tests.test_tokenize ‑ test_tokenize_local_functions
dask.tests.test_tokenize ‑ test_tokenize_numpy_ufunc_consistent
dask.tests.test_tokenize ‑ test_tokenize_pandas_no_pickle
dask.tests.test_tokenize ‑ test_local_objects
dask.tests.test_tokenize ‑ test_normalize_numpy_ufunc_unserializable
dask.tests.test_tokenize ‑ test_normalize_object_unserializable
dask.tests.test_tokenize ‑ test_tokenize_callable_class_with_tokenize_method
dask.tests.test_tokenize ‑ test_tokenize_functions_unique_token
dask.tests.test_tokenize ‑ test_tokenize_local_classes_from_different_contexts
dask.tests.test_tokenize ‑ test_tokenize_local_functions[<lambda>0]
dask.tests.test_tokenize ‑ test_tokenize_local_functions[<lambda>1]
dask.tests.test_tokenize ‑ test_tokenize_local_functions[<lambda>2]
dask.tests.test_tokenize ‑ test_tokenize_local_functions[<lambda>3]
…

♻️ This comment has been updated with latest results.

@crusaderky crusaderky changed the title [DNM] Deterministic hashing for almost everything Deterministic hashing for almost everything Feb 6, 2024
@crusaderky crusaderky marked this pull request as ready for review February 8, 2024 15:30
Contributor

@milesgranger milesgranger left a comment


Nothing stands out to me, looks good. Thanks, @crusaderky!

Comment on lines +1220 to +1222
if pik is None:
    buffers.clear()
    pik = cloudpickle.dumps(o, protocol=5, buffer_callback=buffers.append)
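For context, the lines under review are dask's fallback path. A minimal sketch of the pattern (try the standard pickler first, fall back to a more permissive serializer for objects it cannot handle), using a hypothetical `dumps_with_fallback` helper rather than dask's actual implementation:

```python
import pickle

def dumps_with_fallback(obj):
    # Hypothetical helper sketching the pattern in the diff above:
    # try the standard pickler first, and fall back to cloudpickle
    # (a more permissive serializer) only for objects that pickle
    # cannot handle, such as lambdas or locally defined classes.
    try:
        return pickle.dumps(obj, protocol=5)
    except Exception:
        import cloudpickle  # third-party; assumed available, as in dask
        return cloudpickle.dumps(obj, protocol=5)

# Picklable objects take the fast path and never touch cloudpickle
data = dumps_with_fallback({"a": [1, 2, 3]})
assert pickle.loads(data) == {"a": [1, 2, 3]}
```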
Member

main disallows using cloudpickle when ensure-deterministic is enabled; this behavior was introduced in #9135. I don't consider the example reported in #9135 sufficient motivation for this behavior, but cloudpickle itself states that it is not deterministic:

cloudpipe/cloudpickle#453
cloudpipe/cloudpickle#385

and I'm a bit on the fence about this since I do not know what to expect now.

Member

For instance, should this function be idempotent?

import dask
from dask.base import tokenize
dask.config.set({"tokenize.ensure-deterministic": True})

def foo():
    import pickle
    class A:
        pass
    return tokenize(A())

assert foo() == foo()  # Should this be True or False?

With this PR the assertion is False, since every invocation of foo returns a different token. On main this raises a RuntimeError.
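To illustrate why the tokens differ, no dask is needed: each call to foo re-executes the class statement, producing a brand-new class object with the same name but a different identity. A minimal plain-Python sketch:

```python
def foo():
    # Re-executing a `class` statement creates a fresh class object
    # every time, even though the source text is identical.
    class A:
        pass
    return A

first, second = foo(), foo()
assert first is not second                       # distinct class objects
assert first.__qualname__ == second.__qualname__ # yet same qualified name
```

Any token derived from the class's identity (or from a serializer that distinguishes the two definitions) will therefore differ between calls.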

Member

The intention of the RuntimeError is to alert the user that a call includes a custom object that does not implement the __dask_tokenize__ protocol; see #6555 for the first discussion.

Member

For the record, this dynamic class determinism is not working with cloudpickle due to a uuid that is being used internally, see https://github.com/cloudpipe/cloudpickle/blob/d003266b18336e1e603536bdbe6518bc2dcc00d3/cloudpickle/cloudpickle.py#L112
i.e. cloudpickle is intentionally distinguishing the two calls. Do we want or need to do the same? From a dask user perspective, I think both calls are basically identical.
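To make the mechanism concrete (this is a sketch, not cloudpickle's actual code): if a serializer mixes a fresh uuid into the payload of each dynamically created class, two textually identical definitions serialize to different bytes, so any hash derived from those bytes differs too:

```python
import hashlib
import uuid

def serialize_dynamic_class(cls):
    # Hypothetical sketch of the behavior discussed above: tag each
    # dynamically created class with a fresh uuid, so the serialized
    # bytes (and hence any token derived from them) differ even for
    # textually identical class definitions.
    salt = uuid.uuid4().bytes
    return salt + cls.__qualname__.encode()

def f():
    class C:
        pass
    return C

token_a = hashlib.sha1(serialize_dynamic_class(f())).hexdigest()
token_b = hashlib.sha1(serialize_dynamic_class(f())).hexdigest()
assert token_a != token_b  # non-deterministic by construction
```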

Maybe this doesn't matter; I don't know.

Member

My gut tells me it is safer to raise if we are not sure. At this time it is hard to estimate the impact on dask-expr, and removing an exception later on is easier than introducing one.

Collaborator Author

It's worth noting that your example fails for local instances, but passes for local classes and functions:

# PASS
def test_tokenize_local_classes_from_different_contexts():
    def f():
        class C:
            pass

        return C

    assert check_tokenize(f()) == check_tokenize(f())


# FAIL
def test_tokenize_local_instances_from_different_contexts():
    def f():
        class C:
            pass

        return C()

    assert check_tokenize(f()) == check_tokenize(f())


# PASS
def test_tokenize_local_functions_from_different_contexts():
    def f():
        def g():
            return 123

        return g

    assert check_tokenize(f()) == check_tokenize(f())

IMHO it's not terribly common to define whole classes in a local context and I can't think of a real-life scenario where this will make things break. I'm not saying impossible, just that for me it feels like an edge case.

I've added the three tests above and xfailed the instance one, linking the upstream cloudpickle issue.

Collaborator Author

It's important to note that the difference in token is caused by calling f() twice. If you call f() once and then pass the returned object through a cloudpickle round-trip, the token remains the same.

Collaborator Author

Also worth noting that assert f() == f() fails in all three cases above. So in the two cases that pass, we're already doing more than what the Python interpreter does natively.
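A minimal plain-Python illustration of that point: classes, instances, and functions produced by separate calls all compare unequal, because their default equality is identity:

```python
def make_class():
    class C:
        pass
    return C

def make_function():
    def g():
        return 123
    return g

# Default __eq__ falls back to identity, so each comparison is unequal:
assert make_class() != make_class()            # distinct class objects
assert make_class()() != make_class()()        # instances of distinct classes
assert make_function() != make_function()      # distinct function objects
```

This is why having tokenize() treat two such objects as equal goes beyond what the interpreter itself guarantees.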
