
Make tokenization more deterministic #10876

Merged: 3 commits merged into dask:main on Feb 6, 2024

Conversation

@crusaderky (Collaborator) commented on Jan 30, 2024

Cut-down variant of #10808, without the more problematic changes to normalize_object and normalize_callable:

  • Thoroughly test tokenize idempotency and determinism (same output after a pickle roundtrip)
  • Tougher tests in general
  • Avoid ambiguity between collections
  • Deterministic tokenization for collections containing circular references
  • Better tokenization for dataclasses, numpy objects, and sparse matrices

This PR changes all tokens. Expect downstream tests that assert against hardcoded dask keys to fail.
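As an illustration of the two properties being tested, here is a minimal sketch using only the public dask.base.tokenize API (the sample object is made up):

```python
import pickle

from dask.base import tokenize

obj = {"x": [1, 2, 3], "y": ("a", 4.5)}

# Idempotency: repeated calls on the same object yield the same token.
assert tokenize(obj) == tokenize(obj)

# Determinism: an equivalent object reconstructed via a pickle round-trip
# yields the same token as the original.
assert tokenize(obj) == tokenize(pickle.loads(pickle.dumps(obj)))
```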

@crusaderky crusaderky marked this pull request as draft January 30, 2024 15:32
@crusaderky crusaderky self-assigned this Jan 30, 2024
github-actions bot (Contributor) commented on Jan 30, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 · 15 suites ±0 · ⏱️ 3h 21m 28s (−50s)
12 987 tests +1 · 12 058 ✅ +2 · 929 💤 ±0 · 0 ❌ −1
160 492 runs +15 · 143 983 ✅ +32 · 16 509 💤 −16 · 0 ❌ −1

Results for commit 052d1cf. ± Comparison against base commit 97f47b4.

This pull request removes 13 and adds 14 tests. Note that renamed tests count towards both.
dask.tests.test_delayed ‑ test_name_consistent_across_instances
dask.tests.test_tokenize ‑ test_normalize_function_dataclass_field_no_repr
dask.tests.test_tokenize ‑ test_tokenize_datetime_date
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[bsr]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[coo]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[csc]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[csr]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[dia]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[lil]
dask.tests.test_tokenize ‑ test_tokenize_function_cloudpickle
…
dask.tests.test_delayed ‑ test_deterministic_name
dask.tests.test_tokenize ‑ test_check_tokenize
dask.tests.test_tokenize ‑ test_empty_numpy_array
dask.tests.test_tokenize ‑ test_tokenize_callable_class
dask.tests.test_tokenize ‑ test_tokenize_circular_recursion
dask.tests.test_tokenize ‑ test_tokenize_dataclass_field_no_repr
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other0]
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other1]
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other2]
dask.tests.test_tokenize ‑ test_tokenize_local_functions
…

♻️ This comment has been updated with latest results.

@crusaderky crusaderky force-pushed the tokenize_without_object branch 5 times, most recently from f530dbb to bf08d1e Compare January 31, 2024 12:57
@crusaderky crusaderky changed the title Make more tokenizations deterministic Make tokenization more deterministic Jan 31, 2024
@crusaderky crusaderky force-pushed the tokenize_without_object branch 2 times, most recently from 690460e to 7184889 Compare January 31, 2024 16:37
Comment on lines +1080 to +1082
# This variable is recreated anew every time you call tokenize(). Note that this means
# that you could call tokenize() from inside tokenize() and they would be fully
# independent.
@crusaderky (Collaborator, Author) commented:

dask-expr actually does this. A previous version created the seen dict in the outermost call to _normalize_seq_func instead of in tokenize, and that made a test in dask-expr fail.
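For context, a hypothetical sketch of the pattern being discussed (names and structure are illustrative, not dask's actual internals): the seen mapping that breaks circular references is created afresh inside each tokenize() call, so a nested call is fully independent:

```python
def tokenize(obj):
    # Recreated anew on every call: a nested tokenize() invocation
    # (as dask-expr performs) gets fully independent state.
    seen = {}
    return str(_normalize(obj, seen))


def _normalize(obj, seen):
    if id(obj) in seen:
        # Circular reference: emit a stable placeholder instead of recursing.
        return ("__seen__", seen[id(obj)])
    if isinstance(obj, (list, tuple, dict)):
        seen[id(obj)] = len(seen)
        items = obj.items() if isinstance(obj, dict) else enumerate(obj)
        return tuple((k, _normalize(v, seen)) for k, v in items)
    return repr(obj)
```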

@crusaderky crusaderky marked this pull request as ready for review January 31, 2024 17:06
crusaderky added a commit to fjetter/dask that referenced this pull request Feb 1, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 2, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 5, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 5, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 5, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 6, 2024
dask/tests/test_base.py (outdated; review thread resolved)
function_cache.clear()


def tokenize_roundtrip(*args, idempotent=True, deterministic=None, copy=None, **kwargs):
@hendrikmakait (Member) commented:

Why is this called _roundtrip? I don't see any actual "roundtripping" in there; am I missing something?

Also, this function does a lot under the hood. I wonder whether it would be easier to understand what's being tested if this were more explicit. That would result in significantly more test code, but personally I prefer DAMP tests over DRY ones, so I don't see it as much of a problem. If not, is there an easy way to test tokenize_roundtrip itself? Basically all our testing relies on it working, so having a dedicated unit test would make me feel more comfortable.

@crusaderky (Collaborator, Author) replied on Feb 6, 2024:

It tests tokenization after a pickle round-trip.
Not too happy about the name either. Should we just rename it to check_tokenize?
I'm very happy about how encapsulated it is though. It allowed a huge amount of issues to crop up with tokenization of the various object types.
Added unit tests for it.

@hendrikmakait (Member) replied:

> Not too happy about the name either. Should we just rename it to check_tokenize?

Yes, something like check_tokenize, checked_tokenize, or asserting_tokenize would sound better to me. I'll let you choose!

> I'm very happy about how DAMP it is though. It caused a huge amount of issues to crop up with tokenization of the various object types.

Don't get me wrong, it looks a lot better than what we had before! After renaming and adding unit tests for the function, this should be clear enough.

@crusaderky (Collaborator, Author) replied:

All done!

dask/base.py (outdated; review thread resolved)
@hendrikmakait (Member) commented on Feb 6, 2024:

CI on mindeps isn't happy with test_check_tokenize.

@crusaderky (Collaborator, Author) replied:

> CI on mindeps isn't happy with test_check_tokenize.

Should be fixed.

@hendrikmakait (Member) left a review:

Thanks, @crusaderky! Consider the nits non-blocking and the comment just a clarification for my understanding.

dask/tests/test_tokenize.py (review thread resolved)
dask/tests/test_tokenize.py (outdated; review thread resolved)
@@ -763,6 +754,15 @@ def dispatch(self, cls):
register()
self._lazy.pop(toplevel, None)
return self.dispatch(cls) # recurse
try:
@hendrikmakait (Member) commented:

Why are we switching the order here? What went wrong before?

@crusaderky (Collaborator, Author) replied:

To fix mindeps.

Commits added after review:

  • Update dask/tests/test_tokenize.py (Co-authored-by: Hendrik Makait <hendrik@makait.com>)
  • Update dask/tests/test_tokenize.py (Co-authored-by: Hendrik Makait <hendrik@makait.com>)
  • nit
  • lint
@crusaderky crusaderky merged commit f51fa77 into dask:main Feb 6, 2024
26 of 27 checks passed
@crusaderky crusaderky deleted the tokenize_without_object branch February 6, 2024 17:50
@maartenbreddels (Contributor) commented:

> Expect downstream tests that assert against hardcoded dask keys to fail.

Hi,

This happened at vaexio/vaex#2331 indeed. Do you know if this was unavoidable, or can this be fixed?

Regards,

Maarten

@maartenbreddels (Contributor) commented:

Also, are there plans to keep tokens stable in the future, or will dask not make this guarantee? (Our cache keys depend on it, and our CI will fail if they change.)

@crusaderky (Collaborator, Author) replied:

@maartenbreddels I would expect token output to change infrequently (although it will change again in the next release, after all PRs of #10905 are merged). As a rule of thumb, you should not rely on it being stable across releases. In fact, it's not even guaranteed to be stable across interpreter restarts; it just happens to be in most use cases.

I advise tweaking your tests so that they don't compare against hardcoded token output; instead, they should verify that two identical objects produce the same token and that two different objects produce different tokens.
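For example, a downstream test in that style might look like this (an illustrative sketch, not vaex's actual code):

```python
from dask.base import tokenize


def test_tokens_are_consistent():
    a = {"col": [1, 2, 3]}
    b = {"col": [1, 2, 3]}  # structurally identical to a
    c = {"col": [1, 2, 4]}  # differs in content

    assert tokenize(a) == tokenize(b)  # identical objects -> same token
    assert tokenize(a) != tokenize(c)  # different objects -> different tokens
```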

@maartenbreddels (Contributor) replied:

> In fact, it's not even guaranteed to be stable across interpreter restarts

Why is that? If you use dask and persist something in your cluster, you'd want to get the same hash key/fingerprint, right? At least, that was my assumption, and therefore we built on top of dask's hashing feature.

@crusaderky (Collaborator, Author) replied:

> Why is that?

tokenize() now hashes pickle or cloudpickle output for unknown objects (including local functions), and cloudpickle does not guarantee cross-interpreter determinism in many edge cases.
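For instance (an illustrative sketch; make_adder is made up), a local closure falls back to cloudpickle-based tokenization, so its token is only as reproducible as the cloudpickle byte stream itself:

```python
import cloudpickle

from dask.base import tokenize


def make_adder(n):
    def add(x):  # local function: tokenized by hashing its cloudpickle bytes
        return x + n

    return add


f = make_adder(1)

# Stable within this interpreter session...
assert tokenize(f) == tokenize(f)

# ...but the serialized bytes, and hence the token, are not guaranteed to be
# identical in a different interpreter, OS, or cloudpickle version.
print(len(cloudpickle.dumps(f)))
```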

Add to this that one may commonly run the Client on Windows or macOS and the scheduler and workers on Linux, which introduces an extra layer of uncertainty: again, pickle/cloudpickle output should not be expected to be identical across OSes, or across different versions of the 200+ packages that are typically deployed, unpinned, in a dask environment.

> If you use dask and persist something in your cluster, you'd want to get the same hash key/fingerprint, right?

tokenize() is designed to generate unique graph keys at graph definition time, which happens on the client. Starting from dask/distributed#8185, it is also going to be used on the scheduler to verify that the run_spec (the values of the dask graph) is identical whenever keys are identical.

It was never designed to produce a stable cross-interpreter, cross-host, cross-OS, cross-version fingerprint of arbitrary data.
