
Make tokenization more deterministic #10876

Merged: 3 commits merged into dask:main on Feb 6, 2024

Conversation

@crusaderky (Collaborator) commented on Jan 30, 2024

Cut-down variant of #10808, without the more problematic changes to normalize_object and normalize_callable:

  • Thoroughly test tokenize idempotency and determinism (same output after a pickle roundtrip)
  • Tougher tests in general
  • Avoid ambiguity between collections
  • Deterministic tokenization for collections containing circular references
  • Better tokenization for dataclasses, numpy objects, and sparse matrices

This PR changes all tokens. Expect downstream tests that assert against hardcoded dask keys to fail.
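As an illustration of the two properties being tested, here is a minimal sketch using only the public dask.base.tokenize API (the sample object is made up):

```python
import pickle

from dask.base import tokenize

obj = {"x": [1, 2, 3], "y": ("a", 4.5)}

# Idempotency: repeated calls on the same object yield the same token.
assert tokenize(obj) == tokenize(obj)

# Determinism: an equivalent object reconstructed via a pickle round-trip
# yields the same token as the original.
assert tokenize(obj) == tokenize(pickle.loads(pickle.dumps(obj)))
```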

@crusaderky crusaderky marked this pull request as draft January 30, 2024 15:32
@crusaderky crusaderky self-assigned this Jan 30, 2024
github-actions bot (Contributor) commented on Jan 30, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 · 15 suites ±0 · ⏱️ 3h 21m 28s (−50s)
12 987 tests +1 · 12 058 ✅ +2 · 929 💤 ±0 · 0 ❌ −1
160 492 runs +15 · 143 983 ✅ +32 · 16 509 💤 −16 · 0 ❌ −1

Results for commit 052d1cf. ± Comparison against base commit 97f47b4.

This pull request removes 13 and adds 14 tests. Note that renamed tests count towards both.
dask.tests.test_delayed ‑ test_name_consistent_across_instances
dask.tests.test_tokenize ‑ test_normalize_function_dataclass_field_no_repr
dask.tests.test_tokenize ‑ test_tokenize_datetime_date
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[bsr]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[coo]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[csc]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[csr]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[dia]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[lil]
dask.tests.test_tokenize ‑ test_tokenize_function_cloudpickle
…
dask.tests.test_delayed ‑ test_deterministic_name
dask.tests.test_tokenize ‑ test_check_tokenize
dask.tests.test_tokenize ‑ test_empty_numpy_array
dask.tests.test_tokenize ‑ test_tokenize_callable_class
dask.tests.test_tokenize ‑ test_tokenize_circular_recursion
dask.tests.test_tokenize ‑ test_tokenize_dataclass_field_no_repr
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other0]
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other1]
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other2]
dask.tests.test_tokenize ‑ test_tokenize_local_functions
…

♻️ This comment has been updated with latest results.

@crusaderky crusaderky force-pushed the tokenize_without_object branch 5 times, most recently from f530dbb to bf08d1e Compare January 31, 2024 12:57
@crusaderky crusaderky changed the title Make more tokenizations deterministic Make tokenization more deterministic Jan 31, 2024
@crusaderky crusaderky force-pushed the tokenize_without_object branch 2 times, most recently from 690460e to 7184889 Compare January 31, 2024 16:37
Comment on lines +1080 to +1082
# This variable is recreated anew every time you call tokenize(). Note that this means
# that you could call tokenize() from inside tokenize() and they would be fully
# independent.
@crusaderky (Collaborator, Author) commented:

dask-expr actually does this. A previous version created the seen dict in the outermost call to _normalize_seq_func instead of in tokenize, and that made a test in dask-expr fail.
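For context, a hypothetical sketch of the pattern being discussed (names and structure are illustrative, not dask's actual internals): the seen mapping that breaks circular references is created afresh inside each tokenize() call, so a nested call is fully independent:

```python
def tokenize(obj):
    # Recreated anew on every call: a nested tokenize() invocation
    # (as dask-expr performs) gets fully independent state.
    seen = {}
    return str(_normalize(obj, seen))


def _normalize(obj, seen):
    if id(obj) in seen:
        # Circular reference: emit a stable placeholder instead of recursing.
        return ("__seen__", seen[id(obj)])
    if isinstance(obj, (list, tuple, dict)):
        seen[id(obj)] = len(seen)
        items = obj.items() if isinstance(obj, dict) else enumerate(obj)
        return tuple((k, _normalize(v, seen)) for k, v in items)
    return repr(obj)
```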

@crusaderky crusaderky marked this pull request as ready for review January 31, 2024 17:06
crusaderky added a commit to fjetter/dask that referenced this pull request Feb 1, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 2, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 5, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 5, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 5, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 6, 2024
dask/tests/test_base.py (outdated; review thread resolved)
function_cache.clear()


def tokenize_roundtrip(*args, idempotent=True, deterministic=None, copy=None, **kwargs):
@hendrikmakait (Member) commented:

Why is this called _roundtrip? I don't see any actual "roundtripping" in there; am I missing something?

Also, this function does a lot under the hood. I wonder whether it would be easier to understand what's being tested if this were more explicit. That would result in significantly more test code, but personally I prefer DAMP tests over DRY ones, so I don't see it as much of a problem. If not, is there an easy way to test tokenize_roundtrip itself? Basically all our testing relies on it working, so having a dedicated unit test would make me feel more comfortable.

@crusaderky (Collaborator, Author) replied on Feb 6, 2024:

It tests tokenization after a pickle round-trip.
Not too happy about the name either. Should we just rename it to check_tokenize?
I'm very happy about how encapsulated it is though. It allowed a huge amount of issues to crop up with tokenization of the various object types.
Added unit tests for it.

@hendrikmakait (Member) replied:

> Not too happy about the name either. Should we just rename it to check_tokenize?

Yes, something like check_tokenize, checked_tokenize, or asserting_tokenize would sound better to me. I'll let you choose!

> I'm very happy about how DAMP it is though. It caused a huge amount of issues to crop up with tokenization of the various object types.

Don't get me wrong, it looks a lot better than what we had before! After renaming and adding unit tests for the function, this should be clear enough.

@crusaderky (Collaborator, Author) replied:

All done!

dask/base.py (outdated; review thread resolved)
@hendrikmakait (Member) commented on Feb 6, 2024:

CI on mindeps isn't happy with test_check_tokenize.

@crusaderky (Collaborator, Author) replied:

> CI on mindeps isn't happy with test_check_tokenize.

Should be fixed.

@hendrikmakait (Member) left a review:

Thanks, @crusaderky! Consider the nits non-blocking and the comment just a clarification for my understanding.

dask/tests/test_tokenize.py (review thread resolved)
dask/tests/test_tokenize.py (outdated; review thread resolved)
@@ -763,6 +754,15 @@ def dispatch(self, cls):
register()
self._lazy.pop(toplevel, None)
return self.dispatch(cls) # recurse
try:
@hendrikmakait (Member) commented:

Why are we switching the order here? What went wrong before?

@crusaderky (Collaborator, Author) replied:

To fix mindeps.

Commits added after review:

  • Update dask/tests/test_tokenize.py (Co-authored-by: Hendrik Makait <hendrik@makait.com>)
  • Update dask/tests/test_tokenize.py (Co-authored-by: Hendrik Makait <hendrik@makait.com>)
  • nit
  • lint
@crusaderky crusaderky merged commit f51fa77 into dask:main Feb 6, 2024
26 of 27 checks passed
@crusaderky crusaderky deleted the tokenize_without_object branch February 6, 2024 17:50
@maartenbreddels (Contributor) commented:

> Expect downstream tests that assert against hardcoded dask keys to fail.

Hi,

This happened at vaexio/vaex#2331 indeed. Do you know if this was unavoidable, or can this be fixed?

Regards,

Maarten

@maartenbreddels (Contributor) commented:

Also, are there plans to keep tokens stable in the future, or will dask not make this guarantee? (Our cache keys depend on it, and our CI will fail if they change.)

@crusaderky (Collaborator, Author) replied:

@maartenbreddels I would expect token output to change infrequently (although it will change again in the next release, after all PRs of #10905 are merged). As a rule of thumb, you should not rely on it being stable across releases. In fact, it's not even guaranteed to be stable across interpreter restarts; it just happens to be in most use cases.

I advise tweaking your tests so that they don't compare against hardcoded token output; instead, they should verify that two identical objects produce the same token and that two different objects produce different tokens.
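For example, a downstream test in that style might look like this (an illustrative sketch, not vaex's actual code):

```python
from dask.base import tokenize


def test_tokens_are_consistent():
    a = {"col": [1, 2, 3]}
    b = {"col": [1, 2, 3]}  # structurally identical to a
    c = {"col": [1, 2, 4]}  # differs in content

    assert tokenize(a) == tokenize(b)  # identical objects -> same token
    assert tokenize(a) != tokenize(c)  # different objects -> different tokens
```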

@maartenbreddels (Contributor) replied:

> In fact, it's not even guaranteed to be stable across interpreter restarts

Why is that? If you use dask and persist something in your cluster, you'd want to get the same hash key/fingerprint, right? At least, that was my assumption, and therefore we built on top of dask's hashing feature.

@crusaderky (Collaborator, Author) replied:

> Why is that?

tokenize() now hashes pickle or cloudpickle output for unknown objects (including local functions), and cloudpickle does not guarantee cross-interpreter determinism in many edge cases.
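For instance (an illustrative sketch; make_adder is made up), a local closure falls back to cloudpickle-based tokenization, so its token is only as reproducible as the cloudpickle byte stream itself:

```python
import cloudpickle

from dask.base import tokenize


def make_adder(n):
    def add(x):  # local function: tokenized by hashing its cloudpickle bytes
        return x + n

    return add


f = make_adder(1)

# Stable within this interpreter session...
assert tokenize(f) == tokenize(f)

# ...but the serialized bytes, and hence the token, are not guaranteed to be
# identical in a different interpreter, OS, or cloudpickle version.
print(len(cloudpickle.dumps(f)))
```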

Add to this that one may commonly run the Client on Windows or macOS and the scheduler and workers on Linux, which introduces an extra layer of uncertainty: again, pickle/cloudpickle output should not be expected to be identical across OSes, or across different versions of the 200+ packages that are typically deployed, unpinned, in a dask environment.

> If you use dask and persist something in your cluster, you'd want to get the same hash key/fingerprint, right?

tokenize() is designed to generate unique graph keys at graph definition time, which happens on the client. Starting from dask/distributed#8185, it is also going to be used on the scheduler to verify that the run_spec (the values of the dask graph) is identical whenever keys are identical.

It was never designed to produce a stable cross-interpreter, cross-host, cross-OS, cross-version fingerprint of arbitrary data.
