Make tokenization more deterministic #10876
Conversation
Force-pushed: f9248fa → 2866e5a → e1b6e48.
Unit Test Results — see the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. 15 files ±0, 15 suites ±0, 3h 21m 28s ⏱️ −50s. Results for commit 052d1cf; comparison against base commit 97f47b4. This pull request removes 13 tests and adds 14. Note that renamed tests count towards both.

♻️ This comment has been updated with latest results.
Force-pushed from f530dbb to bf08d1e, and later from 690460e to 7184889.
# This variable is recreated anew every time you call tokenize(). Note that this means
# that you could call tokenize() from inside tokenize() and they would be fully
# independent.
dask-expr actually does this. A previous version was creating the `seen` dict in the outermost call to `_normalize_seq_func` instead of in `tokenize`, and that was making a test in dask-expr fail.
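As a minimal illustration of the property being discussed (a hypothetical sketch, not dask's actual implementation): because the memoization dict is created inside `tokenize()` itself rather than in a helper or at module scope, a re-entrant call builds a fresh dict and the two invocations share no state.

```python
def tokenize(*args):
    # Recreated anew on every call: a nested tokenize() call gets its
    # own `seen` dict, so inner and outer invocations are independent.
    seen = {}

    def _normalize(obj):
        # Memoize by id() within this call only (cycle handling omitted).
        key = id(obj)
        if key in seen:
            return seen[key]
        if isinstance(obj, (list, tuple)):
            result = (type(obj).__name__,) + tuple(_normalize(o) for o in obj)
        else:
            result = repr(obj)
        seen[key] = result
        return result

    return str(hash(_normalize(args)))
```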
Force-pushed repeatedly: 376c1db → 733a03d → 6d9a7cb → 25cf33f → a190d9f → ca8a0ff → 0f07cdf.
dask/tests/test_tokenize.py (outdated)

```python
function_cache.clear()


def tokenize_roundtrip(*args, idempotent=True, deterministic=None, copy=None, **kwargs):
```
Why is this called `_roundtrip`? I don't see any actual "roundtripping" in there, am I missing something?

Also, this function does a lot under the hood; I'm wondering if it would be easier to understand what's being tested if this were more explicit. This would result in significantly more test code, but personally I prefer DAMP tests over DRY ones, so I don't see this as much of a problem. If not, is there an easy way to test `tokenize_roundtrip`? Basically all our testing relies on it to work, so having a dedicated unit test would make me feel more comfortable.
It tests tokenization after a pickle round-trip.

Not too happy about the name either. Should we just rename it to `check_tokenize`?

I'm very happy about how encapsulated it is though. It allowed a huge number of issues with tokenization of the various object types to crop up.

Added unit tests for it.
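For reference, a hedged sketch of what such a helper might look like (names and details assumed here, not the actual test code from the PR): tokenize the inputs, pickle-round-trip them, tokenize again, and assert that both passes agree.

```python
import pickle

from dask.base import tokenize


def check_tokenize(*args, **kwargs):
    """Sketch: assert that tokenize() is idempotent and that tokens
    survive a pickle round-trip of the inputs."""
    before = tokenize(*args, **kwargs)
    # Idempotency: tokenizing the same inputs twice gives the same token.
    assert tokenize(*args, **kwargs) == before
    # Round-trip: the token must not change after pickling the inputs.
    args2, kwargs2 = pickle.loads(pickle.dumps((args, kwargs)))
    assert tokenize(*args2, **kwargs2) == before
    return before
```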
> Not too happy about the name either. Should we just rename it to `check_tokenize`?

Yes, something like `check_tokenize`, `checked_tokenize`, or `asserting_tokenize` would sound better to me. I'll let you choose!

> I'm very happy about how DAMP it is though. It caused a huge amount of issues to crop up with tokenization of the various object types.

Don't get me wrong, it looks a lot better than what we had before! After renaming and adding unit tests for the function, this should be clear enough.
All done!
CI on … should be fixed.
Thanks, @crusaderky! Consider the nits non-blocking and the comment just a clarification for my understanding.
```python
@@ -763,6 +754,15 @@ def dispatch(self, cls):
            register()
            self._lazy.pop(toplevel, None)
            return self.dispatch(cls)  # recurse
        try:
```
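For context on the code being discussed, here is a hypothetical, heavily simplified sketch of the lazy-registration dispatch pattern this hunk touches (not dask's actual `Dispatch` code): handlers for a heavy dependency are registered only once a class from that package's top-level module is first dispatched.

```python
class Dispatch:
    """Simplified sketch of a type dispatcher with lazy registration."""

    def __init__(self):
        self._lookup = {}  # cls -> handler
        self._lazy = {}    # top-level module name -> registration callback

    def register(self, cls, func):
        self._lookup[cls] = func

    def register_lazy(self, toplevel, func):
        self._lazy[toplevel] = func

    def dispatch(self, cls):
        for cls2 in cls.__mro__:
            # Run any pending lazy registration for this class's package,
            # then restart dispatch so new registrations are considered.
            toplevel = cls2.__module__.split(".")[0]
            if toplevel in self._lazy:
                register = self._lazy.pop(toplevel)
                register()
                return self.dispatch(cls)  # recurse
            if cls2 in self._lookup:
                return self._lookup[cls2]
        raise TypeError(f"No dispatch for {cls}")
```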
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we switching the order here? What went wrong before?
Force-pushed from 3019d13 to bcb2a3e.
Fix mindeps
Update dask/tests/test_tokenize.py (Co-authored-by: Hendrik Makait <hendrik@makait.com>)
Update dask/tests/test_tokenize.py (Co-authored-by: Hendrik Makait <hendrik@makait.com>)
nit
lint
Force-pushed from 6bbf1dc to 052d1cf.
Hi, this happened at vaexio/vaex#2331 indeed. Do you know if this was unavoidable, or can this be fixed? Regards, Maarten
Also, are there plans to keep them stable in the future, or will dask not make this guarantee? (Our cache keys depend on that, and our CI will fail if they change.)
@maartenbreddels I would expect token output to change infrequently (although it will change again in the next release, after all PRs of #10905 get merged). As a rule of thumb, you should not rely on it being stable across releases. In fact, it's not even guaranteed to be stable across interpreter restarts; it just happens to be in most use cases. I advise tweaking your tests so that they don't assert against hardcoded token output; instead, they should verify that two identical objects produce the same token and that two different objects produce different tokens, as in the sketch below.
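A small sketch of the suggested test style (illustrative, with made-up data): compare tokens of equal and of different inputs rather than comparing against a hardcoded token.

```python
from dask.base import tokenize


def test_token_properties():
    a = {"x": 1, "y": [1, 2, 3]}
    b = {"x": 1, "y": [1, 2, 3]}  # equal to `a`, built separately
    c = {"x": 2, "y": [1, 2, 3]}  # differs from `a`

    assert tokenize(a) == tokenize(b)  # identical objects -> same token
    assert tokenize(a) != tokenize(c)  # different objects -> different tokens
```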
Why is that? If you use dask and persist something in your cluster, you'd want to have the same hash key/fingerprint, right? At least, this was my assumption, and therefore we built on top of dask's hashing feature.
tokenize() now hashes pickle or cloudpickle output for unknown objects (including local functions). Add to this that one may (commonly) run the Client on Windows or macOS and the scheduler and workers on Linux, which introduces an extra layer of uncertainty in the tokenization process because, again, pickle/cloudpickle output should not be expected to be identical across OSes, or across different versions of the 200+ packages that are typically deployed, unpinned, in a dask environment.

tokenize() is designed to generate unique graph keys at graph definition time, which happens on the client. It was never designed to produce a stable, cross-interpreter, cross-host, cross-OS, cross-version fingerprint of arbitrary data.
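To illustrate the distinction (a sketch of the behaviour described above): for an object tokenized via its (cloud)pickle bytes, the token is stable within one interpreter session, but it should not be expected to match across interpreters, hosts, OSes, or library versions.

```python
from dask.base import tokenize


def f(x):  # an "unknown" object: tokenized by hashing its pickle bytes
    return x + 1


t1 = tokenize(f)
t2 = tokenize(f)
assert t1 == t2  # deterministic within this interpreter session
# But do not persist t1 as a cache key: the "same" function in another
# interpreter, OS, or environment may pickle to different bytes and
# therefore produce a different token.
```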
Cut-down variant of #10808, without the more problematic changes to `normalize_object` and `normalize_callable`.

This PR changes all tokens. Expect downstream tests that assert against hardcoded dask keys to fail.