New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deterministic hashing for almost everything #10883
Conversation
c9e88c8
to
41cfcdb
Compare
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 15 files ± 0 15 suites ±0 3h 27m 25s ⏱️ + 9m 0s For more details on these failures, see this check. Results for commit 3c8a0b1. ± Comparison against base commit 8e10a14. This pull request removes 4 and adds 37 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
efe7a55
to
7d1538d
Compare
6925895
to
a48cca7
Compare
67dabc1
to
1237a45
Compare
d2db906
to
dfd8700
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing stands out to me, looks good. Thanks, @crusaderky!
if pik is None: | ||
buffers.clear() | ||
pik = cloudpickle.dumps(o, protocol=5, buffer_callback=buffers.append) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main
disallows using cloudpickle when ensure-deterministic
is enabled. This behavior was introduced in #9135. I don't consider the example reported in #9135 sufficient motivation for this behavior but cloudpickle itself states that it is not deterministic
cloudpipe/cloudpickle#453
cloudpipe/cloudpickle#385
and I'm a bit on the fence about this since I do not know what to expect now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For instance, should this function be idempotent?
import dask
from dask.base import tokenize
dask.config.set({"tokenize.ensure-deterministic": True})
def foo():
import pickle
class A:
pass
return tokenize(A())
assert foo() == foo() # Should this be True or False?
For this PR this is False
since every invocation of foo
returns a different token. On main
this is raising a RuntimeError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention of the RuntimeError
is to alert a user that a call includes a custom object that does not implement the __dask_tokenize__
protocol, see #6555 for the first discussion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, this dynamic class determinism is not working with cloudpickle due to a uuid that is being used internally, see https://github.com/cloudpipe/cloudpickle/blob/d003266b18336e1e603536bdbe6518bc2dcc00d3/cloudpickle/cloudpickle.py#L112
i.e. cloudpickle is intentionally distinguishing the two calls. Do we want or need to do the same? From a dask user perspective, I think both calls are basically identical.
Maybe this doesn't matter, I don't know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my gut tells me it is safer to raise if we are not sure. At this time it is hard to estimate the impact on dask-expr later and removing an exception later on is easier than introducing one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth noting that your example fails for local instances, but passes for local classes and functions:
# PASS
def test_tokenize_local_classes_from_different_contexts():
def f():
class C:
pass
return C
assert check_tokenize(f()) == check_tokenize(f())
# FAIL
def test_tokenize_local_instances_from_different_contexts():
def f():
class C:
pass
return C()
assert check_tokenize(f()) == check_tokenize(f())
# PASS
def test_tokenize_local_functions_from_different_contexts():
def f():
def g():
return 123
return g
assert check_tokenize(f()) == check_tokenize(f())
IMHO it's not terribly common to define whole classes in a local context and I can't think of a real-life scenario where this will make things break. I'm not saying impossible, just that for me it feels like an edge case.
I've added the three tests above and xfailed the instance one, linking the upstream cloudpickle issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important to note is that the difference in token is generated by calling f() twice. If you call f() once, then pass the returned object through a cloudpickle round-trip, the token remains the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth noting that assert f() == f()
fails in all three above cases. So to me in the two cases that pass we're already doing more than what the Python interpreter does natively.
Note: without this dask-expr PR, 3.12 non-dask-expr CI fails too, because the dask_expr module is loaded anyway.