
Deterministic hashing for almost everything #10883

Merged
merged 8 commits into from Feb 13, 2024

Conversation

crusaderky
Collaborator

@crusaderky crusaderky commented Feb 1, 2024

@crusaderky crusaderky self-assigned this Feb 1, 2024
@crusaderky crusaderky force-pushed the deterministic_tokenize branch 2 times, most recently from c9e88c8 to 41cfcdb Compare February 1, 2024 18:30
Contributor

github-actions bot commented Feb 1, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 · 15 suites ±0 · 3h 27m 25s ⏱️ +9m 0s
13,021 tests +33 · 12,073 ✅ +14 · 930 💤 +1 · 18 ❌ +18
161,002 runs +495 · 144,478 ✅ +477 · 16,489 💤 −17 · 35 ❌ +35

For more details on these failures, see this check.

Results for commit 3c8a0b1. ± Comparison against base commit 8e10a14.

This pull request removes 4 tests and adds 37. Note that renamed tests count towards both.
dask.tests.test_tokenize ‑ test_normalize_function_limited_size
dask.tests.test_tokenize ‑ test_tokenize_local_functions
dask.tests.test_tokenize ‑ test_tokenize_numpy_ufunc_consistent
dask.tests.test_tokenize ‑ test_tokenize_pandas_no_pickle
dask.tests.test_tokenize ‑ test_local_objects
dask.tests.test_tokenize ‑ test_normalize_numpy_ufunc_unserializable
dask.tests.test_tokenize ‑ test_normalize_object_unserializable
dask.tests.test_tokenize ‑ test_tokenize_callable_class_with_tokenize_method
dask.tests.test_tokenize ‑ test_tokenize_functions_unique_token
dask.tests.test_tokenize ‑ test_tokenize_local_classes_from_different_contexts
dask.tests.test_tokenize ‑ test_tokenize_local_functions[<lambda>0]
dask.tests.test_tokenize ‑ test_tokenize_local_functions[<lambda>1]
dask.tests.test_tokenize ‑ test_tokenize_local_functions[<lambda>2]
dask.tests.test_tokenize ‑ test_tokenize_local_functions[<lambda>3]
…

♻️ This comment has been updated with latest results.

@crusaderky crusaderky changed the title [DNM] Deterministic hashing for almost everything Deterministic hashing for almost everything Feb 6, 2024
@crusaderky crusaderky marked this pull request as ready for review February 8, 2024 15:30
Contributor

@milesgranger milesgranger left a comment


Nothing stands out to me, looks good. Thanks, @crusaderky!

Comment on lines +1220 to +1222
if pik is None:
    buffers.clear()
    pik = cloudpickle.dumps(o, protocol=5, buffer_callback=buffers.append)
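For context, the lines under review are dask's fallback path. A minimal sketch of the pattern (try the standard pickler first, fall back to a more permissive serializer for objects it cannot handle), using a hypothetical `dumps_with_fallback` helper rather than dask's actual implementation:

```python
import pickle

def dumps_with_fallback(obj):
    # Hypothetical helper sketching the pattern in the diff above:
    # try the standard pickler first, and fall back to cloudpickle
    # (a more permissive serializer) only for objects that pickle
    # cannot handle, such as lambdas or locally defined classes.
    try:
        return pickle.dumps(obj, protocol=5)
    except Exception:
        import cloudpickle  # third-party; assumed available, as in dask
        return cloudpickle.dumps(obj, protocol=5)

# Picklable objects take the fast path and never touch cloudpickle
data = dumps_with_fallback({"a": [1, 2, 3]})
assert pickle.loads(data) == {"a": [1, 2, 3]}
```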
Member

main disallows using cloudpickle when ensure-deterministic is enabled; this behavior was introduced in #9135. I don't consider the example reported in #9135 sufficient motivation for this behavior, but cloudpickle itself states that it is not deterministic:

cloudpipe/cloudpickle#453
cloudpipe/cloudpickle#385

and I'm a bit on the fence about this since I do not know what to expect now.

Member

For instance, should this function be idempotent?

import dask
from dask.base import tokenize
dask.config.set({"tokenize.ensure-deterministic": True})

def foo():
    import pickle
    class A:
        pass
    return tokenize(A())

assert foo() == foo()  # Should this be True or False?

With this PR the assertion is False, since every invocation of foo returns a different token. On main this raises a RuntimeError.
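To illustrate why the tokens differ, no dask is needed: each call to foo re-executes the class statement, producing a brand-new class object with the same name but a different identity. A minimal plain-Python sketch:

```python
def foo():
    # Re-executing a `class` statement creates a fresh class object
    # every time, even though the source text is identical.
    class A:
        pass
    return A

first, second = foo(), foo()
assert first is not second                       # distinct class objects
assert first.__qualname__ == second.__qualname__ # yet same qualified name
```

Any token derived from the class's identity (or from a serializer that distinguishes the two definitions) will therefore differ between calls.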

Member

The intention of the RuntimeError is to alert the user that a call includes a custom object that does not implement the __dask_tokenize__ protocol; see #6555 for the first discussion.

Member

For the record, this dynamic class determinism is not working with cloudpickle due to a uuid that is being used internally, see https://github.com/cloudpipe/cloudpickle/blob/d003266b18336e1e603536bdbe6518bc2dcc00d3/cloudpickle/cloudpickle.py#L112
i.e. cloudpickle is intentionally distinguishing the two calls. Do we want or need to do the same? From a dask user perspective, I think both calls are basically identical.
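To make the mechanism concrete (this is a sketch, not cloudpickle's actual code): if a serializer mixes a fresh uuid into the payload of each dynamically created class, two textually identical definitions serialize to different bytes, so any hash derived from those bytes differs too:

```python
import hashlib
import uuid

def serialize_dynamic_class(cls):
    # Hypothetical sketch of the behavior discussed above: tag each
    # dynamically created class with a fresh uuid, so the serialized
    # bytes (and hence any token derived from them) differ even for
    # textually identical class definitions.
    salt = uuid.uuid4().bytes
    return salt + cls.__qualname__.encode()

def f():
    class C:
        pass
    return C

token_a = hashlib.sha1(serialize_dynamic_class(f())).hexdigest()
token_b = hashlib.sha1(serialize_dynamic_class(f())).hexdigest()
assert token_a != token_b  # non-deterministic by construction
```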

Maybe this doesn't matter; I don't know.

Member

My gut tells me it is safer to raise if we are not sure. At this time it is hard to estimate the impact on dask-expr, and removing an exception later on is easier than introducing one.

Collaborator Author

It's worth noting that your example fails for local instances, but passes for local classes and functions:

# PASS
def test_tokenize_local_classes_from_different_contexts():
    def f():
        class C:
            pass

        return C

    assert check_tokenize(f()) == check_tokenize(f())


# FAIL
def test_tokenize_local_instances_from_different_contexts():
    def f():
        class C:
            pass

        return C()

    assert check_tokenize(f()) == check_tokenize(f())


# PASS
def test_tokenize_local_functions_from_different_contexts():
    def f():
        def g():
            return 123

        return g

    assert check_tokenize(f()) == check_tokenize(f())

IMHO it's not terribly common to define whole classes in a local context and I can't think of a real-life scenario where this will make things break. I'm not saying impossible, just that for me it feels like an edge case.

I've added the three tests above and xfailed the instance one, linking the upstream cloudpickle issue.

Collaborator Author

It's important to note that the difference in token is caused by calling f() twice. If you call f() once and then pass the returned object through a cloudpickle round-trip, the token remains the same.

Collaborator Author

Also worth noting that assert f() == f() fails in all three cases above. So in the two cases that pass, we're already doing more than what the Python interpreter does natively.
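A minimal plain-Python illustration of that point: classes, instances, and functions produced by separate calls all compare unequal, because their default equality is identity:

```python
def make_class():
    class C:
        pass
    return C

def make_function():
    def g():
        return 123
    return g

# Default __eq__ falls back to identity, so each comparison is unequal:
assert make_class() != make_class()            # distinct class objects
assert make_class()() != make_class()()        # instances of distinct classes
assert make_function() != make_function()      # distinct function objects
```

This is why having tokenize() treat two such objects as equal goes beyond what the interpreter itself guarantees.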
