
add test that shows how lambda tokenization is broken #765

Closed
wants to merge 3 commits into from

Conversation

@fjetter (Member) commented Jan 18, 2024

#223 had good intentions: it introduced a utility function _tokenize_deterministic that sets the tokenize.ensure-deterministic flag so that we can ensure the tokens we generate are indeed deterministic.
However, determinism in this context is mildly ambiguous, and there are subtly different consistency guarantees we need if we want to pickle Expr objects in a way that is deterministic across Python processes/interpreters.

Essentially, everything we would need cloudpickle to serialize is also susceptible to nondeterministic tokenization, since tokenization uses cloudpickle behind the scenes for those cases (therefore this loosely relates to #331 (comment), even though those are separate issues). In a way, that defeats the entire purpose of the tokenization engine (see point 3. below).
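To make the fallback concrete, here is a minimal sketch (not dask code; `outer`/`local_func` are made-up names) of the kind of object plain pickle cannot serialize, which is exactly what pushes the tokenizer onto the cloudpickle path:

```python
import pickle

def outer():
    # A local (closure-scoped) function: plain pickle cannot serialize
    # it by reference, which is why the tokenizer has to fall back to
    # cloudpickle for objects like this.
    def local_func(x):
        return x + 1
    return local_func

f = outer()

try:
    pickle.dumps(f)
except Exception as exc:
    # e.g. "Can't pickle local object 'outer.<locals>.local_func'"
    print("plain pickle failed:", exc)
```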

#223 also implemented a "fix" for LambdaType functions that uses the str representation of those functions, which looks like <function func.<locals>.<lambda> at 0x107363be0>. This is unique inside an interpreter session, since it includes the memory address where the lambda is stored, but it changes once moved to another interpreter.
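A minimal illustration of why that str-based workaround is interpreter-local (`make_func` is an invented name, not anything from #223):

```python
import re

def make_func():
    # A local lambda, like the ones the str-based "fix" targets.
    return lambda x: x + 1

func = make_func()

# The repr embeds the CPython memory address, e.g.
# "<function make_func.<locals>.<lambda> at 0x107363be0>".
# Any token derived from it is stable within this session only.
print(repr(func))

# The hex address is right there in the string:
assert re.search(r"0x[0-9a-f]+", repr(func))
```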

This effectively hard-blocks #14.

There are a couple of options (I'm not really happy about any of them, but here we go...):

  1. Enforce pickle as the serialization protocol and forbid usage of local functions, types, etc. (effectively Add custom serialization check to assert_eq #331, but expanded to all user APIs. We could wrap user-defined objects in some magical class, but I think this would feel a little awkward.)
  2. A slightly better/milder version of 1). I believe we can define deterministic cross-interpreter tokens for lambdas/local functions. I had something like this in a version of Ensure tokenize is consistent for pickle roundtrips dask#10808, but it turned out not to be stable across Python versions, so I ended up removing it again. I'd have to run a couple
  3. Abandon tokenization entirely. Inside an interpreter this problem is easy: if an object doesn't implement __hash__, we could just use its id as a unique identifier and hash (or just stick with the current engine, although it feels slightly over-engineered then). From what I understand, the entire tokenization exists merely for the distributed scheduler, but I see a way to avoid that. This would require us to rewrite how client and scheduler communicate in dask/distributed, but I believe this is doable with a reasonable amount of effort. The scheduler would de facto be the source of truth and be responsible for key generation, not the other way around. Locally generated keys would no longer map to remote keys. That would be fine, I guess?

cc @rjzamora @mrocklin @phofl

Note: This problem has been around forever. HLGs were not as affected by it, since their names are actually not unique (e.g. dask/dask#9888). Expr objects calculate names/tokens recursively, which is objectively much better for key uniqueness, but it also means that if any object inside the expr tree cannot be tokenized deterministically, the entire tree is affected.

@fjetter (Member, Author) commented Jan 18, 2024

Unless there are other options I am not seeing, 3) would be the most permissive for users. Since it is also the one with the most uncertainty, I'll probably invest some time and whip up a prototype outlining what I'm thinking.

Short term we may just postpone scheduler integration, since I don't think it is critical for a release, but I'd like to know what our options are. If we were forced to go with option 1), it would be good to know before the release.

@fjetter (Member, Author) commented Jan 18, 2024

There is also the option of making cloudpickle itself deterministic. There has been a little progress on that front: cloudpipe/cloudpickle#453, cloudpipe/cloudpickle#524

@mrocklin (Member) commented Jan 18, 2024 via email

@phofl (Collaborator) commented Jan 18, 2024

The assumption that the name attribute of an expression is the same if the expression has exactly the same operands is deeply baked into our optimization logic.

@fjetter (Member, Author) commented Jan 18, 2024

Apart from the optimization logic itself, the way distributed is currently built is roughly as follows:

Client side

  • Take expr and optimize it
  • Generate the final output keys
  • Use those keys to generate Futures we can fetch and listen to

Scheduler side

  • Take the expression and materialize it
  • Compute stuff
  • If one of the keys the client asked for is ready, let the client know. (If the keys generated on the client side don't match the actual graph keys, we'll deadlock, because the client is waiting for the results of keys that don't exist.)

Changing this procedure is what I meant by 3), but this is a bit of work. I was hoping not to have to do this, particularly not while still supporting HLGs.
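The deadlock described above can be sketched as a toy simulation (nothing here is distributed API; both token functions are invented stand-ins):

```python
# Toy model: the client waits on keys it computed locally, while the
# scheduler only announces keys it computed itself. If the two token
# functions disagree, the client's key is never announced.

def client_token(expr):
    return f"key-{hash(expr) % 1000}"        # deterministic in-process

def scheduler_token(expr):
    return f"key-{(hash(expr) + 1) % 1000}"  # simulates a mismatch

expr = ("sum", 1, 2, 3)
wanted = client_token(expr)
announced = {scheduler_token(expr): 6}  # scheduler's view of results

# The client polls for its own key; it is never in `announced`,
# so it would wait forever -> deadlock.
print(wanted in announced)  # False
```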

@mrocklin (Member) commented

One dumb approach (maybe not the one we want to take) is to optionally store a finalized key on the expression before we serialize it. So on the client side we do all of our optimizations, get a nice expr that we like, finalize the key on the expression, then serialize it up to the scheduler. The _name property checks for that finalized key and, if it exists, uses it instead of tokenization.

This seems like it would be easy short term. It would stop us from doing scheduler-side optimization (which I think we're all excited about medium-term) but might give us the fast client-scheduler communications we want in the meantime.

@rjzamora (Member) commented

I haven't looked carefully through this code or other linked work yet. However, just to check my rough understanding, it seems like there are two challenges here:

  • Client-side dask-expr depends on deterministic tokenization to function. We identify an expression by its name, and its name depends directly on the tokenization of its operands. There are some cases (e.g. local functions) where the necessary "determinism" is broken and/or it limits what a user can do. Possible solutions:
    • Find a way to avoid tokenization (seems difficult)
    • Use something like inspect.getsource to automatically convert local/lambda objects to deterministic strings (probably still has problems/limitations)
    • Something else...
  • In order for us to materialize expressions on the scheduler, we need the expression names to be the same on the client and scheduler. However, even some objects that pass the _tokenize_deterministic test will not actually tokenize to the same value on the client and scheduler. Possible solutions:
    • We do something like Matt's last suggestion. To unlock scheduler-side optimizations in the future, it may even be possible to "freeze" the keys on the client and then "remap" them on the scheduler. Things should be okay if the scheduler keeps track of the mapping between the new and original keys (maybe?).
    • Something else...
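One possible shape of the inspect-based normalization mentioned above (a sketch only; #767 may differ in detail, and `normalize_callable` is an invented name):

```python
import inspect
from hashlib import sha1

def normalize_callable(func):
    """Token a local/lambda function by its source text rather than
    its repr (which embeds a memory address). Falls back to the
    qualified name if the source is unavailable (e.g. under exec)."""
    try:
        payload = inspect.getsource(func)
    except (OSError, TypeError):
        payload = func.__qualname__
    return sha1(payload.encode()).hexdigest()

f = lambda x: x + 1
# Two calls give the same token; a repr-based token would not
# survive a new interpreter, but source text (or a qualname) can.
print(normalize_callable(f) == normalize_callable(f))  # True
```

The known limitations apply: source text changes with trivial reformatting, and closures over different values still normalize identically.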

@fjetter (Member, Author) commented Jan 18, 2024

I think I'd be fine with a finalized name after optimization.

@rjzamora (Member) commented

@fjetter - Just a note that I submitted #767 to demonstrate how the tests added in this PR will pass if we use inspect to normalize local/lambda functions.

@fjetter (Member, Author) commented Jan 18, 2024

Reposting here:

The most recent commit on cloudpickle's main branch (not released yet) likely solves the lambda problem, but there are still cases that are not handled well.

It's easy for us to verify on the scheduler side whether the tokenization was successful, so we can abort with a clear exception. Maybe the new cloudpickle plus that scheduler-side verification is good enough for us?
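The scheduler-side verification could look roughly like this (a sketch only; `token` and `scheduler_receive` are invented stand-ins, with sha1-of-pickle in place of dask's tokenize):

```python
import pickle
from hashlib import sha1

def token(obj):
    # Stand-in for dask's tokenize(); any deterministic token works here.
    return sha1(pickle.dumps(obj)).hexdigest()

def scheduler_receive(payload, client_token):
    """Hypothetical scheduler-side check: recompute the token after
    deserialization and fail loudly instead of deadlocking."""
    obj = pickle.loads(payload)
    if token(obj) != client_token:
        raise RuntimeError(
            "Nondeterministic tokenization detected; "
            "refusing to schedule with mismatched keys."
        )
    return obj

expr = ("sum", (1, 2, 3))
payload = pickle.dumps(expr)
print(scheduler_receive(payload, token(expr)))  # ('sum', (1, 2, 3))
```

The point is the failure mode: a clear exception at submission time rather than a client waiting forever on keys that don't exist.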

@crusaderky (Collaborator) commented Feb 7, 2024

Status update

The PRs linked from dask/dask#10905 make lambda tokenization deterministic, with the currently released version of cloudpickle, on the same Python interpreter. Nothing should be expected across processes or hosts.

So we'll either need tokens to be generated on the client and then shipped to the scheduler, or the scheduler to generate them first and ship them back to the client (the latter is more involved because, as of today, we use batched comms, which are one-directional, but the client would need to be informed of which keys it just gained in who_wants).

It would be trivial to add a single line to the dask tokenize function that bakes in a unique interpreter ID, so that we see things fail hard and fast in trivial use cases instead of only on the edge cases of cloudpickle.

It of course remains possible for some obscure third-party object (callable or not) to not cloudpickle deterministically AND come with neither a __dask_tokenize__ method nor a normalize_token.register hook attached. Nothing we can do about that besides being as loud as possible when it happens (read: don't silently hang).
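The interpreter-ID idea amounts to salting the token, roughly like this (a sketch; `INTERPRETER_ID` and `salted_token` are invented names, not dask code):

```python
import uuid
from hashlib import sha1

# Generated once per interpreter: a different process gets a
# different value, so tokens deliberately do not survive shipping.
INTERPRETER_ID = uuid.uuid4().hex

def salted_token(obj):
    # Any token produced here is valid only inside this interpreter.
    # Reusing it elsewhere fails fast instead of silently hanging.
    return sha1((INTERPRETER_ID + repr(obj)).encode()).hexdigest()

print(len(salted_token([1, 2, 3])))  # 40
```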

@crusaderky (Collaborator) commented

Superseded by dask/dask#10883

@crusaderky crusaderky closed this Feb 13, 2024