
add test that shows how lambda tokenization is broken #765

Closed
wants to merge 3 commits into from

Conversation

@fjetter (Member) commented Jan 18, 2024

#223 had good intentions: it introduced a utility function _tokenize_deterministic that sets the tokenize.ensure-deterministic flag so that we can ensure the tokens we generate are indeed deterministic.
However, determinism in this context is mildly ambiguous, and there are subtly different consistency guarantees we need if we want to pickle Expr objects in a way that is deterministic across Python processes/interpreters.

Essentially, everything we would need cloudpickle to serialize is also susceptible to nondeterministic tokenization, since tokenization uses cloudpickle behind the scenes for those cases (therefore this loosely relates to #331 (comment), even though those are separate issues). In a way, that defeats the entire purpose of the tokenization engine (see point 3. below).
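To make the fallback concrete, here is a minimal sketch (not dask code; `outer`/`local_func` are made-up names) of the kind of object plain pickle cannot serialize, which is exactly what pushes the tokenizer onto the cloudpickle path:

```python
import pickle

def outer():
    # A local (closure-scoped) function: plain pickle cannot serialize
    # it by reference, which is why the tokenizer has to fall back to
    # cloudpickle for objects like this.
    def local_func(x):
        return x + 1
    return local_func

f = outer()

try:
    pickle.dumps(f)
except Exception as exc:
    # e.g. "Can't pickle local object 'outer.<locals>.local_func'"
    print("plain pickle failed:", exc)
```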

#223 also implemented a "fix" for LambdaType functions that uses the str representation of those functions, which looks like <function func.<locals>.<lambda> at 0x107363be0>. This is unique inside an interpreter session, since it includes the memory address where the lambda is stored, but it changes once moved to another interpreter.
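A minimal illustration of why that str-based workaround is interpreter-local (`make_func` is an invented name, not anything from #223):

```python
import re

def make_func():
    # A local lambda, like the ones the str-based "fix" targets.
    return lambda x: x + 1

func = make_func()

# The repr embeds the CPython memory address, e.g.
# "<function make_func.<locals>.<lambda> at 0x107363be0>".
# Any token derived from it is stable within this session only.
print(repr(func))

# The hex address is right there in the string:
assert re.search(r"0x[0-9a-f]+", repr(func))
```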

This effectively hard-blocks #14.

There are a couple of options (I'm not really happy about any of them, but here we go...):

  1. Enforce pickle as the serialization protocol and forbid usage of local functions, types, etc. (effectively Add custom serialization check to assert_eq #331, but expanded to all user APIs. We could wrap user-defined objects in some magical class, but I think this would feel a little awkward.)
  2. A slightly better/milder version of 1). I believe we can define deterministic cross-interpreter tokens for lambdas/local functions. I had something like this in a version of Ensure tokenize is consistent for pickle roundtrips dask#10808, but it turned out not to be stable across Python versions, so I ended up removing it again. I'd have to run a couple
  3. Abandon tokenization entirely. Inside an interpreter this problem is easy: if an object doesn't implement __hash__, we could just use its id as a unique identifier and hash (or just stick with the current engine, although it feels slightly over-engineered then). From what I understand, the entire tokenization exists merely for the distributed scheduler, but I see a way to avoid that. This would require us to rewrite how client and scheduler communicate in dask/distributed, but I believe this is doable with a reasonable amount of effort. The scheduler would de facto be the source of truth and be responsible for key generation, not the other way around. Locally generated keys would no longer map to remote keys. That would be fine, I guess?

cc @rjzamora @mrocklin @phofl

Note: This problem has been around forever. HLGs were not as affected by it, since their names are actually not unique (e.g. dask/dask#9888). Expr objects calculate names/tokens recursively, which is objectively much better for key uniqueness, but it also means that if any object inside the expr tree cannot be tokenized deterministically, the entire tree is affected.

@fjetter (Member, Author) commented Jan 18, 2024

Unless there are other options I am not seeing, 3) would be the most permissive for users. Since it is also the one with the most uncertainty, I'll probably invest some time and whip up a prototype outlining what I'm thinking.

Short term we may just postpone scheduler integration, since I don't think it is critical for a release, but I'd like to know what our options are. If we were forced to go with option 1), it would be good to know before the release.

@fjetter (Member, Author) commented Jan 18, 2024

There is also the option of making cloudpickle itself deterministic. There has been a little progress on that front: cloudpipe/cloudpickle#453, cloudpipe/cloudpickle#524

@mrocklin (Member) commented Jan 18, 2024 via email

@phofl (Collaborator) commented Jan 18, 2024

The assumption that the name attribute of an expression is the same if the expression has exactly the same operands is deeply baked into our optimization logic.

@fjetter (Member, Author) commented Jan 18, 2024

Apart from the optimization logic itself, the way distributed is currently built is roughly as follows:

Client side

  • Take expr and optimize it
  • Generate the final output keys
  • Use those keys to generate Futures we can fetch and listen to

Scheduler side

  • Take the expression and materialize it
  • Compute stuff
  • If one of the keys the client asked for is ready, let the client know. (If the keys generated on the client side don't match the actual graph keys, we'll deadlock, because the client is waiting for the results of keys that don't exist.)

Changing this procedure is what I meant by 3), but this is a bit of work. I was hoping not to have to do this, particularly not while still supporting HLGs.
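The deadlock described above can be sketched as a toy simulation (nothing here is distributed API; both token functions are invented stand-ins):

```python
# Toy model: the client waits on keys it computed locally, while the
# scheduler only announces keys it computed itself. If the two token
# functions disagree, the client's key is never announced.

def client_token(expr):
    return f"key-{hash(expr) % 1000}"        # deterministic in-process

def scheduler_token(expr):
    return f"key-{(hash(expr) + 1) % 1000}"  # simulates a mismatch

expr = ("sum", 1, 2, 3)
wanted = client_token(expr)
announced = {scheduler_token(expr): 6}  # scheduler's view of results

# The client polls for its own key; it is never in `announced`,
# so it would wait forever -> deadlock.
print(wanted in announced)  # False
```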

@mrocklin (Member) commented

One dumb approach (maybe not the one we want to take) is to optionally store a finalized key on the expression before we serialize it. So on the client side we do all of our optimizations, get a nice expr that we like, finalize the key on the expression, then serialize it up to the scheduler. The _name property checks for that finalized key and, if it exists, uses it instead of tokenization.

This seems like it would be easy short term. It would stop us from doing scheduler-side optimization (which I think we're all excited about medium-term) but might give us the fast client-scheduler communications we want in the meantime.

@rjzamora (Member) commented

I haven't looked carefully through this code or other linked work yet. However, just to check my rough understanding, it seems like there are two challenges here:

  • Client-side dask-expr depends on deterministic tokenization to function. We identify an expression by its name, and its name depends directly on the tokenization of its operands. There are some cases (e.g. local functions) where the necessary "determinism" is broken and/or it limits what a user can do. Possible solutions:
    • Find a way to avoid tokenization (seems difficult)
    • Use something like inspect.getsource to automatically convert local/lambda objects to deterministic strings (probably still has problems/limitations)
    • Something else...
  • In order for us to materialize expressions on the scheduler, we need the expression names to be the same on the client and scheduler. However, even some objects that pass the _tokenize_deterministic test will not actually tokenize to the same value on the client and scheduler. Possible solutions:
    • We do something like Matt's last suggestion. To unlock scheduler-side optimizations in the future, it may even be possible to "freeze" the keys on the client and then "remap" them on the scheduler. Things should be okay if the scheduler keeps track of the mapping between the new and original keys (maybe?).
    • Something else...
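One possible shape of the inspect-based normalization mentioned above (a sketch only; #767 may differ in detail, and `normalize_callable` is an invented name):

```python
import inspect
from hashlib import sha1

def normalize_callable(func):
    """Token a local/lambda function by its source text rather than
    its repr (which embeds a memory address). Falls back to the
    qualified name if the source is unavailable (e.g. under exec)."""
    try:
        payload = inspect.getsource(func)
    except (OSError, TypeError):
        payload = func.__qualname__
    return sha1(payload.encode()).hexdigest()

f = lambda x: x + 1
# Two calls give the same token; a repr-based token would not
# survive a new interpreter, but source text (or a qualname) can.
print(normalize_callable(f) == normalize_callable(f))  # True
```

The known limitations apply: source text changes with trivial reformatting, and closures over different values still normalize identically.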

@fjetter (Member, Author) commented Jan 18, 2024

I think I'd be fine with a finalized name after optimization.

@rjzamora (Member) commented

@fjetter - Just a note that I submitted #767 to demonstrate how the tests added in this PR will pass if we use inspect to normalize local/lambda functions.

@fjetter (Member, Author) commented Jan 18, 2024

Reposting here:

The most recent commit on cloudpickle's main branch (not released yet) likely solves the lambda problem, but there are still cases that are not handled well.

It's easy for us to verify on the scheduler side whether the tokenization was successful, so we can abort with a clear exception. Maybe the new cloudpickle plus that scheduler-side verification is good enough for us?
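The scheduler-side verification could look roughly like this (a sketch only; `token` and `scheduler_receive` are invented stand-ins, with sha1-of-pickle in place of dask's tokenize):

```python
import pickle
from hashlib import sha1

def token(obj):
    # Stand-in for dask's tokenize(); any deterministic token works here.
    return sha1(pickle.dumps(obj)).hexdigest()

def scheduler_receive(payload, client_token):
    """Hypothetical scheduler-side check: recompute the token after
    deserialization and fail loudly instead of deadlocking."""
    obj = pickle.loads(payload)
    if token(obj) != client_token:
        raise RuntimeError(
            "Nondeterministic tokenization detected; "
            "refusing to schedule with mismatched keys."
        )
    return obj

expr = ("sum", (1, 2, 3))
payload = pickle.dumps(expr)
print(scheduler_receive(payload, token(expr)))  # ('sum', (1, 2, 3))
```

The point is the failure mode: a clear exception at submission time rather than a client waiting forever on keys that don't exist.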

@crusaderky (Collaborator) commented Feb 7, 2024

Status update

The PRs linked from dask/dask#10905 make lambda tokenization deterministic, with the currently released version of cloudpickle, on the same Python interpreter. Nothing should be expected across processes or hosts.

So we'll either need tokens to be generated on the client and then shipped to the scheduler, or the scheduler to generate them first and ship them back to the client (the latter is more involved because, as of today, we use batched comms, which are one-directional, but the client would need to be informed of which keys it just gained in who_wants).

It would be trivial to add a single line to the dask tokenize function that bakes in a unique interpreter ID, so that we see things fail hard and fast in trivial use cases instead of only on the edge cases of cloudpickle.

It of course remains possible for some obscure third-party object (callable or not) to not cloudpickle deterministically AND come with neither a __dask_tokenize__ method nor a normalize_token.register hook attached. Nothing we can do about that besides being as loud as possible when it happens (read: don't silently hang).
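The interpreter-ID idea amounts to salting the token, roughly like this (a sketch; `INTERPRETER_ID` and `salted_token` are invented names, not dask code):

```python
import uuid
from hashlib import sha1

# Generated once per interpreter: a different process gets a
# different value, so tokens deliberately do not survive shipping.
INTERPRETER_ID = uuid.uuid4().hex

def salted_token(obj):
    # Any token produced here is valid only inside this interpreter.
    # Reusing it elsewhere fails fast instead of silently hanging.
    return sha1((INTERPRETER_ID + repr(obj)).encode()).hexdigest()

print(len(salted_token([1, 2, 3])))  # 40
```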

@crusaderky (Collaborator) commented

Superseded by dask/dask#10883

@crusaderky crusaderky closed this Feb 13, 2024