
Don't use hash_funcs to compute an @st.cache function's cache_key #2331

Merged
3 commits merged into streamlit:develop from tim/FixFunctionCache on Nov 10, 2020

Conversation

tconkling
Contributor

Only use hash_funcs for computing the value_key for a cached function value.

Fixes #2328

A longer explanation, from that bug:

There are two parts to cache retrieval for @st.cache:

  1. Retrieve the decorated function's MemCache instance. We use a (func.__module__, func.__qualname__, func) tuple to get the cache_key that uniquely identifies the function. No two functions (even if they have the same name and body) will share the same MemCache.

  2. Retrieve the cached value from the function's MemCache. This is where we hash the function's arguments to produce the value_key for looking up the value within the MemCache.
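The two steps above can be sketched as a minimal model (hypothetical; not Streamlit's actual implementation):

```python
# Two-level lookup: cache_key selects the function's MemCache,
# value_key selects the cached value inside it.
mem_caches = {}  # cache_key -> per-function cache (a plain dict here)

def get_cached(cache_key, value_key, compute):
    mem_cache = mem_caches.setdefault(cache_key, {})  # step 1: per-function cache
    if value_key not in mem_cache:                     # step 2: per-arguments value
        mem_cache[value_key] = compute()
    return mem_cache[value_key]

assert get_cached("f1", "args_a", lambda: 1) == 1
assert get_cached("f1", "args_a", lambda: 2) == 1  # cache hit; compute() not re-run
```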

We currently pass hash_funcs when computing both cache_key and value_key. This is generally innocuous, since cache_key hashes two strings and the function's AST. However, if you supply a hash_func that operates on string values, you run the risk of having two different functions resolve to the same cache_key and unexpectedly share a MemCache instance.

See this forum issue for an example of this bug in action.

In short, we should not pass hash_funcs to the cache_key hasher. We never want different functions to share the same MemCache instance. (Passing hash_funcs here was an oversight - the solution is to just pass hash_funcs=None!)
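The collision can be shown with a simplified, hypothetical model of the key computation (make_key and its hashing scheme are stand-ins, not Streamlit's real code):

```python
import hashlib

def make_key(parts, hash_funcs=None):
    """Simplified stand-in for cache_key computation."""
    hash_funcs = hash_funcs or {}
    h = hashlib.md5()
    for part in parts:
        if type(part) in hash_funcs:
            part = hash_funcs[type(part)](part)  # user override intercepts the part
        h.update(repr(part).encode())
    return h.hexdigest()

# A user-supplied catch-all str hash_func coarsens every string to the same value:
str_hash_func = {str: lambda s: 0}

# Two distinct functions' (module, qualname) strings now collapse together:
key_a = make_key(("app", "func_a"), hash_funcs=str_hash_func)
key_b = make_key(("app", "func_b"), hash_funcs=str_hash_func)
assert key_a == key_b  # collision: both resolve to the same cache_key

# Without hash_funcs, the module/qualname strings keep the keys distinct:
assert make_key(("app", "func_a")) != make_key(("app", "func_b"))
```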

@tconkling tconkling requested a review from a team November 9, 2020 23:34
@jrhone
Contributor

jrhone commented Nov 9, 2020

Are you sure we don't want to use hash funcs at all when creating the cache key?

Or do we just not want to use it when hashing the module and function name?

@tconkling
Contributor Author

Yeah, we never want to use it when creating the cache_key, only when creating the value_key. hash_funcs is an escape hatch for hashing the parameters sent to a @st.cache-decorated function whose values we don't otherwise know how to hash.

Using hash_funcs in the computation of the cache_key breaks the API promise of completely independent caches for each @st.cache function (and provides no user benefit; the cache_key is purely an under-the-hood detail that implements the per-function cache).
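The intended use of hash_funcs can be sketched as follows (hypothetical value_key model; Connection is a made-up stand-in for an unhashable resource):

```python
import hashlib

class Connection:
    """Stand-in for an unhashable resource, e.g. a database connection."""
    def __init__(self, url):
        self.url = url

# hash_funcs maps a type to a callable that reduces it to something hashable.
hash_funcs = {Connection: lambda conn: conn.url}

def value_key(args, hash_funcs):
    """Simplified stand-in for hashing a function's arguments."""
    h = hashlib.md5()
    for a in args:
        for typ, fn in hash_funcs.items():
            if isinstance(a, typ):
                a = fn(a)  # reduce the unhashable value via the user's hash_func
                break
        h.update(repr(a).encode())
    return h.hexdigest()

# Two connections to the same URL hash alike, so the cached value is reused:
k1 = value_key((Connection("db://prod"), 42), hash_funcs)
k2 = value_key((Connection("db://prod"), 42), hash_funcs)
assert k1 == k2
```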

Collaborator

@kmcgrady kmcgrady left a comment


Seems to make sense to me.

I checked the docs to make sure there's no confusion. As I read them, hash_funcs describes how to hash the parameters, which means it should apply to value_key hashing only, not to the cache_key.

I originally wondered whether there would ever be a reason to share a cache across multiple functions, and concluded "probably not". On top of that, for that exact case the functions could be wrapped in a single cached function (i.e. there's likely a workaround for this incredibly rare scenario).

@jrhone
Contributor

jrhone commented Nov 10, 2020

I believe when we hash functions we hash both the bytecode and any referenced objects. So it's possible we'd try to hash an object we don't know how to hash while generating the cache_key.

Looking at your PR comment below, it seems like we don't want to use hash_funcs with func.__module__ and func.__qualname__, but we do want them for func.

We use a (func.__module__, func.__qualname__, func) tuple to get the cache_key
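The point that hashing a function involves more than its bytecode can be seen directly in Python: the globals a function references are part of its behavior, and their names are visible on the code object (a small illustration, not Streamlit's hasher):

```python
# Changing LIMIT changes what clamp computes, even though clamp's bytecode
# is identical, so a correct function hash must cover referenced objects too.
LIMIT = 10

def clamp(x):
    return min(x, LIMIT)

# The global names a function references are recorded on its code object:
print(clamp.__code__.co_names)  # ('min', 'LIMIT')
```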

@tconkling
Contributor Author

tconkling commented Nov 10, 2020

Yeah, @jrhone is correct, and I completely overlooked this. When we hash the function itself, it's not just the AST, it's also all objects on the stack that are referenced by the function.

We could do a two-phase cache-key generation:

# Include the function's module and qualified name in the hash.
update_hash(
    (func.__module__, func.__qualname__),
    hasher=func_hasher,
    hash_funcs=None,
    hash_reason=HashReason.CACHING_FUNC_BODY,
    hash_source=func,
)

# Include the function's body in the hash. We _do_ pass hash_funcs here,
# because this step will be hashing any objects referenced in the function
# body.
update_hash(
    func,
    hasher=func_hasher,
    hash_funcs=hash_funcs,
    hash_reason=HashReason.CACHING_FUNC_BODY,
    hash_source=func,
)

cache_key = func_hasher.hexdigest()

However, this doesn't address a possibly deeper issue, which is that many of our special-cased hashing functions use strings and id(...)s from inside other objects to calculate hashes. If the user supplies a str or int hash_func, they're going to have a potentially massive minefield no matter what.

Maybe we should just disallow this entirely?
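The minefield can be illustrated with a small hypothetical: internal hashers often reduce objects to strings (type names, file paths) before hashing, so a user's catch-all str hash_func intercepts those internal strings as well (the values below are invented for illustration):

```python
# Internal string components a hasher might derive from unrelated objects:
internal_parts = ["pandas.DataFrame", "/tmp/data.csv"]

# A user's catch-all str hash_func collapses every string to one value...
user_hash_funcs = {str: lambda s: ""}

# ...so unrelated internals become indistinguishable to the hasher:
hashed = [user_hash_funcs[str](p) for p in internal_parts]
assert hashed[0] == hashed[1]
```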

@tconkling tconkling marked this pull request as draft November 10, 2020 00:46
@tconkling
Contributor Author

I've fixed the issue that @jrhone pointed out, but -

I'm pulling this and changing it to a draft for now. I think we want to come to a better understanding of the right way forward for "catch-all" hash_funcs (like str or int).

@kmcgrady kmcgrady self-requested a review November 10, 2020 01:21
@tconkling tconkling marked this pull request as ready for review November 10, 2020 23:39
@tconkling tconkling merged commit b7ccd70 into streamlit:develop Nov 10, 2020
@tconkling tconkling deleted the tim/FixFunctionCache branch November 10, 2020 23:42
Successfully merging this pull request may close these issues.

caching: multiple functions can unexpectedly share a MemCache key