
Don't use hash_funcs to compute an @st.cache function's cache_key #2331

Merged
3 commits merged into streamlit:develop from tim/FixFunctionCache on Nov 10, 2020

Conversation

tconkling
Contributor

Only use hash_funcs for computing the value_key for a cached function value.

Fixes #2328

A longer explanation, from that bug:

There are two parts to cache retrieval for @st.cache:

  1. Retrieve the decorated function's MemCache instance. We use a (func.__module__, func.__qualname__, func) tuple to get the cache_key that uniquely identifies the function. No two functions (even if they have the same name and body) will share the same MemCache.

  2. Retrieve the cached value from the function's MemCache. This is where we hash the function's arguments to produce the value_key for looking up the value within the MemCache.
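The two steps above can be sketched as a minimal model (hypothetical; not Streamlit's actual implementation):

```python
# Two-level lookup: cache_key selects the function's MemCache,
# value_key selects the cached value inside it.
mem_caches = {}  # cache_key -> per-function cache (a plain dict here)

def get_cached(cache_key, value_key, compute):
    mem_cache = mem_caches.setdefault(cache_key, {})  # step 1: per-function cache
    if value_key not in mem_cache:                     # step 2: per-arguments value
        mem_cache[value_key] = compute()
    return mem_cache[value_key]

assert get_cached("f1", "args_a", lambda: 1) == 1
assert get_cached("f1", "args_a", lambda: 2) == 1  # cache hit; compute() not re-run
```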

We currently pass hash_funcs when computing both cache_key and value_key. This is generally innocuous, since cache_key hashes two strings and the function's AST. However, if you supply a hash_func that operates on string values, you run the risk of having two different functions resolve to the same cache_key and unexpectedly share a MemCache instance.

See this forum issue for an example of this bug in action.

In short, we should not pass hash_funcs to the cache_key hasher. We never want different functions to share the same MemCache instance. (Passing hash_funcs here was an oversight - the solution is to just pass hash_funcs=None!)
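The collision can be shown with a simplified, hypothetical model of the key computation (make_key and its hashing scheme are stand-ins, not Streamlit's real code):

```python
import hashlib

def make_key(parts, hash_funcs=None):
    """Simplified stand-in for cache_key computation."""
    hash_funcs = hash_funcs or {}
    h = hashlib.md5()
    for part in parts:
        if type(part) in hash_funcs:
            part = hash_funcs[type(part)](part)  # user override intercepts the part
        h.update(repr(part).encode())
    return h.hexdigest()

# A user-supplied catch-all str hash_func coarsens every string to the same value:
str_hash_func = {str: lambda s: 0}

# Two distinct functions' (module, qualname) strings now collapse together:
key_a = make_key(("app", "func_a"), hash_funcs=str_hash_func)
key_b = make_key(("app", "func_b"), hash_funcs=str_hash_func)
assert key_a == key_b  # collision: both resolve to the same cache_key

# Without hash_funcs, the module/qualname strings keep the keys distinct:
assert make_key(("app", "func_a")) != make_key(("app", "func_b"))
```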

@tconkling tconkling requested a review from a team November 9, 2020 23:34
@jrhone
Contributor

jrhone commented Nov 9, 2020

Are you sure we don't want to use hash funcs at all when creating the cache key?

Or do we just not want to use it when hashing the module and function name?

@tconkling
Contributor Author

Yeah, we never want to use it when creating the cache_key, only when creating the value_key. hash_funcs is an escape hatch for hashing the parameters sent to a @st.cache-decorated function whose values we don't otherwise know how to hash.

Using hash_funcs in the computation of the cache_key breaks the API promise of completely independent caches for each @st.cache function (and provides no user benefit; the cache_key is purely an under-the-hood detail that implements the per-function cache).
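The intended use of hash_funcs can be sketched as follows (hypothetical value_key model; Connection is a made-up stand-in for an unhashable resource):

```python
import hashlib

class Connection:
    """Stand-in for an unhashable resource, e.g. a database connection."""
    def __init__(self, url):
        self.url = url

# hash_funcs maps a type to a callable that reduces it to something hashable.
hash_funcs = {Connection: lambda conn: conn.url}

def value_key(args, hash_funcs):
    """Simplified stand-in for hashing a function's arguments."""
    h = hashlib.md5()
    for a in args:
        for typ, fn in hash_funcs.items():
            if isinstance(a, typ):
                a = fn(a)  # reduce the unhashable value via the user's hash_func
                break
        h.update(repr(a).encode())
    return h.hexdigest()

# Two connections to the same URL hash alike, so the cached value is reused:
k1 = value_key((Connection("db://prod"), 42), hash_funcs)
k2 = value_key((Connection("db://prod"), 42), hash_funcs)
assert k1 == k2
```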

Collaborator

@kmcgrady kmcgrady left a comment


Seems to make sense to me.

I checked the docs to make sure there's no confusion. As I read them, hash_funcs describes how to hash the parameters, which means it should apply to value_key hashing only, not to the cache_key.

I originally wondered whether there would ever be a reason to share a cache across multiple functions, and concluded "probably not". On top of that, for that exact case the functions could be wrapped in a single cached function (i.e. there's likely a workaround for this incredibly rare scenario).

@jrhone
Contributor

jrhone commented Nov 10, 2020

I believe when we hash functions we hash both the bytecode and any referenced objects. So it's possible we'd try to hash an object we don't know how to hash while generating the cache_key.

Looking at your PR comment below, it seems like we don't want to use hash_funcs with func.__module__ and func.__qualname__, but we do want them for func.

We use a (func.__module__, func.__qualname__, func) tuple to get the cache_key
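The point that hashing a function involves more than its bytecode can be seen directly in Python: the globals a function references are part of its behavior, and their names are visible on the code object (a small illustration, not Streamlit's hasher):

```python
# Changing LIMIT changes what clamp computes, even though clamp's bytecode
# is identical, so a correct function hash must cover referenced objects too.
LIMIT = 10

def clamp(x):
    return min(x, LIMIT)

# The global names a function references are recorded on its code object:
print(clamp.__code__.co_names)  # ('min', 'LIMIT')
```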

@tconkling
Contributor Author

tconkling commented Nov 10, 2020

Yeah, @jrhone is correct, and I completely overlooked this. When we hash the function itself, it's not just the AST, it's also all objects on the stack that are referenced by the function.

We could do a two-phase cache-key generation:

# Include the function's module and qualified name in the hash.
update_hash(
    (func.__module__, func.__qualname__),
    hasher=func_hasher,
    hash_funcs=None,
    hash_reason=HashReason.CACHING_FUNC_BODY,
    hash_source=func,
)

# Include the function's body in the hash. We _do_ pass hash_funcs here,
# because this step will be hashing any objects referenced in the function
# body.
update_hash(
    func,
    hasher=func_hasher,
    hash_funcs=hash_funcs,
    hash_reason=HashReason.CACHING_FUNC_BODY,
    hash_source=func,
)

cache_key = func_hasher.hexdigest()

However, this doesn't address a possibly deeper issue, which is that many of our special-cased hashing functions use strings and id(...)s from inside other objects to calculate hashes. If the user supplies a str or int hash_func, they're going to have a potentially massive minefield no matter what.

Maybe we should just disallow this entirely?
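The minefield can be illustrated with a small hypothetical: internal hashers often reduce objects to strings (type names, file paths) before hashing, so a user's catch-all str hash_func intercepts those internal strings as well (the values below are invented for illustration):

```python
# Internal string components a hasher might derive from unrelated objects:
internal_parts = ["pandas.DataFrame", "/tmp/data.csv"]

# A user's catch-all str hash_func collapses every string to one value...
user_hash_funcs = {str: lambda s: ""}

# ...so unrelated internals become indistinguishable to the hasher:
hashed = [user_hash_funcs[str](p) for p in internal_parts]
assert hashed[0] == hashed[1]
```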

@tconkling tconkling marked this pull request as draft November 10, 2020 00:46
@tconkling
Contributor Author

I've fixed the issue that @jrhone pointed out, but -

I'm pulling this and changing it to a draft for now. I think we want to come to a better understanding of the right way forward for "catch-all" hash_funcs (like str or int).

@kmcgrady kmcgrady self-requested a review November 10, 2020 01:21
@tconkling tconkling marked this pull request as ready for review November 10, 2020 23:39
@tconkling tconkling merged commit b7ccd70 into streamlit:develop Nov 10, 2020
@tconkling tconkling deleted the tim/FixFunctionCache branch November 10, 2020 23:42
Successfully merging this pull request may close these issues.

caching: multiple functions can unexpectedly share a MemCache key