
Avoid creating an entry whose key is null at construction time. #6198

Closed

copybara-service[bot]
Contributor

In the case of an entry whose key is stored in a `WeakReference`, the resulting entry will never be enqueued in the `ReferenceQueue`, so I'm not sure that it will necessarily ever be removed from the cache.

Compare a similar change to `MapMakerInternalMap` that was part of cl/479157599.

I get the impression that there may be a similar situation with weak _values_, but I'm not biting that off right now.

As always, we [recommend](https://guava.dev/CacheBuilder) that you use [Caffeine](https://github.com/ben-manes/caffeine/wiki) instead of Guava's `cache` library.

RELNOTES=n/a
PiperOrigin-RevId: 479318112
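
To make the failure mode concrete, here is a small standalone demo (not Guava code; the class name is invented) showing why an entry built around an already-cleared key is a problem: a `WeakReference` whose referent is null at construction time has nothing to clear, so it is never enqueued, and cleanup that drains the `ReferenceQueue` will never see it.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class NullReferentDemo {
  public static void main(String[] args) throws InterruptedException {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // A reference whose referent can still be collected and enqueued.
    WeakReference<Object> live = new WeakReference<>(new Object(), queue);
    // Stand-in for an entry created after its key was already collected:
    // there is nothing to clear, so it can never be enqueued.
    WeakReference<Object> bornDead = new WeakReference<>(null, queue);

    System.gc();
    Thread.sleep(100); // give the GC a moment; timing is not guaranteed

    Reference<?> drained;
    while ((drained = queue.poll()) != null) {
      System.out.println("enqueued: " + (drained == live ? "live" : "bornDead"));
    }
    // Typically prints only "enqueued: live"; "bornDead" never shows up,
    // just as a cache entry wrapping a null key would never be swept.
  }
}
```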
@cpovirk
Member

cpovirk commented Oct 7, 2022

@ben-manes FYI

I don't think you have code quite like this in Caffeine, in part because you delegate the work of implementing a hash table to ConcurrentHashMap. And when you do call getKey(), you appear to have been more diligent than us about consistently using the result of the same call. But here's yet another reason to tell people to use Caffeine :)
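
A hedged sketch of that getKey() point, using invented names rather than Guava's or Caffeine's internals: reading the weak key twice can observe different answers, so the safer pattern snapshots it once.

```java
import java.lang.ref.WeakReference;

// Hypothetical weak-key entry; field and method names are made up for this example.
class WeakKeyEntry<K> {
  private final WeakReference<K> keyRef;

  WeakKeyEntry(K key) {
    this.keyRef = new WeakReference<>(key);
  }

  K getKey() {
    return keyRef.get(); // may flip from non-null to null between calls
  }

  // Risky pattern: the two getKey() calls can disagree if the GC clears the
  // key between them, so the second call can return null and the equals()
  // dereference can throw a NullPointerException.
  boolean matchesRisky(Object candidate) {
    return getKey() != null && getKey().equals(candidate);
  }

  // Safer pattern: read the weak key once and reuse that single snapshot.
  boolean matchesSafe(Object candidate) {
    K key = getKey();
    return key != null && key.equals(candidate);
  }
}
```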

@ben-manes
Contributor

Thanks for the ping!

This is an interesting and subtle bug! Like you said, it shouldn't impact Caffeine since we don't have a custom hash table where such mistakes can happen. I understood why Guava decided to fork and disagreed with that decision back then, so I took the easier path of delegation. That is less memory-optimal for reference caching, since it wraps the key/value instead of inlining the fields onto the hash entry. Caffeine optimizes towards size eviction, so reference caching is second class as the rarer case, whereas Guava initially took a big bet on soft references as the best policy and got bit hard when that did not pan out.

Caffeine's tests include a large number of reference tests and it automatically performs an internal state check after every successful test case. All the tests run against Guava as well for a compatibility check, so we could try to write a similar internal state checker if helpful. Unfortunately, I'm afraid that any effort we put into finding more bugs here won't translate into anyone having the time to fix them.

@cpovirk
Member

cpovirk commented Oct 10, 2022

Yes, please never invest your time in anything that would rely on us to address cache bugs in response :( The problem here wasn't something that we learned about because a user encountered and reported it; instead, a developer happened to notice the identical bug in the MapMakerInternalMap source code as part of implementing #6197 / a2e8f3c.

As a fun twist, the developer was modifying MapMakerInternalMap because even it was not close enough to memory-optimal for one team :) However, that team is an unusual team by Google standards. And they may ultimately end up forking MapMakerInternalMap to make it yet more nearly memory-optimal. I'm not sure if that forking counts as a reason for us to have taken the MapMakerInternalMap approach (since it would make the fork easier) or a reason not to (since it still doesn't save the team that really cares about memory usage from having to develop a fork) :)

(The context for all that is weak references and interning. We do seem to be gradually training people to avoid soft references, thankfully.)

@ben-manes
Contributor

If they care more about memory than about concurrency, then maybe forking WeakHashMap is more appropriate?

@cpovirk
Member

cpovirk commented Oct 10, 2022

Thanks. I think they care about concurrency, too. It's possible that what they'll actually fork in the end is ConcurrentHashMapV8.java, but I imagine that would be tricky :)

@ben-manes
Contributor

Maybe they can look at Spring's ConcurrentReferenceHashMap, which is a custom implementation that uses segmented locking. The JBoss implementation is similar to MapMaker's, so it might not make much of a difference (Google Collections had ReferenceMap via delegation, and the author felt one-upped, so MapMaker was an attempt at showmanship rather than thoughtful design).

@justinhorvitz

Hi, I'm the mysterious unnamed developer. You may remember me from ben-manes/caffeine#568. The project in question is to remove the memory overhead of interning in https://github.com/bazelbuild/bazel. I didn't bother open sourcing the design document, but in summary, the plan is to use existing bazel data structures to find canonical instances, and for various reasons (e.g. to support retrieval of map keys), we are considering maintaining our own concurrent map implementation. Thanks for the suggestion of Spring's implementation. I'm still weighing the pros and cons of each.

@ben-manes
Contributor

I had a feeling it was you 😄

I believe NonBlockingHashMap was slimmer than the JDK’s, but had footguns. It lacks reference caching, but is a different hash table design for consideration. I think maintaining a generic hash table as a micro optimization is cumbersome, but specializing when truly needed is reasonable. Good luck in finding a nice fit.

@justinhorvitz

Actually, we don't need weak/soft reference support. So that gives us some more options.

@ben-manes
Contributor

Oh what do you need exactly?

@justinhorvitz

A concurrent map that supports:

  1. Retrieving existing keys (for interning)
  2. A parallel iteration method similar to [ConcurrentHashMap#forEach](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/ConcurrentHashMap.html#forEach(long,java.util.function.BiConsumer)). We actually don't need this to be compatible with concurrent updates.
  3. Ideally entries have no more than 3 fields (key, value, next). Hash can be computed on demand by calling key.hashCode().

We might also need the ability to synchronize between two of these maps so that a key never exists in both. I haven't yet decided whether this will require anything special or whether we can get away with e.g. a side effect in computeIfAbsent.
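
A minimal baseline for the first two requirements, assuming nothing about the eventual Bazel design (the class and method names below are invented for illustration): a `ConcurrentHashMap` that maps each key to itself already supports retrieving the canonical key and offers the parallel `forEach`, though it misses the 3-field goal since CHM nodes also cache the hash.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Illustrative baseline, not the planned Bazel design.
public final class SelfInterner<T> {
  private final ConcurrentHashMap<T, T> map = new ConcurrentHashMap<>();

  // Requirement 1: retrieve the existing (canonical) key, interning if absent.
  public T intern(T sample) {
    T existing = map.putIfAbsent(sample, sample);
    return existing != null ? existing : sample;
  }

  // Requirement 2: parallel iteration via ConcurrentHashMap#forEach(long, BiConsumer).
  public void forEachCanonical(long parallelismThreshold, Consumer<? super T> action) {
    map.forEach(parallelismThreshold, (key, value) -> action.accept(key));
  }

  // Requirement 3 is where this falls short: CHM nodes carry hash, key,
  // value, and next, so they cache the hash rather than recomputing it.
}
```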

@ben-manes
Contributor

NBHM offers getk and does not store the hash code. It should be fast for traversal since it does not chain on collisions, so it might be trivially parallelized. However, the key may not be modified even after removal, due to tombstoning, among other gotchas. Cliff Click might be approachable if you find it a fit.

@cpovirk
Member

cpovirk commented Oct 11, 2022

Thanks, Ben.

And ooo, my mistake about weak keys, thanks. That makes NBHM look appealing for another reason (on top of the lack of chaining): If you don't need a WeakReference, then you don't need to store an "entry" object at all, just the single array for keys+values. And IIUC, that's how NBHM is implemented. (Compare our Android RegularImmutableMap, except NBHM is way more sophisticated in order to support modification :))
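
As a rough, non-concurrent illustration of that layout (this is not NBHM, and it skips resizing, removal, and concurrency), keys and values can live side by side in one flat array with open addressing, so no entry objects are allocated at all:

```java
import java.util.Map;

// Toy illustration of the flat key+value layout; fixed capacity, read-only
// after construction. Capacity must be a power of two larger than source.size().
final class FlatArrayMap<K, V> {
  private final Object[] table; // even slots hold keys, odd slots hold values
  private final int mask;

  FlatArrayMap(Map<K, V> source, int capacity) {
    this.table = new Object[capacity * 2];
    this.mask = capacity - 1;
    source.forEach(this::put);
  }

  private void put(K key, V value) {
    int i = key.hashCode() & mask;
    while (table[i * 2] != null) {
      i = (i + 1) & mask; // linear probing; keys from a Map are distinct
    }
    table[i * 2] = key;
    table[i * 2 + 1] = value;
  }

  @SuppressWarnings("unchecked")
  V get(Object key) {
    int i = key.hashCode() & mask;
    while (table[i * 2] != null) {
      if (table[i * 2].equals(key)) {
        return (V) table[i * 2 + 1];
      }
      i = (i + 1) & mask;
    }
    return null;
  }
}
```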

@justinhorvitz

My mistake actually. We might need weak keys after all. I was thinking we could just use the Guava interner, but we need to manually remove from it, and the synchronization to ensure the key isn’t located in two places may also be tricky without some customization. Still up in the air.

@ben-manes
Contributor

Assuming NBHM, you could reimplement those features on top of it.

For weak references, you could resurrect com.google.common.collect.ReferenceMap (public in the 2008 snapshot). That appears to work with NBHM. However, it uses a dedicated thread (FinalizableReferenceQueue), which caused a lot of pain by being too slow, interfering with class unloading, etc. The modern alternative, java.lang.ref.Cleaner, is very similar. When we inherited MapMaker, we replaced that with an amortized cleanup, which imho works better: in that model you simply increment a counter until a threshold, tryLock, sweep a ReferenceQueue, and reset the counter (or some equivalent variant of this).
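
A sketch of that amortized model under assumed names (this is not the MapMaker or Caffeine source): writers bump a counter, and once it crosses a threshold, whichever thread wins tryLock drains the queue and resets the counter.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of amortized reference cleanup. Key-equality concerns
// (a reference wrapper with referent-based equals/hashCode) are elided to
// keep the cleanup path in focus.
final class AmortizedCleanup<K, V> {
  private static final int DRAIN_THRESHOLD = 64;

  private final ConcurrentHashMap<Reference<K>, V> map = new ConcurrentHashMap<>();
  private final ReferenceQueue<K> queue = new ReferenceQueue<>();
  private final AtomicInteger writesSinceDrain = new AtomicInteger();
  private final ReentrantLock drainLock = new ReentrantLock();

  void put(K key, V value) {
    map.put(new WeakReference<>(key, queue), value);
    afterWrite();
  }

  private void afterWrite() {
    // Increment a counter until a threshold, tryLock, sweep the queue, reset.
    if (writesSinceDrain.incrementAndGet() >= DRAIN_THRESHOLD && drainLock.tryLock()) {
      try {
        Reference<? extends K> ref;
        while ((ref = queue.poll()) != null) {
          map.remove(ref); // the cleared reference is itself the map key
        }
        writesSinceDrain.set(0);
      } finally {
        drainLock.unlock();
      }
    }
  }
}
```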

For synchronizing across data structures, simply use a striped lock (e.g. Striped). A fixed striped lock at a reasonable size, e.g. 1024, is probably good enough. Then reads can remain lock-free, independent writes can run (mostly) in parallel, and you can obtain entry consistency when obtaining the lock.
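
A sketch of how that could look with Guava's Striped, with the two-map setup assumed from the earlier comment rather than an established design:

```java
import com.google.common.util.concurrent.Striped;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;

// Writers take the stripe lock for a key before touching either map, so the
// key can only live in one of them; plain reads stay lock-free.
final class TwoMapInterner<K> {
  private final Striped<Lock> stripes = Striped.lock(1024);
  private final ConcurrentHashMap<K, K> primary = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<K, K> secondary = new ConcurrentHashMap<>();

  /** Lock-free read: returns the canonical instance if it is already interned. */
  K getCanonical(K key) {
    K existing = primary.get(key);
    return existing != null ? existing : secondary.get(key);
  }

  /** Interns into the primary map, removing any copy from the secondary map. */
  K internIntoPrimary(K key) {
    Lock lock = stripes.get(key);
    lock.lock();
    try {
      secondary.remove(key); // ensure the key exists in only one map
      K existing = primary.putIfAbsent(key, key);
      return existing != null ? existing : key;
    } finally {
      lock.unlock();
    }
  }
}
```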

A little more work than an existing off-the-shelf solution, but with less memory overhead, high concurrency, and less tricky code to maintain.
