
Unintuitive behavior (bug?) with put() #568

Closed
justinhorvitz opened this issue Jun 28, 2021 · 19 comments
Comments

@justinhorvitz
Contributor

Upon migrating from Guava cache to Caffeine in Bazel, we are seeing some flakiness in tests that heavily exercise weak/soft-valued caches. It appears that the behavior of put() for an already-present key in Caffeine diverges from Guava cache (and ConcurrentHashMap): a concurrent thread calling getIfPresent can observe null, as opposed to either the old value or the new value.

Note that this strange behavior occurs even if the put call does not change the existing value. I believe it's the result of non-atomically clearing the existing value reference before setting it.

The following reproduces the issue within a few iterations:

Cache<String, String> cache = Caffeine.newBuilder().weakValues().build();
String key = "key";
String val = "val";
cache.put(key, val);

AtomicReference<Exception> e = new AtomicReference<>();
List<Thread> threads = new ArrayList<>();
for (int i = 0; i < 10; i++) {
  int name = i;
  Thread t =
          new Thread(
              () -> {
                for (int j = 0; j < 1000; j++) {
                  if (Math.random() < .5) {
                    cache.put(key, val);
                  } else if (cache.getIfPresent(key) == null) {
                    e.compareAndSet(
                        null,
                        new IllegalStateException(
                            "Thread " + name + " observed null on iteration " + j));
                    break;
                  }
                }
              });
  threads.add(t);
  t.start();
}

for (Thread t : threads) {
  t.join();
}

if (e.get() != null) {
  throw e.get();
}
@ben-manes
Owner

Thanks for the report and isolating where the problem is likely stemming from. From reviewing that setter logic, I think you're right as to where the bug is. I can confirm that if I remove the clear() statement then your test passes.

  1. The simplest fix would be to remove the clear() when setting. This would add a little more GC churn, but should be okay.
  2. The second option is to clear after the reference is set and make getValue() resilient to a stale read. It could race to read the old Reference prior to the set, but be context switched out to after the clear() and observe a null value. This would mean that a getValue() would need a retry loop and a volatile read barrier.

Do you have any preference / recommendation on which strategy we should prefer?

I can try to get a bug fix out tonight (or this week if not) for 2.x and 3.x development lines.

@justinhorvitz
Contributor Author

I don’t think I know the architecture well enough to make a recommendation. Generally we just want some value instead of null.

Can you explain what the extra gc churn would be from? Maybe there’s a way to special case and do nothing if the value is the same (kind of like putIfAbsent).

Thanks for the quick attention.

@ben-manes
Owner

The GC has to clear out all Reference objects explicitly when collecting the entry. That requires tracking weak/soft references, which adds additional overhead (your colleague Jeremy Manson wrote about this in his blog). The hope is that by explicitly clearing the reference, the GC does a little less work. We want to be GC hygienic, e.g. j.u.LinkedList#clear() nulls out the elements rather than merely setting the head/tail to null, in order to assist when the nodes inhabit multiple generations (which would otherwise require a full GC to collect).

Since GC is magical and we'd need an expert like Jeremy to tell us the optimal incarnation, the clear() is merely a best practice in hopes of helping. It might not be doing anything worthwhile. Your core java team might have some insights here.
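As a standalone illustration of why the explicit clear() can help (a sketch, not Caffeine code): a Reference that is cleared by application code never gets cleared or enqueued by the collector, so the GC skips reference processing for it entirely.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class ClearSketch {
  public static void main(String[] args) {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();
    WeakReference<Object> ref = new WeakReference<>(new Object(), queue);

    // Explicitly clearing drops the referent immediately; the GC will
    // never clear or enqueue this Reference itself, saving it work.
    ref.clear();

    if (ref.get() != null) throw new AssertionError("referent should be cleared");
    if (queue.poll() != null) throw new AssertionError("cleared refs are not enqueued");
  }
}
```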

The other option is to check on read if a stale reference was observed. If null, then a new read barrier ensures that the thread sees the prior write (and the next iteration's plain read cannot go back in time). That would look something like,

public final V getValue() {
  for (;;) {
    // VALUE is the VarHandle for the entry's value field; plain read first
    Reference<V> ref = (Reference<V>) VALUE.get(this);
    V value = ref.get();
    if ((value != null) || (ref == VALUE.getVolatile(this))) {
      return value;
    }
  }
}

public final void setValue(V value, ReferenceQueue<V> referenceQueue) {
  Reference<V> ref = (Reference<V>) getValueReference();
  VALUE.set(this, new WeakValueReference<V>(getKeyReference(), value, referenceQueue));
  ref.clear();
}

The slightly higher cost in user code might be preferable to adding GC work. This is very much a micro-optimization, but generally worth considering in data structure classes.
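For readers outside the codebase, the same idea can be sketched without VarHandles by letting an AtomicReference stand in for the entry's value field (the Holder class and method names here are hypothetical, not Caffeine's):

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.atomic.AtomicReference;

public class WeakValueHolderSketch {
  static final class Holder<V> {
    final AtomicReference<WeakReference<V>> valueRef;

    Holder(V value) {
      valueRef = new AtomicReference<>(new WeakReference<>(value));
    }

    V getValue() {
      WeakReference<V> ref = valueRef.getPlain(); // plain read
      for (;;) {
        V value = ref.get();
        if (value != null) {
          return value;
        }
        WeakReference<V> current = valueRef.get(); // volatile re-read to validate
        if (current == ref) {
          return null; // reference is current, so the GC really cleared it
        }
        ref = current; // raced with setValue; retry with the live reference
      }
    }

    void setValue(V value) {
      WeakReference<V> old = valueRef.getPlain();
      valueRef.set(new WeakReference<>(value)); // publish the new reference first
      old.clear();                              // then clear the stale one
    }
  }

  public static void main(String[] args) {
    Holder<String> holder = new Holder<>("old");
    if (!"old".equals(holder.getValue())) throw new AssertionError();
    holder.setValue("new");
    if (!"new".equals(holder.getValue())) throw new AssertionError();
  }
}
```

A reader that observes null after setValue() replaced the reference sees a different WeakReference on the volatile re-read and retries, so it returns either the old or the new value, never a spurious null.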

@justinhorvitz
Contributor Author

Unfortunately I'm finding that while the proposed solutions fix my provided repro, they don't actually fix the issue in our tests.

I don't have a full explanation for what's going on yet. Upon adding more debugging info, I am seeing removal notifications of type COLLECTED but with a non-null value - is that ever supposed to happen? I'm seeing this both with and without the proposed changes discussed above. It seems to happen right after the call to put. When I get such a removal notification, I verified that I still have a strong reference to the object, so it should not have been collected.

@ben-manes
Owner

Can you narrow it down to a unit test? Let's debug only with the removal of Reference.clear() as the safest option.

A COLLECTED may occur if either the key or value is null. It should not occur if both are non-null.

@justinhorvitz
Contributor Author

I'll try to get a better explanation tomorrow, but I'm definitely seeing a COLLECTED notification with a non-null key and value just prior to the test failing with the missing cache entry.

I grabbed a stack trace of what's submitting the weird removal notification:

com.github.benmanes.caffeine.cache.BoundedLocalCache.notifyRemoval(BoundedLocalCache.java:331)
com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntry(BoundedLocalCache.java:979)
com.github.benmanes.caffeine.cache.BoundedLocalCache.drainValueReferences(BoundedLocalCache.java:1523)
com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1483)
com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1454)
com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3299)
com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.exec(BoundedLocalCache.java:3286)
java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

We're using strong keys & weak values, version 2.8.5.

@justinhorvitz
Contributor Author

Line numbers are probably off because I added some debugging code.

@ben-manes
Owner

Hmm... I do know that there is a bug in Guava where that can occur. Nothing jumps out when scanning over the code; it appears correct, since it checks for null before supplying that cause. Thanks for digging into this.

@ben-manes
Owner

ben-manes commented Jun 29, 2021

Thinking about solution 1, removing the Reference#clear() call: I don't think that will actually work. Since the value is eligible for collection at any time, the GC might clear it by the time we dereference and obtain the null referent. The retry loop to validate that we correctly observed null as the live mapping is therefore required.
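A minimal standalone demonstration of that point: the collector may clear a weak referent on its own, with no call to clear(), as soon as no strong reference remains.

```java
import java.lang.ref.WeakReference;

public class GcMayClear {
  public static void main(String[] args) throws InterruptedException {
    WeakReference<Object> ref = new WeakReference<>(new Object());

    // No strong reference to the referent remains, so the collector is
    // free to clear it at any time; nudge it and wait briefly.
    for (int i = 0; i < 100 && ref.get() != null; i++) {
      System.gc();
      Thread.sleep(10);
    }

    if (ref.get() != null) throw new AssertionError("GC did not clear the weak reference");
  }
}
```

So even without an explicit clear(), a reader can observe a null referent from the live Reference, which is exactly why the validation loop is needed to tell "GC cleared it" apart from "I read a stale Reference."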

@ben-manes
Owner

Please try this patch in your tests and see if it resolves the original problem. This implements the retry loop to resolve the race.

@justinhorvitz
Contributor Author

I think that the issue as originally described is not actually affecting our tests, I just discovered it by accident. Instead, my understanding is that we're encountering a race between cleanup of an evicted value and a call to put to update that value.

  1. During cleanup, we check that the node's value reference is still the one that was enqueued.
  2. A concurrent call to put updates the node's value reference to a live value.
  3. The call to evictEntry proceeds to obtain a lock on the node, but does not check whether the node was updated with a new value. Thus we proceed to clean up a live entry.

With that understanding, I modified evictEntry with the following check and the issue disappeared, even without any of the other changes we discussed:

synchronized (n) {
  value[0] = n.getValue();
  if (cause == RemovalCause.COLLECTED && value[0] != null) {
    resurrect[0] = true;
    return n;
  }
}

Does this explanation and solution make sense to you?

I still think we should fix the originally reported issue as well.

@ben-manes
Owner

That's another excellent find! Yes, lets fix both.

I think that we can write a unit test by downcasting to the implementation (BoundedLocalCacheTest), clearing the reference to simulate a garbage collection, and throwing threads to catch the race.

@justinhorvitz
Contributor Author

Here's a stress test that seems to reproduce the 2nd issue:

AtomicReference<Exception> e = new AtomicReference<>();
Cache<String, Object> cache =
        Caffeine.newBuilder()
            .weakValues()
            .removalListener(
                (k, v, cause) -> {
                  if (cause == RemovalCause.COLLECTED && v != null) {
                    e.compareAndSet(null, new IllegalStateException("Evicted a live value: " + v));
                  }
                })
            .build();

String key = "key";
cache.put(key, new Object());

List<Thread> threads = new ArrayList<>();
AtomicBoolean missing = new AtomicBoolean();
for (int i = 0; i < 100; i++) {
  Thread t =
          new Thread(
              () -> {
                for (int j = 0; j < 1000; j++) {
                  if (e.get() != null) {
                    break;
                  }
                  if (Math.random() < .01) {
                    System.gc();
                    cache.cleanUp();
                  } else if (cache.getIfPresent(key) == null && !missing.getAndSet(true)) {
                    cache.put(key, new Object());
                    missing.set(false);
                  }
                }
              });
  threads.add(t);
  t.start();
}

for (Thread t : threads) {
  t.join();
}

if (e.get() != null) {
  throw e.get();
}

As for the fix for the first issue - I don't understand why a loop is necessary. Can we just follow up the non-volatile read with a volatile read?

I'm not sure of the best way to write the tests with the existing framework, so I'm hoping you can help me get these ideas merged.

@ben-manes
Owner

Thank you @justinhorvitz. Sorry that I haven't been able to focus on this much (it's a meeting day for me). My plan is to translate your work into our test framework and a commit, and add you as co-authored-by for attribution. Of course you are welcome to send a PR, but that is not necessary.

As for the fix for the first issue - I don't understand why a loop is necessary. Can we just follow up the non-volatile read with a volatile read?

The problem is that we have an indirection: value -> Reference -> referent. At any point the thread may be context switched out and the environment changed (the referent was collected, the value was replaced). Because of this indirection, even if we only used a volatile read a race can occur, as our decision depends on dereferencing the state and assuming nothing changed. We can get away with not validating a non-null read because non-determinism is expected there, but as you observed, a null may never have been an accurate visible result. The validation loop allows us to detect that. If we did not loop, then even if the volatile read differed we still couldn't trust the new referent, as it too might have been collected; the context switch duration might extend far into the future.

The non-volatile read is there because we can piggyback on the Map.get's volatile read. That is a memory read barrier, which enforces point-in-time visibility within the MESI cache coherence protocol. The plain read means that we know the value as of the time of the Map.get, so as a micro-optimization we don't need stronger consistency a few instructions later. Yet that read can be inconsistent, so we need a volatile read to establish a new point-in-time barrier for the validation check.

@justinhorvitz
Contributor Author

Makes sense. We're trying to distinguish between a GC clear (returning null is valid) and a stale read (need to read the updated reference). I guess I would write it as such, to avoid the double read when it loops:

public final V getValue() {
  Reference<V> ref = (Reference<V>) VALUE.get(this);
  for (;;) {
    V value = ref.get();
    if (value != null) {
      return value;
    }
    Reference<V> newRef = (Reference<V>) VALUE.getVolatile(this);
    if (newRef == ref) {
      return null; // Reference is up to date - it must have been cleared by GC.
    }
    ref = newRef;
  }
}

Also wonder if there are any cases where getValueReference would need to be implemented accordingly.

@ben-manes
Owner

Yeah, I suppose it's a preference of which is more readable? The shorter code is concise, and I'd assume that a plain read is free since a VarHandle is an intrinsic. But perhaps it is less obvious compared to your longer form. I'll review for style before merging the fixes, as I'm not sure which I prefer.

It looks like getValueReference is only called by drainValueReferences(). As ReferenceQueue#poll() is synchronized, each iteration forces a new memory barrier we can piggyback on. Thanks to your resurrection patch the interactions look safe.

ben-manes added a commit that referenced this issue Jul 1, 2021
1. When an entry is updated then a concurrent reader should observe
either the old or new value. This operation replaces the j.l.Reference
instance stored on the entry and the old referent becomes eligible for
garbage collection. A reader holding the stale Reference may therefore
return a null value, which is more likely due to the cache proactively
clearing the referent to assist the garbage collector.

When a null value is read then an extra volatile read is used to
validate that the Reference instance is still held by the entry. This
retry loop has negligible cost.

2. When an entry is eligible for removal due to its value being garbage
collected, then during the eviction's atomic map operation this
eligibility must be verified. If concurrently the entry was resurrected
and a new value set, then the cache writer has already dispatched the
removal notification and established a live mapping. If the evictor does
not detect that the cause is no longer valid, then it would incorrectly
discard the mapping with a removal notification containing a non-null
key, non-null value, and collected removal cause.

Like expiration and size policies, the reference eviction policy will
now validate and no-op if the entry is no longer eligible.

3. When the fixed expiration setting is dynamically adjusted, an
expired entry may be resurrected as no longer eligible for removal.
While the map operation detected this case, stemming from the entry
itself being updated and its lifetime reset, the outer eviction loop
could retry indefinitely due to a stale read of the fixed duration.
This caused the loop to retry the ineligible entry, but instead it
can terminate when eviction fails because it scans a queue ordered
by the expiration timestamp.

Co-authored-by: Justin Horvitz <jhorvitz@google.com>
ben-manes added a commit that referenced this issue Jul 1, 2021
@ben-manes
Owner

When adding more coverage on resurrected entries, I found a bug if the fixed expiration time is updated. This was stuck in a retry loop since it did not re-read the duration but rather kept it in a stale variable outside of the loop. Thus it kept evaluating the next entry as expired. Since this queue is ordered the loop can terminate when evictEntry fails.
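The fix can be sketched in isolation (hypothetical names; not the actual BoundedLocalCache code): re-read the adjustable duration on every iteration, and stop at the first non-expired entry since the queue is ordered by write time.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

public class ExpirationSketch {
  record Entry(String key, long writeTime) {}

  // The dynamically adjustable fixed duration; read it inside the loop so a
  // concurrent adjustment is observed instead of spinning on a stale local.
  static final AtomicLong expireAfterNanos = new AtomicLong(100);

  static int evictExpired(Queue<Entry> writeOrderQueue, long now) {
    int evicted = 0;
    for (Entry e; (e = writeOrderQueue.peek()) != null; ) {
      if ((now - e.writeTime()) < expireAfterNanos.get()) {
        break; // queue is ordered by write time, so the rest cannot be expired
      }
      writeOrderQueue.poll();
      evicted++;
    }
    return evicted;
  }

  public static void main(String[] args) {
    Queue<Entry> queue = new ArrayDeque<>();
    queue.add(new Entry("a", 0));
    queue.add(new Entry("b", 50));
    queue.add(new Entry("c", 90));
    // At now=120 with a 100ns lifetime, only "a" (age 120) is expired.
    if (evictExpired(queue, 120) != 1) throw new AssertionError();
    if (queue.size() != 2) throw new AssertionError();
  }
}
```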

I'll backport these changes to 2.x tomorrow and cut a release. You should have a new version (2.x, 3.x preferred) by Friday morning.

ben-manes added a commit that referenced this issue Jul 1, 2021
ben-manes added a commit that referenced this issue Jul 1, 2021
ben-manes added a commit that referenced this issue Jul 1, 2021
ben-manes added a commit that referenced this issue Jul 2, 2021
ben-manes added a commit that referenced this issue Jul 2, 2021
@ben-manes
Owner

Released in 2.9.2 and 3.0.3. Thanks again for all your help on this.

@justinhorvitz
Contributor Author

Thank you!

bazel-io pushed a commit to bazelbuild/bazel that referenced this issue Jul 7, 2021
Contains critical fixes for ben-manes/caffeine#568.

Closes #13651.

Signed-off-by: Philipp Wollermann <philwo@google.com>