
Unintuitive behavior (bug?) with put() #568

Closed
justinhorvitz opened this issue Jun 28, 2021 · 19 comments
Comments

@justinhorvitz
Contributor

Upon migrating from Guava cache to Caffeine in Bazel, we are seeing some flakiness in tests that heavily exercise weak/soft-valued caches. It appears that the behavior of put() for an already-present key in Caffeine diverges from Guava cache (and ConcurrentHashMap): a concurrent thread calling getIfPresent can observe null, as opposed to either the old value or the new value.

Note that this strange behavior occurs even if the put call does not change the existing value. I believe it's the result of non-atomically clearing the existing value reference before setting it.

The following reproduces the issue within a few iterations:

Cache<String, String> cache = Caffeine.newBuilder().weakValues().build();
String key = "key";
String val = "val";
cache.put(key, val);

AtomicReference<Exception> e = new AtomicReference<>();
List<Thread> threads = new ArrayList<>();
for (int i = 0; i < 10; i++) {
  int name = i;
  Thread t =
          new Thread(
              () -> {
                for (int j = 0; j < 1000; j++) {
                  if (Math.random() < .5) {
                    cache.put(key, val);
                  } else if (cache.getIfPresent(key) == null) {
                    e.compareAndSet(
                        null,
                        new IllegalStateException(
                            "Thread " + name + " observed null on iteration " + j));
                    break;
                  }
                }
              });
  threads.add(t);
  t.start();
}

for (Thread t : threads) {
  t.join();
}

if (e.get() != null) {
  throw e.get();
}
@ben-manes
Owner

Thanks for the report and isolating where the problem is likely stemming from. From reviewing that setter logic, I think you're right as to where the bug is. I can confirm that if I remove the clear() statement then your test passes.

  1. The simplest fix would be to remove the clear() when setting. This would add a little more GC churn, but should be okay.
  2. The second option is to clear after the reference is set and make getValue() resilient to a stale read. It could race to read the old Reference prior to the set, but be context switched out to after the clear() and observe a null value. This would mean that a getValue() would need a retry loop and a volatile read barrier.

Do you have any preference / recommendation on which strategy we should prefer?

I can try to get a bug fix out tonight (or this week if not) for 2.x and 3.x development lines.

@justinhorvitz
Contributor Author

I don’t think I know the architecture well enough to make a recommendation. Generally we just want some value instead of null.

Can you explain what the extra gc churn would be from? Maybe there’s a way to special case and do nothing if the value is the same (kind of like putIfAbsent).

Thanks for the quick attention.

@ben-manes
Owner

The GC has to clear out all Reference objects explicitly when collecting the entry. That requires tracking weak/soft references, which adds additional overhead (your colleague Jeremy Manson wrote about this in his blog). The hope is that by explicitly clearing the reference, the GC does a little less work. We want to be GC hygienic, e.g. j.u.LinkedList#clear() nulls out the elements rather than merely setting the head/tail to null, in order to assist when the nodes inhabit multiple generations (which would otherwise require a full GC to collect).

Since GC is magical and we'd need an expert like Jeremy to tell us the optimal incarnation, the clear() is merely a best practice in hopes of helping. It might not be doing anything worthwhile. Your core java team might have some insights here.
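As a standalone illustration of why the explicit clear() can help (a sketch, not Caffeine code): a Reference that is cleared by application code never gets cleared or enqueued by the collector, so the GC skips reference processing for it entirely.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class ClearSketch {
  public static void main(String[] args) {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();
    WeakReference<Object> ref = new WeakReference<>(new Object(), queue);

    // Explicitly clearing drops the referent immediately; the GC will
    // never clear or enqueue this Reference itself, saving it work.
    ref.clear();

    if (ref.get() != null) throw new AssertionError("referent should be cleared");
    if (queue.poll() != null) throw new AssertionError("cleared refs are not enqueued");
  }
}
```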

The other option is to check on read if a stale reference was observed. If null, then a new read barrier ensures that the thread sees the prior write (and the next iteration's plain read cannot go back in time). That would look something like,

public final V getValue() {
  for (;;) {
    // VALUE is the VarHandle for the entry's value field; plain read first
    Reference<V> ref = (Reference<V>) VALUE.get(this);
    V value = ref.get();
    if ((value != null) || (ref == VALUE.getVolatile(this))) {
      return value;
    }
  }
}

public final void setValue(V value, ReferenceQueue<V> referenceQueue) {
  Reference<V> ref = (Reference<V>) getValueReference();
  VALUE.set(this, new WeakValueReference<V>(getKeyReference(), value, referenceQueue));
  ref.clear();
}

The slightly higher cost in user code might be preferable to adding GC work. This is very much a micro-optimization, but generally worth considering in data structure classes.
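For readers outside the codebase, the same idea can be sketched without VarHandles by letting an AtomicReference stand in for the entry's value field (the Holder class and method names here are hypothetical, not Caffeine's):

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.atomic.AtomicReference;

public class WeakValueHolderSketch {
  static final class Holder<V> {
    final AtomicReference<WeakReference<V>> valueRef;

    Holder(V value) {
      valueRef = new AtomicReference<>(new WeakReference<>(value));
    }

    V getValue() {
      WeakReference<V> ref = valueRef.getPlain(); // plain read
      for (;;) {
        V value = ref.get();
        if (value != null) {
          return value;
        }
        WeakReference<V> current = valueRef.get(); // volatile re-read to validate
        if (current == ref) {
          return null; // reference is current, so the GC really cleared it
        }
        ref = current; // raced with setValue; retry with the live reference
      }
    }

    void setValue(V value) {
      WeakReference<V> old = valueRef.getPlain();
      valueRef.set(new WeakReference<>(value)); // publish the new reference first
      old.clear();                              // then clear the stale one
    }
  }

  public static void main(String[] args) {
    Holder<String> holder = new Holder<>("old");
    if (!"old".equals(holder.getValue())) throw new AssertionError();
    holder.setValue("new");
    if (!"new".equals(holder.getValue())) throw new AssertionError();
  }
}
```

A reader that observes null after setValue() replaced the reference sees a different WeakReference on the volatile re-read and retries, so it returns either the old or the new value, never a spurious null.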

@justinhorvitz
Contributor Author

Unfortunately I'm finding that while the proposed solutions fix my provided repro, they don't actually fix the issue in our tests.

I don't have a full explanation for what's going on yet. Upon adding more debugging info, I am seeing removal notifications of type COLLECTED but with a non-null value - is that ever supposed to happen? I'm seeing this both with and without the proposed changes discussed above. It seems to happen right after the call to put. When I get such a removal notification, I verified that I still have a strong reference to the object, so it should not have been collected.

@ben-manes
Owner

Can you narrow it down to a unit test? Let's debug only with the removal of Reference.clear() as the safest option.

A COLLECTED may occur if either the key or value is null. It should not occur if both are non-null.

@justinhorvitz
Contributor Author

I'll try to get a better explanation tomorrow, but I'm definitely seeing a COLLECTED notification with a non-null key and value just prior to the test failing with the missing cache entry.

I grabbed a stack trace of what's submitting the weird removal notification:

com.github.benmanes.caffeine.cache.BoundedLocalCache.notifyRemoval(BoundedLocalCache.java:331)
com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntry(BoundedLocalCache.java:979)
com.github.benmanes.caffeine.cache.BoundedLocalCache.drainValueReferences(BoundedLocalCache.java:1523)
com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1483)
com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1454)
com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3299)
com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.exec(BoundedLocalCache.java:3286)
java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

We're using strong keys & weak values, version 2.8.5.

@justinhorvitz
Contributor Author

Line numbers are probably off because I added some debugging code.

@ben-manes
Owner

Hmm... I do know that there is a bug in Guava where that can occur. Nothing jumps out when scanning over the code; it appears correct, since it checks for null before supplying that cause. Thanks for digging into this.

@ben-manes
Owner

ben-manes commented Jun 29, 2021

Thinking about solution 1, removing the Reference#clear() call: I don't think that will actually work. Since the value is eligible for collection at any time, the GC might clear it by the time we dereference and obtain the null referent. The retry loop to validate that we correctly observed null as the live mapping is therefore required.
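A minimal standalone demonstration of that point: the collector may clear a weak referent on its own, with no call to clear(), as soon as no strong reference remains.

```java
import java.lang.ref.WeakReference;

public class GcMayClear {
  public static void main(String[] args) throws InterruptedException {
    WeakReference<Object> ref = new WeakReference<>(new Object());

    // No strong reference to the referent remains, so the collector is
    // free to clear it at any time; nudge it and wait briefly.
    for (int i = 0; i < 100 && ref.get() != null; i++) {
      System.gc();
      Thread.sleep(10);
    }

    if (ref.get() != null) throw new AssertionError("GC did not clear the weak reference");
  }
}
```

So even without an explicit clear(), a reader can observe a null referent from the live Reference, which is exactly why the validation loop is needed to tell "GC cleared it" apart from "I read a stale Reference."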

@ben-manes
Owner

Please try this patch in your tests and see if it resolves the original problem. This implements the retry loop to resolve the race.

@justinhorvitz
Contributor Author

I think that the issue as originally described is not actually affecting our tests, I just discovered it by accident. Instead, my understanding is that we're encountering a race between cleanup of an evicted value and a call to put to update that value.

  1. During cleanup, we check that the node's value reference is still the one that was enqueued.
  2. A concurrent call to put updates the node's value reference to a live value.
  3. The call to evictEntry proceeds to obtain a lock on the node, but does not check whether the node was updated with a new value. Thus we proceed to clean up a live entry.

With that understanding, I modified evictEntry with the following check and the issue disappeared, even without any of the other changes we discussed:

synchronized (n) {
  value[0] = n.getValue();
  if (cause == RemovalCause.COLLECTED && value[0] != null) {
    resurrect[0] = true;
    return n;
  }
}

Does this explanation and solution make sense to you?

I still think we should fix the originally reported issue as well.

@ben-manes
Owner

That's another excellent find! Yes, lets fix both.

I think that we can write a unit test by downcasting to the implementation (BoundedLocalCacheTest), clearing the reference to simulate a garbage collection, and throwing threads to catch the race.

@justinhorvitz
Contributor Author

Here's a stress test that seems to reproduce the 2nd issue:

AtomicReference<Exception> e = new AtomicReference<>();
Cache<String, Object> cache =
        Caffeine.newBuilder()
            .weakValues()
            .removalListener(
                (k, v, cause) -> {
                  if (cause == RemovalCause.COLLECTED && v != null) {
                    e.compareAndSet(null, new IllegalStateException("Evicted a live value: " + v));
                  }
                })
            .build();

String key = "key";
cache.put(key, new Object());

List<Thread> threads = new ArrayList<>();
AtomicBoolean missing = new AtomicBoolean();
for (int i = 0; i < 100; i++) {
  Thread t =
          new Thread(
              () -> {
                for (int j = 0; j < 1000; j++) {
                  if (e.get() != null) {
                    break;
                  }
                  if (Math.random() < .01) {
                    System.gc();
                    cache.cleanUp();
                  } else if (cache.getIfPresent(key) == null && !missing.getAndSet(true)) {
                    cache.put(key, new Object());
                    missing.set(false);
                  }
                }
              });
  threads.add(t);
  t.start();
}

for (Thread t : threads) {
  t.join();
}

if (e.get() != null) {
  throw e.get();
}

As for the fix for the first issue - I don't understand why a loop is necessary. Can we just follow up the non-volatile read with a volatile read?

I'm not sure of the best way to write the tests with the existing framework, so I'm hoping you can help me get these ideas merged.

@ben-manes
Owner

Thank you @justinhorvitz. Sorry that I haven't been able to focus on this much (it's a meeting day for me). My plan is to translate your work into our test framework and a commit, and add you as co-authored-by for attribution. Of course you are welcome to send a PR, but that is not necessary.

As for the fix for the first issue - I don't understand why a loop is necessary. Can we just follow up the non-volatile read with a volatile read?

The problem is that we have an indirection: value -> Reference -> referent. At any point the thread may be context switched out and the environment changed (the referent was collected, the value was replaced). Because of this indirection, even if we only used a volatile read a race can occur, as our decision depends on dereferencing the state and assuming nothing changed. We can get away with not validating a non-null read because non-determinism is expected there, but as you observed, a null may never have been an accurate visible result. The validation loop allows us to detect that. If we did not loop, then even if the volatile read differed we still couldn't trust the new referent, as it too might have been collected; the context switch duration might extend far into the future.

The non-volatile read is there because we can piggyback on the Map.get's volatile read. That is a memory read barrier, which enforces point-in-time visibility within the MESI cache coherence protocol. The plain read means that we know the value as of the time of the Map.get, so as a micro-optimization we don't need stronger consistency a few instructions later. Yet that read can be inconsistent, so we need a volatile read to establish a new point-in-time barrier for the validation check.

@justinhorvitz
Contributor Author

Makes sense. We're trying to distinguish between a GC clear (returning null is valid) and a stale read (need to read the updated reference). I guess I would write it as such, to avoid the double read when it loops:

public final V getValue() {
  Reference<V> ref = (Reference<V>) VALUE.get(this);
  for (;;) {
    V value = ref.get();
    if (value != null) {
      return value;
    }
    Reference<V> newRef = (Reference<V>) VALUE.getVolatile(this);
    if (newRef == ref) {
      return null; // Reference is up to date - it must have been cleared by GC.
    }
    ref = newRef;
  }
}

Also wonder if there are any cases where getValueReference would need to be implemented accordingly.

@ben-manes
Owner

Yeah, I suppose it's a preference of which is more readable? The shorter code is concise, and I'd assume that a plain read is free since a VarHandle is an intrinsic. But perhaps it is less obvious compared to your longer form. I'll review for style before merging the fixes, as I'm not sure which I prefer.

It looks like getValueReference is only called by drainValueReferences(). As ReferenceQueue#poll() is synchronized, each iteration forces a new memory barrier we can piggyback on. Thanks to your resurrection patch the interactions look safe.

ben-manes added a commit that referenced this issue Jul 1, 2021
1. When an entry is updated then a concurrent reader should observe
either the old or new value. This operation replaces the j.l.Reference
instance stored on the entry and the old referent becomes eligible for
garbage collection. A reader holding the stale Reference may therefore
return a null value, which is more likely due to the cache proactively
clearing the referent to assist the garbage collector.

When a null value is read then an extra volatile read is used to
validate that the Reference instance is still held by the entry. This
retry loop has negligible cost.

2. When an entry is eligible for removal due to its value being garbage
collected, then during the eviction's atomic map operation this
eligibility must be verified. If concurrently the entry was resurrected
and a new value set, then the cache writer has already dispatched the
removal notification and established a live mapping. If the evictor does
not detect that the cause is no longer valid, then it would incorrectly
discard the mapping with a removal notification containing a non-null
key, non-null value, and collected removal cause.

Like expiration and size policies, the reference eviction policy will
now validate and no-op if the entry is no longer eligible.

3. When the fixed expiration setting is dynamically adjusted, an
expired entry may be resurrected as no longer eligible for removal.
While the map operation detected this case, stemming from the entry
itself being updated and its lifetime reset, the outer eviction loop
could retry indefinitely due to a stale read of the fixed duration.
This caused the loop to retry the ineligible entry, but instead it
can terminate when eviction fails because it scans a queue ordered
by the expiration timestamp.

Co-authored-by: Justin Horvitz <jhorvitz@google.com>
ben-manes added a commit that referenced this issue Jul 1, 2021
@ben-manes
Owner

When adding more coverage on resurrected entries, I found a bug if the fixed expiration time is updated. This was stuck in a retry loop since it did not re-read the duration but rather kept it in a stale variable outside of the loop. Thus it kept evaluating the next entry as expired. Since this queue is ordered the loop can terminate when evictEntry fails.
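The fix can be sketched in isolation (hypothetical names; not the actual BoundedLocalCache code): re-read the adjustable duration on every iteration, and stop at the first non-expired entry since the queue is ordered by write time.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

public class ExpirationSketch {
  record Entry(String key, long writeTime) {}

  // The dynamically adjustable fixed duration; read it inside the loop so a
  // concurrent adjustment is observed instead of spinning on a stale local.
  static final AtomicLong expireAfterNanos = new AtomicLong(100);

  static int evictExpired(Queue<Entry> writeOrderQueue, long now) {
    int evicted = 0;
    for (Entry e; (e = writeOrderQueue.peek()) != null; ) {
      if ((now - e.writeTime()) < expireAfterNanos.get()) {
        break; // queue is ordered by write time, so the rest cannot be expired
      }
      writeOrderQueue.poll();
      evicted++;
    }
    return evicted;
  }

  public static void main(String[] args) {
    Queue<Entry> queue = new ArrayDeque<>();
    queue.add(new Entry("a", 0));
    queue.add(new Entry("b", 50));
    queue.add(new Entry("c", 90));
    // At now=120 with a 100ns lifetime, only "a" (age 120) is expired.
    if (evictExpired(queue, 120) != 1) throw new AssertionError();
    if (queue.size() != 2) throw new AssertionError();
  }
}
```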

I'll backport these changes to 2.x tomorrow and cut a release. You should have a new version (2.x, 3.x preferred) by Friday morning.

ben-manes added a commit that referenced this issue Jul 1, 2021
ben-manes added a commit that referenced this issue Jul 1, 2021
ben-manes added a commit that referenced this issue Jul 1, 2021
ben-manes added a commit that referenced this issue Jul 2, 2021
ben-manes added a commit that referenced this issue Jul 2, 2021
@ben-manes
Owner

Released in 2.9.2 and 3.0.3. Thanks again for all your help on this.

@justinhorvitz
Contributor Author

Thank you!

bazel-io pushed a commit to bazelbuild/bazel that referenced this issue Jul 7, 2021
Contains critical fixes for ben-manes/caffeine#568.

Closes #13651.

Signed-off-by: Philipp Wollermann <philwo@google.com>