Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected removalListener call after upgrade from 2.7.0 to 3.0.3 #593

Closed
hejcz opened this issue Sep 10, 2021 · 8 comments
Closed

Unexpected removalListener call after upgrade from 2.7.0 to 3.0.3 #593

hejcz opened this issue Sep 10, 2021 · 8 comments

Comments

@hejcz
Copy link

hejcz commented Sep 10, 2021

In part of our codebase we assume that removal listener is called only when value no longer sits in cache. Here is how we configure the cache:

ListeningExecutorService cacheExecutorService = MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(2,
            new ThreadFactoryBuilder()
                    .setNameFormat("resource-manager" + "-thread-%d")
                    .setDaemon(true)
                    .build()));

LoadingCache<String, SomeResource> cache = Caffeine.newBuilder()
            .<String, SomeResource>removalListener((key, value, cause) -> {
                // invalidate an object - it will be no longer usable
            })
            .refreshAfterWrite(Duration.ofSeconds(1))
            .executor(cacheExecutorService)
            .build(new CacheLoader<>() {
                @Override
                public @Nullable CaffeineTest.SomeResource load(String key) throws Exception {
                    // sometimes load a resource from external storage and sometimes return an already loaded object.
                }
            });

In 2.7.0 it's worked for us for a long time. Once we updated to 3.0.3 we observed that after some time some machines end up with invalidated resource - loader returns a reference to the same object so value does not change but removalListener is called anyway resulting in closing a resource. I've tried to debug caffeine a bit and found out that we can observe such behavior if in BoundedLocalCache#refreshIfNeeded

currentValue != oldValue || node.getWriteTime() != writeTime

I believe we might observe a problem because of node.getWriteTime() != writeTime part. I've wrote a test:
https://github.com/hejcz/caffeine-removal-listener-inspect/blob/main/src/test/java/io/github/hejcz/CaffeineTest.java
illustrating the problem. It doesn't fail every time but usually after 3-5 reloads it prints that a resource received from cache is invalidated.

Could you confirm that what we see is not a bug and this behavior changed between 2.7.0 and 3.0.3? Is there some alternative mechanism we can use? What we want is to clean up a value once it doesn't sits in the cache anymore. Values are unique in cache (in test they're not but in test we always load the same reference).

@ben-manes
Copy link
Owner

I will investigate over the weekend (a bit late my time). The refresh implementation was rewritten in 3.0 and it has always been very complex to support. It may be a bug as I agree with your expectations. Thanks for he test and detailed report.

@ben-manes
Copy link
Owner

An offhand glance and I think it’s because we now use the LSB of the timestamp as a refresh lock flag. That condition did not apply a mask to ignore it, so it’s incorrectly assumed the difference to be invalidation as if a manual write was performed. Since we suppress notifications for replacements with the same reference in all other cases, I think that timestamp check is wrong anyway and we can rely only on the reference check. wdyt?

@hejcz
Copy link
Author

hejcz commented Sep 10, 2021

That makes sense! I don't know Caffeine that well but if we suppress analogous replacements in all other places it is worth it to be consistent here. Can't find the place where this mask ignoring LSB is applied though. Could you link some place where it's applied?

@ben-manes
Copy link
Owner

The suppression is one minor difference from Guava and exactly for your case where the resource would be closed. An idempotent write is shouldn’t trigger a removal notification, but Guava overlooked that case.

When setting the write time it’s zero’d

void setWriteTime(Node<K, V> node, long now) {
if (expiresAfterWrite() || refreshAfterWrite()) {
node.setWriteTime(now & ~1L);
}
}

Before refreshing we prepare the lock with

Then try to acquire it

&& ((writeTime & 1L) == 0L) && node.casWriteTime(writeTime, refreshWriteTime)) {

And release it

node.casWriteTime(refreshWriteTime, writeTime);

This way multiple readers do not block on the computeIfAbsent, but skip thanks to this soft lock. It seems that optimization has not been released yet, though.

@ben-manes
Copy link
Owner

I guess it’s fine as it’s unlocked before the callback and unreleased code. I’ll investigate your test later and try to release a fix for you by Monday.

@ben-manes
Copy link
Owner

It seems this condition exists in v2.x and was introduced in 2016. The test passes in 2.x and 3.0.4-SNAPSHOT because both use the writeTime as a soft lock to avoid redundant refresh attempts. That seems to be enough to mask the problem because it avoids a redundant reload attempt from being scheduled with a stale timestamp read.

The condition as written tries to check that the cache still holds the same reference as when the reload started and that no other writes to that entry occurred (see ABA problem). This would match Guava which fails your test, as it discards the refresh if any entry modification occurs. We would also want to discard the refresh if an explicit cache.put(k, v) occurred which this detects condition discovers, so as is it is correct. However, Caffeine suppresses the listener notification if the replaced reference is not a change and that should was not accounted for.

I believe then the fix will look something like,

compute(key, (k, currentValue) -> {
  if (currentValue == null) {
    if (value == null) {
      return null;
    } else if (refreshes().get(key) == refreshFuture[0]) {
      return value;
    }
  } else if (currentValue == value) {
    return value;
  } else if ((currentValue == oldValue) && (node.getWriteTime() == writeTime)) {
    return value;
  }
  discard[0] = true;
  return currentValue;
}

I'll take a fresh look at this tomorrow, add tests, and some helpful comments to explain the code rational.

@ben-manes
Copy link
Owner

Released in 3.0.4

@hejcz
Copy link
Author

hejcz commented Sep 13, 2021

Thank you for identifying and fixing this so fast.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants