
single slow load deadlocks entire cache #672

Closed
jan-berge-ommedal opened this issue Feb 23, 2022 · 12 comments

@jan-berge-ommedal commented Feb 23, 2022

Hi there!

We had an interesting case with caffeine 3.0.5 in production today where a cache suddenly went from 100 rps / 3 evicts/s to 0 on some of our instances. Looking at the thread dumps, these instances were completely deadlocked with 1400 threads waiting for a single slow load.

This might be something similar to what was reported in #506.

Although there may be nothing to do about the entry experiencing an extremely slow load, it came as a surprise that other unrelated loads became blocked by it. Shouldn't they remain unaffected?

I put together a repro; run it for half a minute or so:

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

public class CaffeineDeadlock {

    public static void main(String[] args) {
        LoadingCache<Integer, String> cache = Caffeine.newBuilder()
                .expireAfterWrite(1, TimeUnit.SECONDS)
                .build(CaffeineDeadlock::load);

        /*   cleanup task and loads will try to evict this element but must wait for a single slow load **while holding the evictionLock**
         *   and completely deadlock the cache.
         *
         *   Please see BoundedLocalCache.performCleanUp():
         *
         *   void performCleanUp(@Nullable Runnable task) {
         *     evictionLock.lock();
         *     try {
         *       maintenance(task);
         *     } finally {
         *       evictionLock.unlock();
         *     }
         *     if ((drainStatus() == REQUIRED) && (executor == ForkJoinPool.commonPool())) {
         *       scheduleDrainBuffers();
         *     }
         *   }
         */
        cache.put(0, "0");

        ExecutorService executorService = Executors.newCachedThreadPool();
        IntStream.range(0, 100).forEach(value -> executorService.submit(() -> {
            while (true) {
                System.out.println("get: " + cache.get(value));
                Thread.sleep(100);
            }
        }));
    }

    private static String load(Integer integer) {
        System.out.println("load: " + integer);
        if (integer == 0) {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        return integer.toString();
    }
}


@ben-manes (Owner)

I don't think this will deadlock the cache, but it will block evictions. This is how ConcurrentHashMap operates: entries share their hash bin's lock. The likelihood can be reduced by increasing the initialCapacity, which increases the number of lock stripes, or by using an AsyncCache to decouple the computation from the map operation. The latter can be done without needing additional threads (e.g. Solr's usage), though it removes some linearization properties. I agree this is an unfortunate gotcha, but it is something we provide the tools to resolve for those who have that type of workload.
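
For illustration, a minimal sketch of the initialCapacity mitigation (the capacity value is arbitrary, not a recommendation): a larger capacity gives ConcurrentHashMap more bins, so two distinct keys are less likely to contend on the same bin lock while one of them is computing.

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

import java.util.concurrent.TimeUnit;

public class InitialCapacityExample {

    public static void main(String[] args) {
        // more hash bins means fewer chances that two hot keys share a bin lock
        LoadingCache<Integer, String> cache = Caffeine.newBuilder()
                .initialCapacity(10_000)
                .expireAfterWrite(1, TimeUnit.SECONDS)
                .build(key -> "value-" + key);

        System.out.println(cache.get(42));
    }
}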

@ben-manes (Owner)

In the public JavaDoc,

The entire method invocation is performed atomically, so the function is applied at most once per key. Some attempted update operations on this cache by other threads may be blocked while the computation is in progress, so the computation should be short and simple, and must not attempt to update any other mappings of this cache.

The internal JavaDoc of CHM,

Insertion (via put or its variants) of the first node in an
empty bin is performed by just CASing it to the bin. This is
by far the most common case for put operations under most
key/hash distributions. Other update operations (insert,
delete, and replace) require locks. We do not want to waste
the space required to associate a distinct lock object with
each bin, so instead use the first node of a bin list itself as
a lock. Locking support for these locks relies on builtin
"synchronized" monitors.

Using the first node of a list as a lock does not by itself
suffice though: When a node is locked, any update must first
validate that it is still the first node after locking it, and
retry if not. Because new nodes are always appended to lists,
once a node is first in a bin, it remains first until deleted
or the bin becomes invalidated (upon resizing).

The main disadvantage of per-bin locks is that other update
operations on other nodes in a bin list protected by the same
lock can stall, for example when user equals() or mapping
functions take a long time. However, statistically, under
random hash codes, this is not a common problem. Ideally, the
frequency of nodes in bins follows a Poisson distribution
(http://en.wikipedia.org/wiki/Poisson_distribution) with a
parameter of about 0.5 on average, given the resizing threshold
of 0.75, although with a large variance because of resizing
granularity. Ignoring variance, the expected occurrences of
list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The
first values are:

0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
more: less than 1 in ten million

Lock contention probability for two threads accessing distinct
elements is roughly 1 / (8 * #elements) under random hashes.
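
As a quick sanity check, those probabilities can be reproduced with a few lines of Java evaluating exp(-0.5) * pow(0.5, k) / k!:

public class PoissonBinSizes {

    public static void main(String[] args) {
        double factorial = 1.0;
        for (int k = 0; k <= 8; k++) {
            if (k > 0) {
                factorial *= k;
            }
            // P(bin size == k) for a Poisson distribution with mean 0.5
            double p = Math.exp(-0.5) * Math.pow(0.5, k) / factorial;
            System.out.printf("%d: %.8f%n", k, p);
        }
    }
}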

If I recall correctly, the issue in #506 was due to a stackoverflow and out-of-memory error causing the finally block to not unwind the lock. That was an application / jvm issue.

@ben-manes (Owner)

fwiw, the issue and mitigations are discussed in our FAQ.

@ben-manes (Owner)

In your test case the simplest solution is to change how the cache is configured. The following is drop-in and does not block evictions.

LoadingCache<Integer, String> cache = Caffeine.newBuilder()
    .expireAfterWrite(1, TimeUnit.SECONDS)
    .buildAsync(CaffeineDeadlock::load)
    .synchronous();

Let me know if you have any further questions.

@jan-berge-ommedal (Author)

Let me first back up and say that we are extremely happy with caffeine. We use it extensively and it works really well!

And thank you for your prompt and detailed reply. We will look into reconfiguring the cache, but also into the underlying edge case in our application that caused this single slow load, as it should never be that slow.

Still, I have a couple of further thoughts/notes/ramblings on this issue. Please bear with me:

I don't think this will deadlock the cache but it would block evictions.

Right, deadlock is not precise terminology. Nevertheless, one single read can block all other reads even with distinct keys due to global synchronization during evictions.

In the public JavaDoc: "computation should be short and simple"

The public javadoc also states:

Implementations of this interface are expected to be thread-safe, and can be safely accessed by multiple concurrent threads. and Note that multiple threads can concurrently load values for distinct keys.

It is reasonable that concurrent operations on the same key lead to contention, but it is not clear from the documentation that a slow load for a single key can cause contention across the entire cache. Since we agree this is an unfortunate gotcha, is there perhaps something that could be done about it? I don't have in-depth insight here, but would it be an alternative to skip maintenance() if the eviction lock cannot be acquired within a certain timeout?

If I recall correctly, the issue in #506 was due to a stackoverflow and out-of-memory error causing the finally block to not unwind the lock. That was an application / jvm issue.

To me, it seems that #506 was inconclusive and difficult to reproduce. It also refers to #203, which is the same story and mentions deadlocks during cleanup. I might be totally off here, but I thought it was worth mentioning since I could easily reproduce a similar situation.

fwiw, the issue and mitigations are discussed in our FAQ.

I would argue it is not the same issue as there is just one single entry being computed here. One thread was executing application code, the remaining 1.4k threads were blocked inside caffeine code.

The following is drop-in and does not block evictions.

I confirmed this to be working fine in the test case. In our application, however, we heavily rely on thread-local state, so unfortunately it is not a drop-in solution there. As long as we are able to propagate the state to the loading thread it should work, but that depends on whether the libraries we use play nicely, so we will have to look into that. Injecting a custom executor seems to be the way to go about this, or caching a lazy value that evaluates on the calling thread. Hm. "Just works" would be better!
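
Something like this sketch is what I have in mind for the custom executor route (REQUEST_CONTEXT is a made-up stand-in for our thread-local state; untested):

import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ContextPropagatingLoad {

    // made-up stand-in for our per-request thread-local state
    static final ThreadLocal<String> REQUEST_CONTEXT = new ThreadLocal<>();

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // capture the submitter's context and restore it on the loading thread
        Executor propagating = task -> {
            String context = REQUEST_CONTEXT.get();
            pool.execute(() -> {
                String previous = REQUEST_CONTEXT.get();
                REQUEST_CONTEXT.set(context);
                try {
                    task.run();
                } finally {
                    REQUEST_CONTEXT.set(previous);
                }
            });
        };

        AsyncLoadingCache<Integer, String> cache = Caffeine.newBuilder()
                .expireAfterWrite(1, TimeUnit.SECONDS)
                .executor(propagating)
                .buildAsync(key -> key + " loaded with context " + REQUEST_CONTEXT.get());

        REQUEST_CONTEXT.set("request-42");
        System.out.println(cache.get(1).join());
        pool.shutdown();
    }
}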

@ben-manes (Owner)

Since we agree this is an unfortunate gotcha, is there perhaps something that could be done about it? I don't have in-depth insight here, but would it be an alternative to skip maintenance() if the eviction lock cannot be acquired within a certain timeout?

Unfortunately I think only the documentation could be clarified. At some point the cache needs to update the eviction policy, in this case to add the items onto its O(1) priority queues. Skipping the policy update to add a new entry would lead to inconsistency where the policy is unaware of hash table entries, which is a memory leak. The cache has to be strict for these types of events, whereas it is only best-effort for events that are purely for making more optimal choices, like LRU reordering if bounded by size. These events are buffered to avoid blocking on the eviction lock, with optimistic events in a lossy buffer and strict ones in a bounded one. When the bounded write buffer is full then writers block on the lock to assist. That usually happens when using a broken Executor (e.g. JDK-8274349) or if the write rate exceeds the eviction rate (e.g. priority inversion) so the cache adds some back pressure to keep up.

It is reasonable that concurrent operations on the same key lead to contention, but it is not clear from the documentation that a slow load for a single key can cause contention across the entire cache.

I agree that this can be improved upon. We include the snippet from ConcurrentHashMap, which does hint at this:

Some attempted update operations on this cache by other threads may be blocked while the
computation is in progress, so the computation should be short and simple

Their locking mechanism means that operations like clearing or resizing the hash table are blocked while the computation is in progress. In your case the application's cache load defers to ConcurrentHashMap.computeIfAbsent, which locks inside of the map and blocks the eviction's usage of ConcurrentHashMap.compute to discard a sibling entry. As there is no try-lock mechanism for a map computation, and we can't know if two entries share the same hash table lock, there is no way for the cache to detect and avoid this scenario. You are welcome to raise the issue to Doug Lea, as I think us writing a custom hash table is undesirable and out of scope.
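
To make the coupling concrete, here is a standalone sketch of ConcurrentHashMap's bin-level locking, independent of the cache (CollidingKey is a contrived key type that forces a hash collision so both keys land in the same bin):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

public class BinLockDemo {

  // contrived key type that forces a hash collision so both keys share a bin
  record CollidingKey(String name) {
    @Override public int hashCode() { return 42; }
  }

  public static void main(String[] args) throws InterruptedException {
    var map = new ConcurrentHashMap<CollidingKey, String>();

    // a slow computation on one key holds the bin lock
    new Thread(() -> map.computeIfAbsent(new CollidingKey("slow"), key -> {
      sleepUninterruptibly(10_000);
      return "done";
    })).start();
    TimeUnit.MILLISECONDS.sleep(100); // let the slow compute acquire the bin lock

    // a compute on a *different* key in the same bin now waits for the slow one
    long start = System.nanoTime();
    map.compute(new CollidingKey("other"), (key, value) -> "quick");
    System.out.printf("blocked for ~%d ms%n",
        TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
  }

  private static void sleepUninterruptibly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}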

In our application, however, we heavily rely on thread-local state, so unfortunately it is not a drop-in solution there.

In that case you might favor using Guava's cache, which uses a custom hash table to support computations and other features. It inserts a reservation using a custom hash table entry type, but performs the computation outside of the hash table lock and under the entry's own lock. That is a lower-level equivalent to the future-based approach or a lazy value (e.g. Suppliers.memoize(s)). You can coerce Caffeine to behave the same way and avoid extra threads, similar to the FAQ's recursion example, but it is more noise. The Guava APIs are identical and it seems simpler to use that if all else is equal.
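
As a rough sketch of the lazy value route within Caffeine (using Guava's Suppliers.memoize purely for illustration; the names are arbitrary), the map computation only installs a cheap memoizing wrapper, so the hash bin lock is released immediately and the slow load runs on the calling thread, guarded by the supplier's own lock for that key only:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.google.common.base.Supplier;
import com.google.common.base.Suppliers;

import java.util.concurrent.TimeUnit;

public class MemoizedSupplierCache {

  private static final Cache<Integer, Supplier<String>> cache = Caffeine.newBuilder()
      .expireAfterWrite(1, TimeUnit.SECONDS)
      .build();

  static String get(int key) {
    // the map computation only creates the cheap wrapper; the expensive load
    // happens on the first call to get(), outside of the hash bin lock
    Supplier<String> lazy = cache.get(key, k -> Suppliers.memoize(() -> slowLoad(k)));
    return lazy.get();
  }

  private static String slowLoad(int key) {
    return Integer.toString(key); // stand-in for the expensive load
  }

  public static void main(String[] args) {
    System.out.println(get(1));
  }
}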

@jan-berge-ommedal (Author)

At some point the cache needs to update the eviction policy, in this case to add the items onto its O(1) priority queues. Skipping the policy update to add a new entry would lead to inconsistency where the policy is unaware of hash table entries, which is a memory leak.

Correct me if I am wrong (I probably am), but the policy updates seem to operate fairly independently from the hash table. Could they be executed under a different lock than the eviction lock, allowing them to complete even if evictions are blocked?

A memory leak is obviously bad, but might it actually be the preferred failure mode for some workloads? Especially when the cache size is unbounded like in the repro.

When the bounded write buffer is full then writers block on the lock to assist. That usually happens when using a broken Executor (e.g. JDK-8274349) or if the write rate exceeds the eviction rate (e.g. priority inversion) so the cache adds some back pressure to keep up.

Isn't this an interesting event that could be worth logging or to include in the cache stats? It might have saved us some time when we scrambled to figure out why some of our instances suddenly stalled...

As there is no try-lock mechanism for a map computation, and we can't know if two entries share the same hash table lock, there is no way for the cache to detect and avoid this scenario. You are welcome to raise the issue to Doug Lea, as I think us writing a custom hash table is undesirable and out of scope.

We ended up here as well and fully agree. Using the mostly awesome CHM seems like the right approach. So consideration should instead go into whether caffeine could act differently if/when this scenario occurs.

In that case you might favor using Guava's cache

Thanks for the tip but let's not go there just yet. We still love caffeine!

@ben-manes (Owner)

Correct me if i am wrong (i probably am), but the policy updates seems to operate fairly independently from the hash table.

This is how it works, unlike most other caches, which integrate the two. The map has its own locking structure, and the policy metadata is decoupled into data structures protected by the eviction lock. In a cache, an entry read or write requires updating the policy metadata, such as adding to or reordering an LRU doubly-linked list. To avoid every thread needing to acquire the lock to mutate this (à la LinkedHashMap), we use ring buffers to communicate so that work can be staged for one thread to drain. The eviction lock is typically acquired using a try-lock, so threads can submit their policy update and only one is penalized to drain and catch the policy up. Since the ring buffers are bounded, a strategy is needed when they fill up: for write events it is to assist, and for read events it is to discard.
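
A highly simplified sketch of that buffer-and-drain scheme, for illustration only (this is not the cache's actual code):

import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

// illustrative model of "buffer the event, try-lock to drain, assist when full"
final class PolicyBufferSketch {
  private final Queue<Runnable> writeBuffer = new ArrayBlockingQueue<>(128);
  private final ReentrantLock evictionLock = new ReentrantLock();

  void afterWrite(Runnable policyUpdate) {
    if (writeBuffer.offer(policyUpdate)) {
      tryDrain();            // cheap: skip if another thread already holds the lock
    } else {
      evictionLock.lock();   // buffer full: block and assist (back pressure)
      try {
        drain();
        policyUpdate.run();
      } finally {
        evictionLock.unlock();
      }
    }
  }

  private void tryDrain() {
    if (evictionLock.tryLock()) {
      try {
        drain();
      } finally {
        evictionLock.unlock();
      }
    }
  }

  private void drain() {
    for (Runnable update; (update = writeBuffer.poll()) != null; ) {
      update.run();          // apply the buffered policy updates in order
    }
  }
}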

The problem here is that eviction needs to remove an entry from the hash table, e.g. the expired entry. That removal occurs while under the eviction lock and, if successful, the entry is removed from the policy's metadata. This isn't always successful, e.g. a race where an explicit removal won or another update caused it to be ineligible ("resurrected", see evictEntry). In your case the hash table's coarse locking caused this to block forever, which caused the write buffer to fill up and further writes to block trying to assist.

The cache itself uses per-entry locks and maintains a state machine (alive, retired, dead) which assumes entries are always removed from the hash table prior to the policy. It may be possible to vary the "retired" state into two modes, retire the eviction candidate prior to the map operation, and perform that outside of the eviction lock. That incurs many more race conditions to consider, with the minimal benefit that the policy metadata is updated. The negative impact is it still leaves the eviction thread blocked indefinitely, that others might barge in to evict, etc. The root problem being that the computation is coupled to the map operation, which is outside of the cache's scope as a design decision of the underlying hash table. Thus we recommend these cases use a computing value like a future or memoizing supplier.

A memory leak is obviously bad, but might it actually be the preferred failure mode for some workloads? Especially when the cache size is unbounded like in the repro.

The counterargument is that failing fast can help identify a problem rather than masking it, e.g. one that otherwise only surfaces when servers stay up longer over a holiday break while no code is being deployed. In this case there would still be stalled threads; at best we might be able to slow down how quickly it locks up.

Isn't this an interesting event that could be worth logging or to include in the cache stats?

What would that metric be? This can occur during cache warming, e.g. putAll(...).

Thanks for the tip but let's not go there just yet. We still love caffeine!

The approach of doing this per-entry computation in Caffeine while still using the same thread is a little more verbose. Instead you use the Java Concurrency in Practice example of a future task as the lazy value. Your test case passes if rewritten as,

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

public class CaffeineNoDeadlock {

  public static void main(String[] args) {
    AsyncCache<Integer, String> cache = Caffeine.newBuilder()
        .expireAfterWrite(1, TimeUnit.SECONDS)
        .buildAsync();
    cache.synchronous().put(0, "0");

    ExecutorService executorService = Executors.newCachedThreadPool();
    IntStream.range(0, 100).forEach(value -> executorService.submit(() -> {
      while (true) {
        var future = new CompletableFuture<String>();
        var prior = cache.asMap().putIfAbsent(value, future);
        if (prior == null) {
          try {
            future.complete(load(value));
          } catch (Exception e) {
            future.completeExceptionally(e);
            throw e;
          }
        } else {
          future = prior;
        }
        System.out.println("get: " + future.join());
        Thread.sleep(100);
      }
    }));
  }

  private static String load(Integer integer) {
    System.out.println("load: " + integer);
    if (integer == 0) {
      try {
        Thread.sleep(Long.MAX_VALUE);
      } catch (InterruptedException e) {
        e.printStackTrace();
      }
    }
    return integer.toString();
  }
}

@jan-berge-ommedal (Author)

That incurs many more race conditions to consider, with the minimal benefit that the policy metadata is updated. The negative impact is it still leaves the eviction thread blocked indefinitely, that others might barge in to evict, etc. The root problem being that the computation is coupled to the map operation, which is outside of the cache's scope as a design decision of the underlying hash table.

Understood and appreciated.

What would that metric be? This can occur during cache warming, e.g. putAll(...).

I was thinking a "writeBufferFull" counter could be incremented if writers fail their 100 attempts at offering the task to the write buffer, but if this also occurs during normal operation I guess its value is limited(?)

Since the idea is to highlight the situation when writers barge in to evict, could the eviction lock's queue length (ReentrantLock.getQueueLength()) be a useful metric? Or spinning on the lock with tryLock() and a timeout in order to increment an "evictionStalled" counter?

@ben-manes (Owner)

I have been gnawing at this for a while and I think adding a log warning is the best approach. As we assume the eviction lock will be held briefly, though frequently acquired, a case where a thread is forcibly blocked for an extended amount of time is a reasonable one to warn about. If a workload expects this then the warning can be suppressed in the logger configuration. When the timeout occurs the thread will still block, as it cannot skip the work, but the warning will help highlight a possible problem with the application's usage.

ben-manes added a commit that referenced this issue Feb 27, 2022
Eviction occurs under an exclusive lock which is typically held for
very short periods to update the policy and possibly remove a victim
entry. Writes are applied to the hash table and the policy updates are
buffered, allowing the writer threads to schedule their work, tryLock
to update the policy, and move on if busy. If the write buffer then
becomes full then, to avoid a memory leak, the writers must block to
assist, creating back pressure if the write rate exceeds the eviction
rate.

The eviction's removal of the victim from the hash table may cause
unexpected blocking. A map computation performs its work under the
same lock guarding the entry, which allows for atomicity of that
operation and requires that other writes to the entry wait until
it completes. Typically this is quick, as caches are read-heavy and
the victim entry is unlikely to be computed on. However, since the
locking in ConcurrentHashMap is based on the hash bin, not the entry,
a hash collision can cause writes to different keys to delay each
other. A slow, long-running computation then blocks eviction even
though the entries differ. When this happens the write buffer
fills up and other writes are blocked, causing no more write activity
until the eviction is allowed to proceed again.

That scenario goes against the advice of this library and the JavaDoc in
ConcurrentHashMap, which encourage short and fast computations. That is
milliseconds to seconds in practice, not minutes or hours. Instead we
offer AsyncCache to decouple the computation from the map, which
sacrifices linearizability for more efficient processing.

Of course, few users will be aware of these implementation details to
make that decision early enough. Sadly, some will find out only when they
observe production problems. To assist, we now include a log warning in
the hope that it highlights the issue earlier, helps debugging, and hints
towards the required fix.
@ben-manes (Owner)

I updated afterWrite to log when the write buffer is full, the writer must acquire the eviction lock, and the wait time is excessive. While a compute should usually take a few seconds or less, the threshold is set to 30 seconds simply to give a wide margin for what should be a reasonable time. If a computation is taking that long then the cache is not optimally configured for the application, should be adjusted, and can hopefully be detected during development. The writer will log once, with a timeout exception to provide the usage stacktrace, and then block unconditionally so as to avoid flooding the log.
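
Roughly, the behavior is equivalent to the following sketch (the names, message, and structure are illustrative rather than the actual source):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

// rough model of "warn after an excessive wait, then still block"
final class WarnThenBlockSketch {
  private static final System.Logger logger = System.getLogger(WarnThenBlockSketch.class.getName());
  private static final long WARN_AFTER_SECONDS = 30;

  private final ReentrantLock evictionLock = new ReentrantLock();

  void assistWithEviction(Runnable maintenance) throws InterruptedException {
    if (!evictionLock.tryLock(WARN_AFTER_SECONDS, TimeUnit.SECONDS)) {
      // log once, with a timeout exception to capture the caller's stack trace,
      // then block unconditionally rather than flooding the log
      logger.log(System.Logger.Level.WARNING,
          "Excessive wait time acquiring the eviction lock; a slow computation may be blocking eviction",
          new TimeoutException());
      evictionLock.lock();
    }
    try {
      maintenance.run(); // stand-in for draining the write buffer and evicting
    } finally {
      evictionLock.unlock();
    }
  }
}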

I am still considering the wording for documentation updates, so that will come a bit later. Unfortunately I think most developers will only realize this problem once it occurs, so I tried to be clear in the log message and code comments. Thank you for raising this concern and sorry that it bit you. Hopefully adjusting your usage to avoid this shouldn't be too difficult.

@ben-manes (Owner)

Released 3.0.6 with this warning.
