
Some new findings on the "dead lock" in the previous issues #768

Closed · HashZhang opened this issue Sep 4, 2022 · 5 comments
HashZhang commented Sep 4, 2022

I have something different to share: I don't think this is a deadlock; it is a problem related to the eviction mechanism. We also observed almost all threads waiting, with the following thread dump:

"XNIO-2 task-5" #329 prio=5 os_prio=0 cpu=56327.98ms elapsed=6392.53s tid=0x0000fffeac39b900 nid=0x18f waiting on condition  [0x0000fffd5b0b8000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17.0.2/Native Method)
	- parking to wait for  <0x00000006fc8bc8b0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@17.0.2/LockSupport.java:211)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:715)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:938)
	at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@17.0.2/ReentrantLock.java:153)
	at java.util.concurrent.locks.ReentrantLock.lock(java.base@17.0.2/ReentrantLock.java:322)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1458)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1382)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2471)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2387)
	at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
	at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
.......................................................

We are using Caffeine version 2.9.3 with JDK 17, as you can see in the thread stack. Our cache configuration is:

Caffeine.newBuilder().expireAfterWrite(1, TimeUnit.SECONDS).build();

We enabled JFR and observed something very interesting:

  1. From the thread dump, almost all threads were waiting on the "eviction lock". This lock is acquired when: (1) a key's value has been computed and the thread then executes afterWrite(), which is where almost all of my threads are waiting; or (2) the eviction task runs, by default on the common ForkJoinPool, which has the thread stack below:

"ForkJoinPool.commonPool-worker-2" #254 daemon prio=5 os_prio=0 cpu=102959.38ms elapsed=6394.88s tid=0x0000fffee80133c0 nid=0x141 waiting for monitor entry  [0x0000fffe0077e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.util.concurrent.ConcurrentHashMap.computeIfPresent(java.base@17.0.2/ConcurrentHashMap.java:1819)
	- parking to wait for  <0x00000006ee8b6d40> (a java.util.concurrent.ForkJoinPool)
	- waiting to lock <0x00000006f047c0c8> (a java.util.concurrent.ConcurrentHashMap$Node)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntry(BoundedLocalCache.java:912)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.expireAfterWriteEntries(BoundedLocalCache.java:836)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.expireEntries(BoundedLocalCache.java:785)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1491)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1460)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3359)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.exec(BoundedLocalCache.java:3346)
	at java.util.concurrent.ForkJoinTask.doExec(java.base@17.0.2/ForkJoinTask.java:373)
	at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@17.0.2/ForkJoinPool.java:1182)
	at java.util.concurrent.ForkJoinPool.scan(java.base@17.0.2/ForkJoinPool.java:1655)
	at java.util.concurrent.ForkJoinPool.runWorker(java.base@17.0.2/ForkJoinPool.java:1622)
	at java.util.concurrent.ForkJoinWorkerThread.run(java.base@17.0.2/ForkJoinWorkerThread.java:165)

As shown above, the eviction lock is held by this thread, but the thread is itself blocked on a ConcurrentHashMap$Node. Let's take a closer look through JFR:

[JFR screenshot: blocking event of the ForkJoinPool worker]

From the figure we can see that this thread has been blocked for 1.371 s, and that the lock was previously held by XNIO-2 task-7, which was loading a value into this cache, as shown in JFR:
[JFR screenshot: lock held by XNIO-2 task-7 while loading a value]

Therefore, the problem is caused by the following sequence (a sketch reproducing it follows the list):

  1. Assume we have 64 threads reading from this cache using cache.get(key, k -> {....}), and the cache has expiration configured.
  2. All threads are loading different keys. Some of them (call them group 1) have finished computing their key and are executing afterWrite(); some of them (group 2) are still computing their key.
  3. ForkJoinPool.commonPool() executes the eviction task. Once it has acquired the eviction lock, it blocks all threads in group 1. And while it is checking (via ConcurrentHashMap.computeIfPresent) a key that is still being computed by a thread in group 2, it is itself blocked. If the ForkJoinPool worker spends too long waiting on group 2 (and there may be a continuous stream of new group 2 computations), all group 1 threads are also blocked for a long time, and it looks like a deadlock.
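To make the sequence concrete, here is a minimal sketch that reproduces this kind of contention. The key space and the 100 ms loader delay are made-up values for illustration; only the 64-thread count and the expire-after-write configuration come from the description above:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class EvictionContentionDemo {
  public static void main(String[] args) {
    Cache<Integer, String> cache = Caffeine.newBuilder()
        .expireAfterWrite(1, TimeUnit.SECONDS)
        .build();

    ExecutorService pool = Executors.newFixedThreadPool(64);
    for (int i = 0; i < 64; i++) {
      pool.submit(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          int key = ThreadLocalRandom.current().nextInt(10_000);
          // The slow loader holds the hash-bin lock (group 2). If the maintenance
          // task tries to evict an entry in that bin, it waits there while still
          // holding the eviction lock that every afterWrite() caller (group 1)
          // is queued on.
          cache.get(key, k -> {
            try {
              Thread.sleep(100);
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
            return "value-" + k;
          });
        }
      });
    }
  }
}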
@HashZhang (Author):

#203
#527

@HashZhang (Author):

#506

@HashZhang (Author):

#672

@ben-manes (Owner):

Unfortunately this is a known issue that cannot be mitigated directly. If the eviction is blocked on a computation to remove the entry, which is due to the hashbin being locked by another computation, then the eviction must wait. If that is a very long running operation then writes will accumulate, create backpressure, and further cache writes will be blocked to avoid runaway growth. A log warning was added in v3.0.6 but has not been released in a backport yet.

Both Caffeine and ConcurrentHashMap hint at this problem by saying,

* Some attempted update operations on this cache by other threads may be blocked while the
* computation is in progress, so the computation should be short and simple, and must not attempt
* to update any other mappings of this cache.

The cache cannot detect this case as there is no tryCompute type of operation in ConcurrentHashMap. Instead we recommend decoupling the computation from the hash table's locks. That can be done by using AsyncCache, so the map operation simply adds or removes a CompletableFuture.
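For illustration, a minimal sketch of that decoupling with AsyncCache; expensiveLoad here is a hypothetical stand-in for the caller's slow computation:

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class AsyncCacheSketch {
  static final AsyncCache<String, String> CACHE = Caffeine.newBuilder()
      .expireAfterWrite(1, TimeUnit.SECONDS)
      .buildAsync();

  static CompletableFuture<String> lookup(String key) {
    // The map operation only installs the CompletableFuture, so the hash bin is
    // released almost immediately; the slow computation then runs on the cache's
    // executor and eviction does not have to wait for it.
    return CACHE.get(key, k -> expensiveLoad(k));
  }

  static String expensiveLoad(String key) {  // hypothetical slow computation
    return "value-" + key;
  }
}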

When the cache detects this problem in v3 it logs the following,

The cache is experiencing excessive wait times for acquiring the eviction lock. This may indicate that a long-running computation has halted eviction when trying to remove the victim entry. Consider using AsyncCache to decouple the computation from the map operation.

@HashZhang (Author):

Noted with many thanks. We'll consider that.
