
Some new findings on the "dead lock" in the previous issues #768

Closed · HashZhang opened this issue Sep 4, 2022 · 5 comments
HashZhang commented Sep 4, 2022

I have something different to share: I don't think this is a deadlock; it is a problem related to the eviction mechanism. We also observed almost all threads waiting, with the following thread dump:

"XNIO-2 task-5" #329 prio=5 os_prio=0 cpu=56327.98ms elapsed=6392.53s tid=0x0000fffeac39b900 nid=0x18f waiting on condition  [0x0000fffd5b0b8000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17.0.2/Native Method)
	- parking to wait for  <0x00000006fc8bc8b0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@17.0.2/LockSupport.java:211)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:715)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:938)
	at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@17.0.2/ReentrantLock.java:153)
	at java.util.concurrent.locks.ReentrantLock.lock(java.base@17.0.2/ReentrantLock.java:322)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1458)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1382)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2471)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2387)
	at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:108)
	at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:62)
.......................................................

We are using Caffeine version 2.9.3 with JDK 17, as you can see in the thread stack. Our cache configuration is:

Caffeine.newBuilder().expireAfterWrite(1, TimeUnit.SECONDS).build();

We enabled JFR and observed something very interesting:

  1. From the thread dump, almost all threads were waiting on the "eviction lock". This lock is acquired when: (1) a key's value has been computed and the thread then executes afterWrite(), which is where almost all of my threads are waiting; or (2) the eviction task runs, by default on the common ForkJoinPool, which has the thread stack below:

"ForkJoinPool.commonPool-worker-2" #254 daemon prio=5 os_prio=0 cpu=102959.38ms elapsed=6394.88s tid=0x0000fffee80133c0 nid=0x141 waiting for monitor entry  [0x0000fffe0077e000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.util.concurrent.ConcurrentHashMap.computeIfPresent(java.base@17.0.2/ConcurrentHashMap.java:1819)
	- parking to wait for  <0x00000006ee8b6d40> (a java.util.concurrent.ForkJoinPool)
	- waiting to lock <0x00000006f047c0c8> (a java.util.concurrent.ConcurrentHashMap$Node)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntry(BoundedLocalCache.java:912)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.expireAfterWriteEntries(BoundedLocalCache.java:836)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.expireEntries(BoundedLocalCache.java:785)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1491)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1460)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3359)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.exec(BoundedLocalCache.java:3346)
	at java.util.concurrent.ForkJoinTask.doExec(java.base@17.0.2/ForkJoinTask.java:373)
	at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.base@17.0.2/ForkJoinPool.java:1182)
	at java.util.concurrent.ForkJoinPool.scan(java.base@17.0.2/ForkJoinPool.java:1655)
	at java.util.concurrent.ForkJoinPool.runWorker(java.base@17.0.2/ForkJoinPool.java:1622)
	at java.util.concurrent.ForkJoinWorkerThread.run(java.base@17.0.2/ForkJoinWorkerThread.java:165)

As shown above, the eviction lock is held by this thread, but the thread is itself blocked on a ConcurrentHashMap$Node. Let's take a closer look through JFR:

[JFR screenshot: blocking event of the ForkJoinPool worker]

From the figure we can see that this thread has been blocked for 1.371 s, and that the lock was previously held by XNIO-2 task-7, which was loading a value into this cache, as shown in JFR:
[JFR screenshot: lock held by XNIO-2 task-7 while loading a value]

Therefore, the problem is caused by the following sequence (a sketch reproducing it follows the list):

  1. Assume we have 64 threads reading from this cache using cache.get(key, k -> {....}), and the cache has expiration configured.
  2. All threads are loading different keys. Some of them (call them group 1) have finished computing their key and are executing afterWrite(); some of them (group 2) are still computing their key.
  3. ForkJoinPool.commonPool() executes the eviction task. Once it has acquired the eviction lock, it blocks all threads in group 1. And while it is checking (via ConcurrentHashMap.computeIfPresent) a key that is still being computed by a thread in group 2, it is itself blocked. If the ForkJoinPool worker spends too long waiting on group 2 (and there may be a continuous stream of new group 2 computations), all group 1 threads are also blocked for a long time, and it looks like a deadlock.
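To make the sequence concrete, here is a minimal sketch that reproduces this kind of contention. The key space and the 100 ms loader delay are made-up values for illustration; only the 64-thread count and the expire-after-write configuration come from the description above:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class EvictionContentionDemo {
  public static void main(String[] args) {
    Cache<Integer, String> cache = Caffeine.newBuilder()
        .expireAfterWrite(1, TimeUnit.SECONDS)
        .build();

    ExecutorService pool = Executors.newFixedThreadPool(64);
    for (int i = 0; i < 64; i++) {
      pool.submit(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          int key = ThreadLocalRandom.current().nextInt(10_000);
          // The slow loader holds the hash-bin lock (group 2). If the maintenance
          // task tries to evict an entry in that bin, it waits there while still
          // holding the eviction lock that every afterWrite() caller (group 1)
          // is queued on.
          cache.get(key, k -> {
            try {
              Thread.sleep(100);
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
            return "value-" + k;
          });
        }
      });
    }
  }
}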
@HashZhang (Author):

#203
#527

@HashZhang (Author):

#506

@HashZhang (Author):

#672

@ben-manes (Owner):

Unfortunately this is a known issue that cannot be mitigated directly. If the eviction is blocked on a computation to remove the entry, which is due to the hashbin being locked by another computation, then the eviction must wait. If that is a very long running operation then writes will accumulate, create backpressure, and further cache writes will be blocked to avoid runaway growth. A log warning was added in v3.0.6 but has not been released in a backport yet.

Both Caffeine and ConcurrentHashMap hint at this problem by saying,

* Some attempted update operations on this cache by other threads may be blocked while the
* computation is in progress, so the computation should be short and simple, and must not attempt
* to update any other mappings of this cache.

The cache cannot detect this case as there is no tryCompute type of operation in ConcurrentHashMap. Instead we recommend decoupling the computation from the hash table's locks. That can be done by using AsyncCache, so the map operation simply adds or removes a CompletableFuture.
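For illustration, a minimal sketch of that decoupling with AsyncCache; expensiveLoad here is a hypothetical stand-in for the caller's slow computation:

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class AsyncCacheSketch {
  static final AsyncCache<String, String> CACHE = Caffeine.newBuilder()
      .expireAfterWrite(1, TimeUnit.SECONDS)
      .buildAsync();

  static CompletableFuture<String> lookup(String key) {
    // The map operation only installs the CompletableFuture, so the hash bin is
    // released almost immediately; the slow computation then runs on the cache's
    // executor and eviction does not have to wait for it.
    return CACHE.get(key, k -> expensiveLoad(k));
  }

  static String expensiveLoad(String key) {  // hypothetical slow computation
    return "value-" + key;
  }
}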

When the cache detects this problem in v3 it logs the following,

The cache is experiencing excessive wait times for acquiring the eviction lock. This may indicate that a long-running computation has halted eviction when trying to remove the victim entry. Consider using AsyncCache to decouple the computation from the map operation.

@HashZhang (Author):

Noted with many thanks. We'll consider that.
