
LocalLoadingCache.getAll deadlock #771

Closed
seyoatda opened this issue Sep 11, 2022 · 12 comments

seyoatda commented Sep 11, 2022

Hi there, our service uses Caffeine as a cache, and recently some of our service pods hit a deadlock.

As the jstack log below shows, all of the other LocalLoadingCache.getAll threads were blocked by this thread.

"qtp727001376-25915" #25915 prio=5 os_prio=0 tid=0x00007f3c5406f000 nid=0x6593 runnable [0x00007f3cdebf5000]
   java.lang.Thread.State: RUNNABLE
	at java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1027)
	- locked <0x000000054a8007d0> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
	at java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:1535)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1662)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1617)
	at com.github.benmanes.caffeine.cache.LocalLoadingCache.lambda$bulkLoad$0(LocalLoadingCache.java:129)
	at com.github.benmanes.caffeine.cache.LocalLoadingCache$$Lambda$253/779365132.accept(Unknown Source)
	at java.util.HashMap.forEach(HashMap.java:1289)
	at com.github.benmanes.caffeine.cache.LocalLoadingCache.bulkLoad(LocalLoadingCache.java:128)
	at com.github.benmanes.caffeine.cache.LocalLoadingCache.loadInBulk(LocalLoadingCache.java:114)
	at com.github.benmanes.caffeine.cache.LocalLoadingCache.getAll(LocalLoadingCache.java:72)

I'm using version 2.6.2.
I've seen the issue here: #506, but it seems a little different from mine.

So I wonder: is this a bug in this old version, or is it caused by our own code?
Thanks in advance~

ben-manes (Owner) commented Sep 11, 2022

ConcurrentHashMap$ReservationNode is a "place-holder node used in computeIfAbsent and compute." This means that the put call is blocked waiting for a long-running computation to complete, e.g. from your CacheLoader.load. Since ConcurrentHashMap locks at the level of a hash bin rather than per entry, this may be an indirect collision caused by entries hashing into the same bin. Because of this, the map warns: "Some attempted update operations on this map by other threads may be blocked while computation is in progress, so the computation should be short and simple."

For long computations like these, the recommendation is to decouple that work from the hash table's operation. This can be done by using AsyncCache, where the map entry is a CompletableFuture. It also has the benefit of allowing getAll to protect against cache stampedes.
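
For illustration, a minimal sketch of that decoupling with AsyncLoadingCache; loadFromBackend is a hypothetical slow loader standing in for CacheLoader.load:

```java
import java.util.concurrent.CompletableFuture;
import com.github.benmanes.caffeine.cache.AsyncLoadingCache;
import com.github.benmanes.caffeine.cache.Caffeine;

class AsyncDecouplingSketch {
  // The loader runs on the executor; the map only stores an incomplete
  // CompletableFuture, so the hash-bin lock is released almost immediately.
  final AsyncLoadingCache<String, String> cache = Caffeine.newBuilder()
      .maximumSize(10_000)
      .buildAsync(AsyncDecouplingSketch::loadFromBackend);

  CompletableFuture<String> lookup(String key) {
    return cache.get(key); // concurrent callers share the same in-flight future
  }

  static String loadFromBackend(String key) {
    // hypothetical slow call, e.g. an HTTP request
    return "value-for-" + key;
  }
}
```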

I think the actual problem in #506 was discovered later in #203 (comment): those users had a StackOverflowError, which caused the locks to be left in a corrupted state (JEP 270).

seyoatda (Author) commented Sep 12, 2022

Thanks for the reply. But I do not understand why the ReservationNode was locked, since computeIfAbsent doesn't appear in the stack trace, and I'm not using compute or computeIfAbsent when loading the cache.

Does putIfAbsent lock the ReservationNode as well? I have read the source of ConcurrentHashMap#putIfAbsent but could not find any code related to ReservationNode in it.

BTW, I think my initial conclusion was wrong: this is not a deadlock but thread contention.

ben-manes (Owner) commented

The cache internally uses compute methods for the CacheLoader, eviction, etc. If you call LoadingCache.get(key), then that will use a map computation if the entry is not found by an optimistic prescreen (used to avoid pessimistic locking). A compute in a different thread's stack should show up corresponding to this blocking. You are welcome to attach or send the full trace to me privately, if I can be of help.

If the cache's maximum size is tiny then you might also be hitting excessive contention, because the number of locks depends on the map's size. This is generally okay because more locks are added when the map resizes to grow, but a small cache caps that growth. You can set a larger initial capacity so that ConcurrentHashMap allocates enough distinct locks to avoid this problem.
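
As a rough sketch of that suggestion (the sizes here are arbitrary examples, not a recommendation):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

class InitialCapacitySketch {
  // A small maximumSize keeps the underlying ConcurrentHashMap table small, so few
  // bins (locks) exist. Pre-sizing with initialCapacity spreads entries over more
  // bins and reduces the chance of unrelated keys contending on the same lock.
  final LoadingCache<String, String> cache = Caffeine.newBuilder()
      .initialCapacity(1_024)
      .maximumSize(100)
      .build(key -> "value-for-" + key);
}
```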

ben-manes (Owner) commented

Any updates?

seyoatda (Author) commented Sep 16, 2022

So sorry for the late reply, I've been very busy these days.
In general, I think this contention could be caused by another cache.get() nested inside cache.load(): the nested cache entry may expire at some point, and it then sends an HTTP request to fetch the new data.
Since our service's QPS is high, this can cause thread contention at times.
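
To illustrate the shape of the pattern being described (a hedged sketch; the names and the httpFetch stand-in are hypothetical):

```java
import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

class NestedLoadSketch {
  // Inner cache whose miss triggers a slow remote call.
  final LoadingCache<String, String> inner = Caffeine.newBuilder()
      .expireAfterWrite(5, TimeUnit.MINUTES)
      .build(NestedLoadSketch::httpFetch);

  // Outer cache: its loader runs inside ConcurrentHashMap's compute, so a nested
  // inner.get(...) that misses can hold the outer map's hash-bin lock for the full
  // round trip, blocking other threads whose keys hash into the same bin.
  final LoadingCache<String, String> outer = Caffeine.newBuilder()
      .maximumSize(10_000)
      .build(key -> "outer:" + inner.get(key));

  static String httpFetch(String key) {
    return "remote:" + key; // placeholder for an HTTP call
  }
}
```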

Anyway, thanks a lot for your reply. It really helps me. 👍👍👍
And I should learn more about the details of how ConcurrentHashMap works.

ben-manes (Owner) commented

Thanks for the update! You might find refreshAfterWrite useful for keeping hot data fresh while allowing cooler data to expire. That can help hide latency from user-facing requests.
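
A minimal sketch of that combination (the durations are arbitrary examples):

```java
import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

class RefreshSketch {
  // refreshAfterWrite reloads an entry asynchronously on access once it is 5 minutes
  // old, so callers keep seeing the old value instead of blocking on the reload;
  // expireAfterWrite drops entries an hour after they were last written or refreshed.
  final LoadingCache<String, String> cache = Caffeine.newBuilder()
      .refreshAfterWrite(5, TimeUnit.MINUTES)
      .expireAfterWrite(1, TimeUnit.HOURS)
      .build(key -> "value-for-" + key);
}
```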

seyoatda (Author) commented

Sorry to comment again, but I eventually found that this contention was caused by a nested load on the same cache. It was hard to figure out because the code is kind of complicated.

I've seen the Guava issue you raised to solve the deadlock problem (guava issue), and it's still open, so neither Caffeine nor Guava detects a recursive load and fails fast, right?

ben-manes (Owner) commented

Caffeine relies on ConcurrentHashMap's fail-fast detection. That should cover everything as of JDK 12, but previous releases had gaps, and on JDK 8 it simply live-locked. If you observe this in a later release then that is a new bug. Unfortunately, I doubt this was backported if you are on an older JDK.
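
For reference, a small sketch of the fail-fast behavior being relied on here (the exact behavior depends on the JDK version):

```java
import java.util.concurrent.ConcurrentHashMap;

class RecursiveUpdateSketch {
  public static void main(String[] args) {
    ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
    // Re-entering the map from inside its own compute: JDK 8 tends to live-lock
    // here, while newer JDKs typically fail fast with
    // IllegalStateException("Recursive update").
    map.computeIfAbsent("a", key -> map.computeIfAbsent("a", nested -> 1));
  }
}
```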

seyoatda (Author) commented

Thanks. It seems that we have to detect the nested call ourselves to avoid this kind of problem. It's easy to avoid when writing new code but hard when dealing with legacy code. Besides, upgrading to a newer JDK version is also a big challenge. 😂

ben-manes (Owner) commented Sep 28, 2022

Yeah, sorry it is so troublesome. I actually did catch it during the pre-Java 8 code review (2011), but unfortunately it wasn't fixed until I rediscovered it (2014).

You can kind of work around it using AsyncCache (faq), but in an annoying, brittle way. Unfortunately, tying the compute to the hash table means an addition that triggers a resize can corrupt it (the HashMap authors knew of CHM's issues but thought recursive computes were safe for them; they were wrong and restricted it in Java 9 after table corruption).

seyoatda (Author) commented

Hi, I have some new findings today: when we tried moving to Guava, the deadlock was resolved.

I found that Caffeine uses ConcurrentHashMap.computeIfAbsent in LocalLoadingCache.get(), which can cause an infinite loop when cache.get() is nested inside cache.getAll() (a ConcurrentHashMap limitation). Guava, however, does not seem to use computeIfAbsent(); it implements its own method and thus avoids the computeIfAbsent() problem.
So I wonder why Caffeine chooses computeIfAbsent() instead of a method that first checks whether the key exists, like containsKey or something. Thanks in advance.

ben-manes (Owner) commented

Caffeine can behave similarly, per the faq. In that factorial example, the computation is decoupled from the hash table and memoization is achieved by using a CompletableFuture. The example is a little more tedious because it makes sure to avoid extra threads. I suppose Loom would make that a non-issue when threads are free.
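
As a loose sketch of that FAQ-style decoupling (not the FAQ's exact code; this version simply dispatches the recursive calls on the cache's executor rather than avoiding extra threads, and assumes a Caffeine version that exposes AsyncCache via buildAsync() without a loader):

```java
import java.util.concurrent.CompletableFuture;
import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;

class MemoizedFactorial {
  final AsyncCache<Integer, Long> cache = Caffeine.newBuilder().buildAsync();

  // The mapping function only *starts* the work and returns a future, so the
  // incomplete future is installed and the hash-bin lock released before the
  // recursive cache lookups re-enter the map on the executor thread.
  CompletableFuture<Long> factorial(int n) {
    if (n <= 1) {
      return CompletableFuture.completedFuture(1L);
    }
    return cache.get(n, (key, executor) ->
        CompletableFuture.supplyAsync(() -> factorial(key - 1), executor)
            .thenCompose(future -> future)           // unwrap the nested future
            .thenApply(previous -> previous * key)); // memoized result for this key
  }
}
```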

Guava forked Java 5's ConcurrentHashMap to add functionality, but this also meant it was stuck without any of the later improvements. I didn't want to do that or write my own, as that is a large distraction and a source of bugs, and decoupling computes would be similar to the behavior in AsyncCache, so that seems like an okay workaround. Last I spoke with Doug, he does plan on redesigning this from table synchronization to per-entry locks, which would act similarly to Guava. We did try to unofficially support recursion in Guava, as seen in the wild, even if not recommended. It's just kind of a messy problem and there is a lot of intertwined complexity from different needs and features.
