
[SOLR-16489] CaffeineCache puts thread into infinite loop #1118

Closed

Conversation

tboeghk
Contributor

@tboeghk commented Oct 24, 2022

https://issues.apache.org/jira/browse/SOLR-16489

Description

Under heavy machine load, accessing any Solr CaffeineCache can lead to threads spinning in an endless loop. In our setup we had machines spinning at 70% CPU load without receiving any requests for hours.

Solution

The problem is caused by delegating the SolrCache#put method to Cache#asMap#put. Under heavy machine load with concurrent read and write access to the same key, the ConcurrentHashMap#put method does not terminate and spins in an endless loop.

Tests

The behavior occurs under heavy load when the cache is congested and read/write access happens in different threads. We have had this small fix live for a couple of weeks now without any stuck threads.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide.

@tboeghk changed the title Issue/solr 16489 → [SOLR-16489] CaffeineCache puts thread into infinite loop Oct 24, 2022
```diff
@@ -275,7 +275,7 @@ public V computeIfAbsent(K key, IOFunction<? super K, ? extends V> mappingFuncti
   @Override
   public V put(K key, V val) {
     inserts.increment();
-    V old = cache.asMap().put(key, val);
+    V old = cache.asMap().compute(key, (k, v) -> val);
```
Contributor

Does this need to use .asMap()? Can we just use cache.get(key, mapping_function)? This looks to be the same concept without trying to access the underlying map.

https://www.javadoc.io/static/com.github.ben-manes.caffeine/caffeine/3.1.1/com.github.benmanes.caffeine/com/github/benmanes/caffeine/cache/Cache.html#get(K,java.util.function.Function)

If using asMap(), compute looks correct based on https://www.javadoc.io/static/com.github.ben-manes.caffeine/caffeine/3.1.1/com.github.benmanes.caffeine/com/github/benmanes/caffeine/cache/Cache.html#asMap(), but it looks like we don't even need asMap().

Contributor Author

I wanted to stay in line with the other methods, which also call cache.asMap().

@ben-manes

Is this perhaps related to the JIT loop unrolling bug (ben-manes/caffeine#797, SOLR-16463)? (/cc @uschindler)

If you think this is instead a bug in Caffeine, please try to provide a reproducer so that we can isolate a fix. When I run stress and Lincheck tests it is fine, but these can be tricky. The put is optimized as a retry loop instead of a pessimistic compute (2x faster, but in a microbenchmark, so likely not important in an actual workload, and correctness is more important).

It looks like asMap().compute is the correct replacement, as Cache.get is the same as Map.computeIfAbsent, whereas this call needs to insert or replace an existing value.
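
For illustration, a minimal sketch (plain ConcurrentHashMap, not Solr or Caffeine code) of the semantic difference: computeIfAbsent keeps an existing value, while compute inserts or replaces unconditionally, matching put() semantics.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ComputeVsComputeIfAbsent {
    public static void main(String[] args) {
        ConcurrentMap<String, String> map = new ConcurrentHashMap<>();
        map.put("k", "old");

        // computeIfAbsent (what Cache.get(key, fn) corresponds to): the
        // existing value wins, so "old" is returned and nothing is replaced.
        System.out.println(map.computeIfAbsent("k", k -> "new")); // old

        // compute (what asMap().compute does): the remapping function's
        // result always replaces the current value, mirroring put().
        map.compute("k", (k, v) -> "new");
        System.out.println(map.get("k")); // new
    }
}
```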

@ben-manes

ConcurrentHashMap#put method does not terminate an endless loop.

This could be a livelock if Solr is performing recursive writes (e.g. a write within a computation). That is not supported and should fail fast with an error. The last case that I know of that did not error early was fixed in JDK 12, but prior versions could spin forever. That last case was related to modifying the map's size counters.
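
A minimal sketch of the unsupported recursive-write pattern described here (hypothetical demo class, not Solr code): modern JDKs detect the same-key case and fail fast, while other variants (keys colliding in one bin, or older JDKs) could spin or deadlock instead.

```java
import java.util.concurrent.ConcurrentHashMap;

public class RecursiveUpdateDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        // A write within a computation: the mapping function re-enters the
        // map while its bin is locked. JDK 9+ throws IllegalStateException
        // ("Recursive update") here; JDK 8 could hang forever instead.
        map.computeIfAbsent("key", k ->
            map.computeIfAbsent("key", k2 -> 1) + 1);
    }
}
```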

@tboeghk
Contributor Author

tboeghk commented Oct 24, 2022

We came across this bug using JDK 11 and Solr 8.11.2, so this is unlikely to be the same as ben-manes/caffeine#797.

@ben-manes

Can you try jdk12? Does the stacktrace look to be related to the linked patch?

@tboeghk
Contributor Author

tboeghk commented Oct 24, 2022

We ran into the bug using JDK 11, JDK 17, and JDK 19. I'll try to dig out the stack trace / thread dump the threads got stuck in!

@tboeghk
Contributor Author

tboeghk commented Oct 24, 2022

The threads got stuck in this infinite loop in BoundedLocalCache. I'll ask @DennisBerger1984 and @mpetris for a detailed stack trace tomorrow.

@uschindler
Contributor

Is this perhaps related to the JIT loop unrolling bug (ben-manes/caffeine#797, SOLR-16463)? (/cc @uschindler)

Unlikely. The JIT loop unrolling bug is, to my knowledge, a false assert. The generated code looks correct; it just hits an assertion recently added to Hotspot (JDK 17).

The problem here seems to exist since JDK 11.

@uschindler
Contributor

The threads got stuck in this infinite loop in BoundedLocalCache. I'll ask @DennisBerger1984 and @mpetris for a detailed stack trace tomorrow.

Actually this loop sometimes hits the assertion recently added in JDK 17. I'd rewrite that infinite loop to something more CAS-like.

@ben-manes

ben-manes commented Oct 25, 2022

Actually this loop sometimes hits the assertion recently added in JDK 17. I'd rewrite that infinite loop to something more CAS-like.

Unfortunately the challenge with this method is that ConcurrentHashMap is pessimistic in its write operations, locking before searching for the entry in the hash bin (or segment in Java 5-7). Doug had concerns about a full prescreen due to interactions with biased locking, safepoints, and GC pressure. He did eventually add a partial prescreen for the head of the bin, but that cannot be relied upon, as the table is designed to tolerate a high collision rate (a cheap/poor hash offset by red-black tree bins).

CAS-style would be a putIfAbsent loop, so it would lock even if unsuccessful. In that case the loop could be removed, as locking is the costly work, so a compute would be preferable. Some users have assumed that putIfAbsent is a cheap read if present (e.g. apache/openwhisk#2797), so the cache tries to read and falls back to locking only when appropriate.

You can think of this as a TTAS (test-and-test-and-set) loop, which has a similar speedup in comparison to the more straightforward TAS approach.
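
A rough sketch of the TAS vs. TTAS distinction (a toy spin lock, not Caffeine's actual code): TTAS spins on a cheap read and only attempts the expensive atomic write when it is likely to succeed, analogous to the cache trying an optimistic read before falling back to a locking compute.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TtasSpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    // TAS: hammer the atomic write, causing heavy cache-line contention.
    public void lockTas() {
        while (!locked.compareAndSet(false, true)) { }
    }

    // TTAS: spin on a plain read first; only CAS once the lock looks free.
    public void lockTtas() {
        while (true) {
            while (locked.get()) { }                  // cheap read-only spin
            if (locked.compareAndSet(false, true)) {  // expensive write only now
                return;
            }
        }
    }

    public void unlock() {
        locked.set(false);
    }
}
```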

@uschindler
Contributor

Hi Ben,
thanks for the insight. My comment was not directly about the type of loop (CAS, TTAS, ...); it was more about the coding style. Hotspot sometimes tries to optimize loops where it thinks it recognizes a common pattern. My proposal would be to maybe form the loop in a way that follows a common pattern (like the mentioned CAS, or a while with the condition up front, ...). This one is a for(;;) with very strange exit conditions, and I had the feeling it could cause some bugs in Hotspot and maybe also trigger the infinite loop. In Lucene's bug we had similar problems where loop invariants were incorrectly guessed.

Of course, if this is a bug in your code leading to the infinite loop, then it is not a Hotspot problem. What I do not understand: how does the current PR fix the bug?

@tboeghk
Contributor Author

tboeghk commented Oct 25, 2022

@uschindler the PR avoids the put() method, which leads to the infinite loop, by using the compute() method with a remapping function. This does not reach the infinite loop.

@ben-manes

We will need to see the stacktrace and debug details. Caffeine's loop retries only if the entry was a stale read and, once locked, is determined to have already been removed from the map. That state transition can only happen inside the lock and within a map removal operation. That leaves a small window for a stale read, with the caveat of Map.remove(key): that method discards a refresh, which could block if something weird happened during registration. Since refresh isn't being used here, I cannot deduce any reason for a bug in Caffeine's code. As ConcurrentHashMap also loops forever, and was the originally claimed source, we need the details to assess a possible cause.

@DennisBerger1984

If I remember correctly, the "prior" in line 2203 of BoundedLocalCache in Caffeine 3.3.1 was always non-null and had the keyReference DEAD_STRONG_KEY, such that the isAlive() check always returned false in line 2248. The thread spins in the enclosing for loop forever.

@DennisBerger1984

Would a stacktrace help? To reproduce the problem I would need to deploy a cluster on our side, as we were not able to trigger the bug with a test case. However, with a Gatling load test we were able to reproduce it reliably. Setting everything up takes me about 4 hours. We were able to pinpoint the endless loop by attaching to the debug endpoint and pausing all threads.

@ben-manes

Thanks @DennisBerger1984.

We came across this bug using JDK 11 and Solr 8.11.2
If I remember correctly ... in Caffeine 3.3.1

I see that Solr 8.11.2 is using 2.9.2, whereas currently it is using 3.1.1. Can we confirm that your failure happens using 3.3.1? I don't recall any related bug fixes, but it would help us narrow in on the problem.

@DennisBerger1984

DennisBerger1984 commented Oct 25, 2022

Hi @ben-manes,
I can confirm that. We've bumped the version in our Gradle build, and I just verified that in the resulting Docker image the actual jar is 3.1.1.

@ben-manes

If you do come by a stacktrace, please let me know. I am not having luck in a review of the code, but a fresh set of eyes is always welcome. If you can reproduce, then look for other exceptions, like an OutOfMemoryError or StackOverflowError, which could leave an invalid state. In those cases the JVM may not be recoverable (e.g. the ExitOnOutOfMemoryError flag).

@ben-manes

Thanks to @DennisBerger1984's help, we met to step through a debug session of the cache in this invalid state. Dennis captured this jstack output. We confirmed that the filterCache is stuck in an infinite loop due to retrieving a dead entry. When pairing on a code walkthrough, we came up with a scenario that could explain this problem and that we need some Solr experts to advise us on.

In short, is the cache key stable in the filter cache? This is an org.apache.lucene.search.Query object and appears to be mutable. Can a query be modified after it was inserted into the cache? If so, then the cache might fail to remove it during eviction, eagerly mark it as dead, and rediscover it later by a lookup. As modifiable keys are violations of the Map contract, this is not detected or handled.

During eviction, Caffeine uses a computeIfPresent to remove the victim entry. If the entry was not found in the map, then we assume it was due to an explicit, concurrent removal. This allows the cache to eagerly decrement the current size and avoid unnecessary additional evictions to stay within bounds. If there was a concurrent removal, then a RemovalTask is queued and will no-op when run. However, if the key's hash/equals were modified, then eviction would instead leave a dead entry in the map, and a subsequent lookup might rediscover it.

@DennisBerger1984 suggested that the filter cache could use an immutable cache key object. Given the subtypes, is there an easy way to do this? Would using toString() result in a key with the right uniqueness properties? If we can apply this change, then Dennis will retest and we'll see if our hypothesis holds.
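
A toy illustration of the hypothesized contract violation (a plain HashMap and a hypothetical MutableKey class for brevity): once a key's hashCode changes after insertion, removal by that key silently fails and a dead entry is left behind.

```java
import java.util.HashMap;
import java.util.Map;

public class MutableKeyDemo {
    static final class MutableKey {
        int id;
        MutableKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof MutableKey && ((MutableKey) o).id == id;
        }
        @Override public int hashCode() { return 31 + id; }
    }

    public static void main(String[] args) {
        Map<MutableKey, String> map = new HashMap<>();
        MutableKey key = new MutableKey(1);
        map.put(key, "value");

        key.id = 2; // key mutated after insertion: wrong hash bucket now

        System.out.println(map.remove(key)); // null: removal fails
        System.out.println(map.size());      // 1: a dead entry remains
    }
}
```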

@uschindler
Contributor

Query is by default an immutable object, so there should be no need for a separate cache key.
If you found a buggy query with an unstable hashCode/equals, please report this to Lucene. This is a bug. Please, no workarounds here!

@ben-manes

Thanks @uschindler. I think we need to confirm this hypothesis first. Is there an easy way to make an immutable copy for us to test with? The actual fix might be different, but this would be the easiest way to verify our guess.

@uschindler
Contributor

There is no clone method for queries, sorry. I don't know what the best way is to solve this. What type of query was causing this?

But the infinite-loop bug should be fixed, so that it throws some exception instead.

@uschindler
Contributor

You may temporarily use toString() as the cache key. But this does not work safely for all query types.

@ben-manes

I believe it was a WrappedQuery holding a complex BooleanQuery, but we did not look closely enough. I only recall that the key equality was complex, with hashmap comparisons, and when we inspected the key it had setters and non-final fields. We didn't think of capturing screenshots of the object states when debugging. Dennis might be able to offer more details.

I will think more about how to handle the infinite loop in this scenario. A retry is needed on a stale read, e.g. the reader might look up the entry, be context-switched while it is evicted, and then observe it to be dead. That's fine, and a subsequent re-read should see a fresh result. I may be able to handle this case a little better, but since it can happen safely it is tricky.
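
A generic sketch of the stale-read retry pattern being described (heavily simplified, not Caffeine's BoundedLocalCache): the retry is safe only because eviction is expected to remove the dead node from the map, so a mutated key that defeats that removal turns this loop into the observed spin.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class RetryReadSketch<K, V> {
    static final class Node<V> {
        final V value;
        volatile boolean alive = true;
        Node(V value) { this.value = value; }
    }

    private final ConcurrentMap<K, Node<V>> data = new ConcurrentHashMap<>();

    V get(K key) {
        for (;;) {
            Node<V> node = data.get(key);
            if (node == null) {
                return null;       // miss: entry absent
            }
            if (node.alive) {
                return node.value; // hit: entry is live
            }
            // Stale read: the node died between the lookup and the check.
            // Retry, relying on eviction having removed it from the map.
            // If removal failed (mutated key), this loop never terminates.
        }
    }
}
```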

@uschindler
Contributor

uschindler commented Oct 28, 2022

WrappedQuery is a Solr class. Many of the queries there are old and were never checked for final fields and correctness.
This would be a good job to start fixing the (sorry to be harsh) heavily broken legacy query bullshit in Solr. Solr should only use Lucene queries and not add those useless wrappers causing trouble.
In Lucene every query only has final fields and setters are forbidden. We had some problematic ones, but this was fixed long ago.

@uschindler
Contributor

This query must die; please make its fields final and remove the setter: https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/search/WrappedQuery.java
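
A hedged sketch of what the requested change amounts to (a hypothetical ImmutableWrappedQuery compiling against recent Lucene, not the actual SOLR-16509 patch): the delegate is fixed at construction, so hashCode/equals cannot change while the query sits in a cache.

```java
import java.util.Objects;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;

public final class ImmutableWrappedQuery extends Query {
    private final Query wrapped; // final: no setQuery(), identity fixed at construction

    public ImmutableWrappedQuery(Query wrapped) {
        this.wrapped = Objects.requireNonNull(wrapped);
    }

    public Query getWrappedQuery() {
        return wrapped;
    }

    @Override
    public String toString(String field) {
        return wrapped.toString(field);
    }

    @Override
    public void visit(QueryVisitor visitor) {
        wrapped.visit(visitor);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof ImmutableWrappedQuery
            && wrapped.equals(((ImmutableWrappedQuery) other).wrapped);
    }

    @Override
    public int hashCode() {
        return 31 * getClass().hashCode() + wrapped.hashCode();
    }
}
```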

@risdenk
Contributor

risdenk commented Oct 28, 2022

@uschindler to make sure I understand: anything in Solr that extends from https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/Query.java should have no setters and only final fields, i.e. be effectively immutable. Makes sense to me.

I filed https://issues.apache.org/jira/browse/SOLR-16509 to fix these.

@uschindler
Contributor

For BooleanQuery we have many tests in Lucene core and its own cache implementation. So there are three cases where it may break:

  • Somebody calls getClauses() and modifies them (adding entries). We should prevent that by making the clauses unmodifiable in Lucene. I'll open an issue.
  • Somebody modifies one of the queries inside the clauses. I would need to see all clauses of the above example.
  • The WrappedQuery itself is the issue. I would start there, see above.

@risdenk
Contributor

risdenk commented Nov 2, 2022

setWrappedQuery is definitely not used in Solr main: https://github.com/apache/solr/search?q=setWrappedQuery

@DennisBerger1984

@uschindler does this help?

[three attached screenshots]

@uschindler
Contributor

Yes, this helps. Interestingly, the clauses of a BooleanQuery are immutable; the list is already wrapped with Collections.unmodifiableList(). So nobody can change the clauses after calling BooleanQuery.getClauses().

@ben-manes

ben-manes commented Nov 2, 2022

Oh yes, I did not notice that Collection<Query> getClauses(Occur occur) is package-private. It returns a mutable collection from clauseSets, but since it is not public that is safe (it is probably package-private for testing).

@uschindler
Contributor

This is a non-Lucene, non-Solr query from the external Querqy plugin.

https://github.com/querqy/querqy/blob/1dbafde2eddc0938b309bbe955f127148671fd41/querqy-for-lucene/querqy-lucene/src/main/java/querqy/lucene/rewrite/FieldBoostTermQueryBuilder.java

It looks fine, too. It also implements equals and hashCode.

@uschindler
Contributor

DisjunctionMaxQuery unfortunately has a getDisjuncts that returns a mutable Multiset (a Lucene collection subclass): https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxQuery.java

@DennisBerger1984

Hey, should we provide additional information, like tests without the Querqy plugin, or is it sufficient to wait for ticket #1146?
What do you think?

@uschindler
Contributor

Hey, should we provide additional information, like tests without the Querqy plugin, or is it sufficient to wait for ticket #1146? What do you think?

If you could temporarily disable Querqy, it would be a good test. I was not able to find all the queries mentioned. At least we could rule out or confirm that Querqy is the problem.

The cleanup work in #1146 is good to have, but as WrappedQuery's setQuery was not called from code, it was in fact immutable.

I still have some problems with the DisjunctionMaxQueries (they are created by Querqy): this is a Lucene class and not fully sealed against changes in its clauses. Maybe Querqy modifies them; Solr at least does not do this.

@renekrie

renekrie commented Nov 11, 2022

@DennisBerger1984 Are you using the newish multiMatchTie as a Querqy query parameter (related to querqy/querqy#281)? Also, are you setting uq.similarityScore or qboost.similarityScore (see https://querqy.org/docs/querqy/more-about-queries.html#reference)? I'm trying to think of a place where we are manipulating queries more heavily. On the other hand, we never touch anything after the 'parse' step, and we never read from the cache, so I can't see how we could touch anything that has already been put into the cache.

@DennisBerger1984

Hi René, qboost.similarityScore is off, and multiMatchTie is set to 0.2. I'm currently running a test with defType=edismax to further drill down.

@renekrie

Thanks @DennisBerger1984! There is a smell related to multiMatchTie: querqy/querqy#384. This FieldBoost implementation would be used under the hood with multiMatchTie under certain conditions and be referenced from queries, and it is missing hashCode/equals (though I cannot see SingleFieldBoost in your object graph). It should still result just in a cache miss and not trigger a loop. Maybe for your test you could simply remove the multiMatchTie param as a first step.

@DennisBerger1984

When using defType=edismax the problem doesn't occur. Keeping Querqy as defType but removing multiMatchTie doesn't help. But to be honest, multiMatchTie was disabled anyway, I think.

@uschindler
Contributor

A completely missing equals/hashCode should not cause problems, because the cache would simply never hit. For the bug to occur, the hashCode or equals must change for an existing query after insertion.

@uschindler
Contributor

uschindler commented Nov 11, 2022

For current Lucene queries the problem may occur when you pass in a BytesRef (it is a reference) while building the query. If you change the BytesRef afterwards, it fails.

IMHO queries should clone the terms passed in. Solr has the big problem of over-reusing mutable references, although this is really not needed anymore, as the JVM can handle that with escape analysis and won't allocate instances. Actually, "reusing" is bad in newer Java versions, as it populates the heap with useless temporary (but reused) objects.
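
A short sketch of the BytesRef hazard being described (a plain HashMap as a stand-in for the cache, not Solr code): BytesRef.hashCode() is computed from the referenced bytes, so mutating a shared byte[] after using the ref as a key strands the entry.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.util.BytesRef;

public class BytesRefMutationDemo {
    public static void main(String[] args) {
        byte[] shared = "term".getBytes(StandardCharsets.UTF_8);
        BytesRef ref = new BytesRef(shared); // wraps the array, does not copy

        Map<BytesRef, String> cache = new HashMap<>();
        cache.put(ref, "cached");

        shared[0] = 'x'; // mutate the shared bytes after insertion

        // The key's hash has changed, so the entry is likely unreachable.
        System.out.println(cache.get(ref)); // most likely null
    }
}
```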

@renekrie

For current Lucene queries the problem may occur when you pass in a BytesRef (it is a reference) while building the query. If you change the BytesRef afterwards, it fails.

I briefly checked this, but as far as I can see we are always using a fresh BytesRef. Manipulations at the Querqy level (i.e. before going to the analysis chain) happen via character arrays anyway, and they aren't used any longer once a Lucene query has been produced.

@DennisBerger1984 Can I ask you to disable query rewriters and/or use empty rule sets with rewriters still enabled to narrow down the problem?

@debe
Contributor

debe commented Nov 11, 2022

@renekrie the querqy.rewriters are already completely empty during these tests.

@ben-manes

Released Caffeine 3.1.2 with this error detection.

@risdenk
Contributor

risdenk commented Nov 28, 2022

I opened SOLR-16562 to upgrade to Caffeine 3.1.2, which will detect infinite loops (but not fix the issue).

@risdenk
Contributor

risdenk commented Jan 18, 2023

@tboeghk @DennisBerger1984 @mpetris what is the status here? Did you end up finding the Query impl that is changing hashCode/equals? Is this PR still necessary?

@DennisBerger1984

Hi @risdenk, I've been rather busy lately. In the meantime we upgraded to Caffeine 3.1.2 and removed our workaround. Since then we have not encountered the problem again, but neither have we seen contract-violation messages from Caffeine. So we didn't find it, sadly.

@janhoy
Contributor

janhoy commented Feb 15, 2023

This PR is no longer needed now that we have the upgraded Caffeine version, right?

@ben-manes

The PR is not necessary, but was the underlying bug fixed? The issue was that the cache was corrupted by modifying a key's equality while it was inside a ConcurrentHashMap. The upgrade just yells if this is detected, but it was a usage error.

@DennisBerger1984

All in all, it's very hard for us to reproduce the bug reliably. I saw a couple of "An invalid state was detected" messages two weeks ago and thought I had finally found an offending query, but unfortunately they've disappeared, and the last week was free of those messages. I've scanned around 1 TB and 100 million log entries.


This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the dev@solr.apache.org mailing list. Thank you for your contribution!

@github-actions bot added the stale (PR not updated in 60 days) label Feb 18, 2024
@uschindler
Contributor

I think we should close this PR. The underlying implementation is now fixed: it yells if an issue is detected and no longer hangs.

If the warning message is still printed, it's a broken equals/hashCode and should be taken care of separately.

@uschindler closed this Feb 18, 2024