This repository has been archived by the owner on Mar 29, 2024. It is now read-only.

upgrade jetty to 9.4.44.v20210927 #121

Closed
wants to merge 1 commit into from

Conversation

noblepaul
Collaborator

@noblepaul noblepaul commented Nov 23, 2021

Most of our threads are waiting at

	java.base@11.0.8/java.lang.Object.wait(Native Method)
	java.base@11.0.8/java.lang.Object.wait(Object.java:328)
	org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:318)
	org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:90)
	org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:99)
	org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:217)
	org.apache.solr.common.util.JavaBinCodec._init(JavaBinCodec.java:211)
	org.apache.solr.common.util.JavaBinCodec.initRead(JavaBinCodec.java:202)
	org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:195)

We find this issue suspiciously similar to one that Jetty fixed in 9.4.44:

jetty/jetty.project#2570 (comment)
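
For context, Solr's Http2SolrClient consumes each shard response through Jetty's InputStreamResponseListener, so the JavaBinCodec frames above are ultimately blocked inside the listener's InputStream. A minimal sketch of that usage pattern (not the actual Http2SolrClient code; the plain HttpClient and the URL here are placeholders) looks roughly like this:

    import java.io.InputStream;
    import java.util.concurrent.TimeUnit;

    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.client.api.Response;
    import org.eclipse.jetty.client.util.InputStreamResponseListener;

    public class ListenerSketch {
        public static void main(String[] args) throws Exception {
            HttpClient client = new HttpClient(); // placeholder; Solr actually uses an HTTP/2 transport
            client.start();

            InputStreamResponseListener listener = new InputStreamResponseListener();
            client.newRequest("http://127.0.0.1:8983/solr/collection1/select?q=*:*")
                  .send(listener);

            // First wait point: blocks until the response headers arrive
            // (InputStreamResponseListener.get, line 214 in the traces below).
            Response response = listener.get(120, TimeUnit.SECONDS);
            System.out.println("status=" + response.getStatus());

            // Second wait point: each read() blocks until onContent() delivers a chunk
            // (InputStreamResponseListener$Input.read, line 318 in the trace above).
            try (InputStream in = listener.getInputStream()) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // in the real code path, JavaBinCodec.unmarshal parses these bytes
                }
            }
            client.stop();
        }
    }

The thread dump above shows the second wait point: the reading thread is parked inside read() because no further content is being delivered.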

@chatman
Collaborator

chatman commented Nov 23, 2021

Another variant where our threads are stuck:

2021-11-08 00:01:06.485 ERROR (pool-2-thread-1) [    ] o.a.s.s.TimeOutPatrol Stack trace for httpShardExecutor-120-thread-2698527-processing-x:H2QQD http:////10.240.17.206:8983//solr//H2QQD_shard16_replica_n16// n:10.240.0.180:8984_solr c:H2QQD rid:18fc88214bd91462d78d82e764a954d5-77f52525a341357d [http:////10.240.17.206:8983//solr//H2QQD_shard16_replica_n16//]: [java.base@11.0.8/jdk.internal.misc.Unsafe.park(Native Method)
	java.base@11.0.8/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
	java.base@11.0.8/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1079)
	java.base@11.0.8/java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1369)
	java.base@11.0.8/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:278)
	org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:214)
	org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:401)
	org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:761)
	org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
	org.apache.solr.handler.component.HttpShardHandler.request(HttpShardHandler.java:80)
	org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:136)
	org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:45)
	java.base@11.0.8/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	java.base@11.0.8/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	java.base@11.0.8/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
	org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)
	org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$199/0x00000017feb29040.run(Unknown Source)
	java.base@11.0.8/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	java.base@11.0.8/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	java.base@11.0.8/java.lang.Thread.run(Thread.java:834)]

^ These threads get stuck in TIMED_WAITING.

@chatman
Collaborator

chatman commented Nov 23, 2021

I ran the unit tests for solr-core on this branch and see no failures beyond those on the release/8.8 branch.

The failures I see on both branches:
[junit4]
[junit4] Tests with failures [seed: 3A53112B710015DB]:
[junit4] - org.apache.solr.metrics.reporters.solr.SolrCloudReportersTest.testDefaultPlugins
[junit4] - org.apache.solr.cloud.LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
[junit4] - org.apache.solr.search.facet.DistributedFacetSimpleRefinementLongTailTest.test

I feel safe merging this PR.

@chatman
Collaborator

chatman commented Nov 23, 2021

Another, unrelated, good reason to upgrade is jetty/jetty.project#6072 (fixed in 9.4.39). This is covered in GHSA-26vr-8j45-3r4w. In Solr 8.9, Jetty was upgraded to a version (9.4.39) where this CVE was fixed, but the hang/stuck bug was only fixed later (9.4.42).

@chatman
Collaborator

chatman commented Nov 23, 2021

Jetty was upgraded to this version in Solr 8.10.1 as per https://issues.apache.org/jira/browse/SOLR-15677. We should do the same.

@hiteshk25

@patsonluk can you please look into this?

@patsonluk

patsonluk commented Nov 23, 2021

TLDR

  1. The hang appears to be triggered by Jetty under certain conditions, and only occasionally.
  2. Similar hangs can be reproduced locally on my machine (both Solr and the load generation run on my MacBook) by issuing requests continuously (even with just 1 thread); after a while (several minutes) the hang appears. It might require a Solr collection with a lot of shards (1000 in our test). The stack traces observed are similar to the previous findings. Test driver used: https://gist.github.com/patsonluk/4e5ee44a649c3cee07ae1cba1962654c (a rough single-threaded sketch follows this list).
  3. Such a hang usually resolves itself at exactly 120 secs (the connection going away is detected and a timeout exception is triggered), but under a multi-threaded load test the timeout might not fire at 120 secs; sometimes it takes much longer before the hanging threads resume.
  4. The reproduced hang (3 threads) is very similar to what was reported - a long hang (> 10 mins) while Solr still appears to be healthy (it can still serve other external requests). The difference is that an exception is observed in this test, but probably not in the originally reported issue?
  5. When the threads hang and later resume, they all seem to resume at the same time, even if they started hanging at different times - this is also very similar to what we observed on prod.
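
The actual driver is the gist linked in point 2; purely as an illustration (hypothetical class and endpoint, not the gist's code), a single-threaded loop along these lines is enough to surface the symptom:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SelectLoop {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint; the test used a collection with ~1000 shards.
            URI uri = URI.create("http://127.0.0.1:8984/solr/fsloadtest/select?q=*%3A*");
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();

            while (true) {
                long start = System.nanoTime();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                // A hang shows up as a request taking far longer than usual,
                // often only returning at the 120 sec timeout mentioned in point 3.
                System.out.println("status=" + response.statusCode() + " took=" + elapsedMs + "ms");
            }
        }
    }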

Long version

A similar issue can be reproduced by issuing simple queries (such as http://127.0.0.1:8984/solr/fsloadtest/select?q=*%3A*) to the QA node repeatedly. My test environment is a single collection with 1000 shards (and 10m docs).

With a single-threaded load generator, the QA node hangs after several minutes at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:318); it eventually recovers on its own by timing out (at 120 secs), which is not exactly the same as what we observed (hangs for longer than 120 secs).

However, with the load generator using 3 threads, the QA node starts hanging on the httpShardExecutor threads one by one (possibly several minutes apart). Eventually all 3 threads from the load generator stop getting any response (> 120 secs). The QA node, however, still appears to be healthy (it remains responsive to manual queries); inspecting the threads shows they are indeed all hanging at InputStreamResponseListener.java:318.

The interesting thing is that, in the "normal" execution path, there are only 2 threads of concern on the Solr side (a toy model of the two hand-off points follows this list):

  1. The httpShardExecutor thread, which locks and waits at 2 points - the countdown latch at InputStreamResponseListener.java:214 and the Object lock at InputStreamResponseListener.java:318.
  2. The Jetty HTTP/2 thread (h2sc..., usually h2sc-121-thread-1 for single-threaded loads), which unlocks those 2 points - InputStreamResponseListener.onHeaders (line 100) releases the countdown latch and InputStreamResponseListener.onContent (line 124) releases the Object lock.
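
As a toy model only (assumed names, NOT Jetty's actual implementation), the two hand-off points behave roughly like a CountDownLatch for the headers plus an Object monitor for the content; if the producer side never runs the "content" step, the consumer stays parked in read() exactly as seen at line 318:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    // Simplified model of the two wait points described above.
    public class TwoWaitPoints {
        private final CountDownLatch headersArrived = new CountDownLatch(1); // ~ InputStreamResponseListener.get (line 214)
        private final Object contentLock = new Object();                     // ~ Input.read (line 318)
        private boolean contentAvailable;

        // Consumer side: roughly what the httpShardExecutor thread does.
        void consume() throws InterruptedException {
            headersArrived.await(120, TimeUnit.SECONDS);      // released by onHeaders()
            synchronized (contentLock) {
                while (!contentAvailable) {
                    contentLock.wait();                       // released by onContent(); hangs if it never runs
                }
            }
        }

        // Producer side: roughly what the Jetty h2sc-* thread does while handling the response.
        void onHeaders() {
            headersArrived.countDown();
        }

        void onContent() {
            synchronized (contentLock) {
                contentAvailable = true;
                contentLock.notifyAll();
            }
        }
    }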

With debugging, it's found that under normal circumstances thread 2 (the Jetty HTTP/2 thread) runs everything from line 100 to line 124 of InputStreamResponseListener within a single task: invokeTask(task) at line 303 of EatWhatYouKill#doProduce, where the task is org.eclipse.jetty.io.ChannelEndPoint$1. The debugger does not advance to the next line of doProduce in between, which means that from line 100 to line 124 of InputStreamResponseListener everything is handled by the same task on the same thread. Again, this is what happens under "normal circumstances", with no hangs.

That means that if thread 1 (httpShardExecutor) is waiting at line 318, the countdown latch (line 214) must already have been released by thread 2 at line 100, and, assuming thread 2 keeps running, it should eventually release the Object lock at line 124 (which did not happen).

While pausing the VM when the QA threads hang, it's found that thread 1 (httpShardExecutor) is indeed hanging at line 318; however, thread 2 (the Jetty HTTP/2 thread, h2sc-121-thread-1) is NOT executing the org.eclipse.jetty.io.ChannelEndPoint$1 task invoked by invokeTask(task) at line 303 of EatWhatYouKill#doProduce. In fact it is not invoking any task at all, but waiting at produceTask() at line 184 of EatWhatYouKill#doProduce. This is different from the "normal circumstances" flow: this time the task somehow exited before reaching InputStreamResponseListener:124, which would have released the Object lock.

This alone is not really a bug, as the Jetty task org.eclipse.jetty.io.ChannelEndPoint$1 itself is non-blocking (as the frame org.eclipse.jetty.util.thread.Invocable.invokeNonBlocking(Invocable.java:69) suggests), which means it may exit (it's hard to find out under which conditions, though) before reaching InputStreamResponseListener:124. The issue, though, is that under this condition it somehow takes a long time before thread h2sc-121-thread-1 goes back to handling org.eclipse.jetty.io.ChannelEndPoint$1, which eventually releases the lock at InputStreamResponseListener:124 - it did happen, but sometimes hours later.

There has to be some weird timing and pattern in the Jetty threads that triggers this condition. In fact, if we run the load generator with 3 threads and get all of them to hang (it might take 10+ minutes before all of them hang), then attach a debugger to the JVM, briefly pause the threads and resume them, all 3 hanging httpShardExecutor threads at InputStreamResponseListener.java:318 will be released by the Jetty threads.

The 3-thread load simulates almost the same behavior we observed on prod, except that when the threads resume, some exceptions are printed, which was NOT observed in the originally reported issue (?):

2021-11-23 18:56:35.191 ERROR (qtp1426725223-3065) [c:fsloadtest  r:127.0.0.1 x:fsloadtest rid:2556] o.a.s.h.RequestHandlerBase org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at null: parsing error
        at org.apache.solr.client.solrj.impl.Http2SolrClient.processErrorsAndResponse(Http2SolrClient.java:698)
        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:412)
        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:761)
        at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
        at org.apache.solr.handler.component.HttpShardHandler.request(HttpShardHandler.java:80)
        at org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:130)
        at org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:41)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.solr.common.SolrException: parsing error
        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:53)
        at org.apache.solr.client.solrj.impl.Http2SolrClient.processErrorsAndResponse(Http2SolrClient.java:696)
        ... 14 more
Caused by: java.nio.channels.AsynchronousCloseException
        at org.eclipse.jetty.http2.client.http.HttpConnectionOverHTTP2.close(HttpConnectionOverHTTP2.java:133)
        at org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2.onClose(HttpClientTransportOverHTTP2.java:170)
        at org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2$SessionListenerPromise.onClose(HttpClientTransportOverHTTP2.java:232)
        at org.eclipse.jetty.http2.api.Session$Listener.onClose(Session.java:206)
        at org.eclipse.jetty.http2.HTTP2Session.notifyClose(HTTP2Session.java:1166)
        at org.eclipse.jetty.http2.HTTP2Session.onClose(HTTP2Session.java:598)
        at org.eclipse.jetty.http2.HTTP2Session.onGoAway(HTTP2Session.java:477)
        at org.eclipse.jetty.http2.parser.Parser$Listener$Wrapper.onGoAway(Parser.java:396)
        at org.eclipse.jetty.http2.parser.BodyParser.notifyGoAway(BodyParser.java:192)
        at org.eclipse.jetty.http2.parser.GoAwayBodyParser.onGoAway(GoAwayBodyParser.java:169)
        at org.eclipse.jetty.http2.parser.GoAwayBodyParser.parse(GoAwayBodyParser.java:139)
        at org.eclipse.jetty.http2.parser.Parser.parseBody(Parser.java:198)
        at org.eclipse.jetty.http2.parser.Parser.parse(Parser.java:127)
        at org.eclipse.jetty.http2.HTTP2Connection$HTTP2Producer.produce(HTTP2Connection.java:261)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
        at org.eclipse.jetty.http2.HTTP2Connection.produce(HTTP2Connection.java:183)
        at org.eclipse.jetty.http2.HTTP2Connection.onFillable(HTTP2Connection.java:138)
        at org.eclipse.jetty.http2.HTTP2Connection$FillableCallback.succeeded(HTTP2Connection.java:361)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
        at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
        at org.eclipse.jetty.util.thread.Invocable.invokeNonBlocking(Invocable.java:69)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.invokeTask(EatWhatYouKill.java:348)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:303)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
        ... 4 more

@patsonluk

From initial testing with 9.4.44.v20210927, the random hangs still happen 😭

@noblepaul
Collaborator Author

We should use a combination of this and #106

@chatman
Collaborator

chatman commented Nov 29, 2021

I've finally managed to reproduce this problem, thanks @patsonluk.
Here's the config: http://sprunge.us/s5u81z
Here's the matchalldocs.json: http://sprunge.us/wb0ttN
Running the above with the ecommerce dataset (download link in the README of github.com/fullstorydev/solr-bench) and the code from the ishan/qa-nodes branch.

It seems this situation happens under heavy query load. On an 8-core (AMD Ryzen 5700G) machine, I was unable to reproduce the problem even after 15-20 minutes with 3 or 8 queries at a time. Increasing this to 16 queries at a time reproduced the problem. This looks like a case of resource exhaustion, with Jetty not doing the right thing in such situations. Seems to me that #106 is a good workaround, since that patch will terminate such queries when the system is under heavy load.

@noblepaul noblepaul closed this Jun 16, 2022
@noblepaul noblepaul deleted the noble/upgrade-jetty branch June 16, 2022 11:27