[Bug] Hang on opening JMX connection #1778

andrewazores · 2023-11-17T14:24:35Z

Current Behavior

Nov 17, 2023 2:17:07 PM org.slf4j.impl.JDK14LoggerAdapter fillCallerData
WARNING: Thread Thread[vert.x-worker-thread-12,5,main] has been blocked for 4491255 ms, time limit is 60000 ms
io.vertx.core.VertxException: Thread blocked
at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method)
at java.base@17.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
at java.base@17.0.9/java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1864)
at java.base@17.0.9/java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3465)
at java.base@17.0.9/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3436)
at java.base@17.0.9/java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1898)
at java.base@17.0.9/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2072)
at app//io.cryostat.net.TargetConnectionManager.executeConnectedTask(TargetConnectionManager.java:168)
at app//io.cryostat.net.web.http.api.v1.TargetRecordingsGetHandler.handleAuthenticated(TargetRecordingsGetHandler.java:124)
at app//io.cryostat.net.web.http.AbstractAuthenticatedRequestHandler.handle(AbstractAuthenticatedRequestHandler.java:102)
at app//io.cryostat.net.web.http.AbstractAuthenticatedRequestHandler.handle(AbstractAuthenticatedRequestHandler.java:72)
at app//io.vertx.ext.web.impl.BlockingHandlerDecorator.lambda$handle$0(BlockingHandlerDecorator.java:48)
at app//io.vertx.ext.web.impl.BlockingHandlerDecorator$$Lambda$1152/0x00007fec00a5fd40.handle(Unknown Source)
at app//io.vertx.core.impl.ContextBase.lambda$executeBlocking$0(ContextBase.java:137)
at app//io.vertx.core.impl.ContextBase$$Lambda$1084/0x00007fec009130b0.handle(Unknown Source)
at app//io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:264)
at app//io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:135)
at app//io.vertx.core.impl.ContextBase$$Lambda$1077/0x00007fec00917210.run(Unknown Source)
at app//io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76)
at app//io.vertx.core.impl.TaskQueue$$Lambda$179/0x00007fec001c6780.run(Unknown Source)
at java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base@17.0.9/java.lang.Thread.run(Thread.java:840)

Expected Behavior

If the JMX connection cannot be opened within the defined timeout, the connection attempt should be aborted.

This stack trace indicates that not only is the connection not aborted, but that the connection is being done directly on a Vertx workerpool thread instead of an additional application-managed thread as it probably should be.

Steps To Reproduce

No response

Environment

Cryostat v2.3.1 on OpenShift

Anything else?

Probably related:

The text was updated successfully, but these errors were encountered:

andrewazores · 2023-11-17T15:35:57Z

ebaron
I think the worst part of this is that it doesn't cause the liveness probe to fail. AppSRE doesn't seem to like us manually needing to restart pods. /health/liveness returns 204, but /health returns 504.

aazores
that's a good observation - /health does a little more in that it also tries to check on the status of the datasource/dashboard/reports sidecars, and therefore it runs on a worker thread from the same pool that is blocked by that original bug. /health/liveness just returns immediately on the vertx event loop thread. so what this means is that the event loop thread is still alive and unblocked, but there are no available unblocked worker threads to dispatch to for more complex requests

andrewazores · 2023-11-17T15:37:21Z

@ebaron it might therefore be useful to force /health/liveness to delegate off to a worker thread, even if it isn't technically necessary, just so that it can also evaluate whether that pool is actually responsive - since most of the actual useful API calls have to go through that layer. That doesn't fix the problem but at least it allows container management systems to better detect this case and perform a container restart.

andrewazores · 2023-11-17T16:51:53Z

Mistakenly closed by the previous PR - that is only a mitigation, it helps detect this case and allow container management systems to restart Cryostat, but does not actually prevent this from happening.

andrewazores added the bug Something isn't working label Nov 17, 2023

ebaron mentioned this issue Nov 17, 2023

Add parameter to allow scaling down of Cryostat RedHatInsights/insights-runtimes-inventory#158

Merged

andrewazores mentioned this issue Nov 17, 2023

fix(liveness): execute on worker pool #1779

Merged

7 tasks

andrewazores closed this as completed in #1779 Nov 17, 2023

andrewazores reopened this Nov 17, 2023

andrewazores mentioned this issue Jan 25, 2024

[Bug] Cannot delete custom targets that no longer point to a JVM #1866

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Hang on opening JMX connection #1778

[Bug] Hang on opening JMX connection #1778

andrewazores commented Nov 17, 2023 •

edited

andrewazores commented Nov 17, 2023

andrewazores commented Nov 17, 2023

andrewazores commented Nov 17, 2023

[Bug] Hang on opening JMX connection #1778

[Bug] Hang on opening JMX connection #1778

Comments

andrewazores commented Nov 17, 2023 • edited

Current Behavior

Expected Behavior

Steps To Reproduce

Environment

Anything else?

andrewazores commented Nov 17, 2023

andrewazores commented Nov 17, 2023

andrewazores commented Nov 17, 2023

andrewazores commented Nov 17, 2023 •

edited