Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Hang on opening JMX connection #1778

Open
andrewazores opened this issue Nov 17, 2023 · 3 comments
Open

[Bug] Hang on opening JMX connection #1778

andrewazores opened this issue Nov 17, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@andrewazores
Copy link
Member

andrewazores commented Nov 17, 2023

Current Behavior

Nov 17, 2023 2:17:07 PM org.slf4j.impl.JDK14LoggerAdapter fillCallerData
WARNING: Thread Thread[vert.x-worker-thread-12,5,main] has been blocked for 4491255 ms, time limit is 60000 ms
io.vertx.core.VertxException: Thread blocked
at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method)
at java.base@17.0.9/java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
at java.base@17.0.9/java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1864)
at java.base@17.0.9/java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3465)
at java.base@17.0.9/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3436)
at java.base@17.0.9/java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1898)
at java.base@17.0.9/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2072)
at app//io.cryostat.net.TargetConnectionManager.executeConnectedTask(TargetConnectionManager.java:168)
at app//io.cryostat.net.web.http.api.v1.TargetRecordingsGetHandler.handleAuthenticated(TargetRecordingsGetHandler.java:124)
at app//io.cryostat.net.web.http.AbstractAuthenticatedRequestHandler.handle(AbstractAuthenticatedRequestHandler.java:102)
at app//io.cryostat.net.web.http.AbstractAuthenticatedRequestHandler.handle(AbstractAuthenticatedRequestHandler.java:72)
at app//io.vertx.ext.web.impl.BlockingHandlerDecorator.lambda$handle$0(BlockingHandlerDecorator.java:48)
at app//io.vertx.ext.web.impl.BlockingHandlerDecorator$$Lambda$1152/0x00007fec00a5fd40.handle(Unknown Source)
at app//io.vertx.core.impl.ContextBase.lambda$executeBlocking$0(ContextBase.java:137)
at app//io.vertx.core.impl.ContextBase$$Lambda$1084/0x00007fec009130b0.handle(Unknown Source)
at app//io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:264)
at app//io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:135)
at app//io.vertx.core.impl.ContextBase$$Lambda$1077/0x00007fec00917210.run(Unknown Source)
at app//io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76)
at app//io.vertx.core.impl.TaskQueue$$Lambda$179/0x00007fec001c6780.run(Unknown Source)
at java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base@17.0.9/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base@17.0.9/java.lang.Thread.run(Thread.java:840)

Expected Behavior

If the JMX connection cannot be opened within the defined timeout, the connection attempt should be aborted.

This stack trace indicates that not only is the connection not aborted, but that the connection is being done directly on a Vertx workerpool thread instead of an additional application-managed thread as it probably should be.

Steps To Reproduce

No response

Environment

  • Cryostat v2.3.1 on OpenShift

Anything else?

Probably related:

@andrewazores
Copy link
Member Author

ebaron
I think the worst part of this is that it doesn't cause the liveness probe to fail. AppSRE doesn't seem to like us manually needing to restart pods. /health/liveness returns 204, but /health returns 504.

aazores
that's a good observation - /health does a little more in that it also tries to check on the status of the datasource/dashboard/reports sidecars, and therefore it runs on a worker thread from the same pool that is blocked by that original bug. /health/liveness just returns immediately on the vertx event loop thread. so what this means is that the event loop thread is still alive and unblocked, but there are no available unblocked worker threads to dispatch to for more complex requests

@andrewazores
Copy link
Member Author

@ebaron it might therefore be useful to force /health/liveness to delegate off to a worker thread, even if it isn't technically necessary, just so that it can also evaluate whether that pool is actually responsive - since most of the actual useful API calls have to go through that layer. That doesn't fix the problem but at least it allows container management systems to better detect this case and perform a container restart.

@andrewazores
Copy link
Member Author

Mistakenly closed by the previous PR - that is only a mitigation, it helps detect this case and allow container management systems to restart Cryostat, but does not actually prevent this from happening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant