
Jetty CPU utilization spike using Unix Domain Sockets handling HTTP2 connections #5090

Closed
janzantinge opened this issue Jul 28, 2020 · 3 comments
Labels: Help Wanted, Third Party (Issues with third party libraries or projects)

Comments

janzantinge commented Jul 28, 2020

Jetty version
9.4.28.v20200408

dependencies:

org.eclipse.jetty:jetty-server:9.4.28.v20200408 (*)
org.eclipse.jetty:jetty-unixsocket:9.4.28.v20200408
+--- org.eclipse.jetty:jetty-server:9.4.28.v20200408 (*)
\--- com.github.jnr:jnr-unixsocket:0.24 -> 0.28
      +--- com.github.jnr:jnr-ffi:2.1.12
      |    +--- com.github.jnr:jffi:1.2.23
      |    +--- org.ow2.asm:asm:7.1
      |    +--- org.ow2.asm:asm-commons:7.1
      |    |    +--- org.ow2.asm:asm:7.1
      |    |    +--- org.ow2.asm:asm-tree:7.1
      |    |    |    \--- org.ow2.asm:asm:7.1
      |    |    \--- org.ow2.asm:asm-analysis:7.1
      |    |         \--- org.ow2.asm:asm-tree:7.1 (*)
      |    +--- org.ow2.asm:asm-analysis:7.1 (*)
      |    +--- org.ow2.asm:asm-tree:7.1 (*)
      |    +--- org.ow2.asm:asm-util:7.1
      |    |    +--- org.ow2.asm:asm:7.1
      |    |    +--- org.ow2.asm:asm-tree:7.1 (*)
      |    |    \--- org.ow2.asm:asm-analysis:7.1 (*)
      |    +--- com.github.jnr:jnr-a64asm:1.0.0
      |    \--- com.github.jnr:jnr-x86asm:1.0.2
      +--- com.github.jnr:jnr-constants:0.9.15
      +--- com.github.jnr:jnr-enxio:0.25
      |    +--- com.github.jnr:jnr-constants:0.9.15
      |    \--- com.github.jnr:jnr-ffi:2.1.12 (*)
      \--- com.github.jnr:jnr-posix:3.0.54
           +--- com.github.jnr:jnr-ffi:2.1.12 (*)
           \--- com.github.jnr:jnr-constants:0.9.15

Java version
11.0.7

OS type/version
Linux 4.14.181-142.260.amzn2.x86_64 (EKS pod container)

Description
Jetty appears to be using a large amount of CPU (150% of a 1-CPU Kubernetes allocation with a 2-CPU limit; normal CPU utilization is around 40-50%) when handling requests received over a Unix Domain Socket. It appears as if a single thread is stuck in a tight loop. The number of active connections increases very slowly (from 2 to 20 over the course of an hour), possibly suggesting that an existing connection gets into a bad state once in a while. The server is otherwise still able to handle requests just fine and appears to be operating normally.
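
For context, the connector wiring is roughly the following (an approximate sketch, not our exact code; the socket path and handler are placeholders):

import org.eclipse.jetty.http2.server.HTTP2CServerConnectionFactory;
import org.eclipse.jetty.server.HttpConfiguration;
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.DefaultHandler;
import org.eclipse.jetty.unixsocket.UnixSocketConnector;

// Approximate sketch of the affected setup: HTTP/1.1 + cleartext HTTP/2 (h2c)
// served over a Unix Domain Socket via jetty-unixsocket (which uses jnr-enxio).
public class UnixSocketH2cServer
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();
        HttpConfiguration config = new HttpConfiguration();
        UnixSocketConnector connector = new UnixSocketConnector(server,
                new HttpConnectionFactory(config),
                new HTTP2CServerConnectionFactory(config));
        connector.setUnixSocket("/tmp/jetty.sock");   // placeholder path
        server.addConnector(connector);
        server.setHandler(new DefaultHandler());      // placeholder handler
        server.start();
        server.join();
    }
}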

We've traced this down to the jnr-enxio PollSelector, where the libc poll call appears to return immediately every time. The theory is that the EatWhatYouKill producer/consumer strategy is not using the PollSelector in the way the PollSelector expects: a race condition seems to occur in which a file descriptor with pending data is never handled by Jetty, and one or more FDs that always have data ready to be handled would explain why calls to poll never block.
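
To illustrate the theory with plain java.nio (not Jetty or jnr-enxio code): with a level-triggered selector, a ready channel that is never drained makes every select() call return immediately, which matches the tight loop we see:

import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Illustration only: a level-triggered selector whose ready channel is
// never drained stops blocking in select().
public class BusySelectDemo
{
    public static void main(String[] args) throws Exception
    {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        pipe.source().register(selector, SelectionKey.OP_READ);

        // Make the source readable, then never read from it.
        pipe.sink().write(ByteBuffer.wrap(new byte[]{ 1 }));

        for (int i = 0; i < 5; i++)
        {
            long begin = System.nanoTime();
            int selected = selector.select();   // returns immediately every time
            selector.selectedKeys().clear();    // clearing the selected set does not
                                                // help: the FD is still readable
            System.out.printf("select() returned %d key(s) after %d us%n",
                    selected, (System.nanoTime() - begin) / 1_000);
        }
    }
}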

The theory comes from data observed via Java Flight Recorder (a 2-minute profiling run on an affected container). The recording contains a large number of samples like this (the number on each line is the number of samples in which that trace appears):

HashMap$Node java.util.HashMap.newNode(int, Object, Object, HashMap$Node)	32677
   Object java.util.HashMap.putVal(int, Object, Object, boolean, boolean)	32677
   Object java.util.HashMap.put(Object, Object)	32629
      boolean java.util.HashSet.add(Object)	31988
         int jnr.enxio.channels.PollSelector.poll(long)	31876
            int jnr.enxio.channels.PollSelector.select()	31876
            boolean org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select()	31876
            Runnable org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce()	31876
            Runnable org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask()	31876
            boolean org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(boolean)	31876
            void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(boolean)	31876
            void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run()	31876
            void org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run()	31876
            void org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Runnable)	31876
            void org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run()	31876
            void java.lang.Thread.run()	31876

There were also a bunch of samples that hit these stack traces:

Iterator java.util.HashMap$KeySet.iterator()	9192
   Iterator java.util.HashSet.iterator()	9192
   boolean org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select()	3193
      Runnable org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce()	3193
      Runnable org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask()	3193
      boolean org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(boolean)	3193
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(boolean)	3193
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run()	3193
      void org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run()	3193
      void org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Runnable)	3193
      void org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run()	3193
      void java.lang.Thread.run()	3193
   int jnr.enxio.channels.PollSelector.poll(long)	3123
      int jnr.enxio.channels.PollSelector.select()	3123
      boolean org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select()	3123
      Runnable org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce()	3123
      Runnable org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask()	3123
      boolean org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(boolean)	3123
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(boolean)	3123
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run()	3123
      void org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run()	3123
      void org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Runnable)	3123
      void org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run()	3123
      void java.lang.Thread.run()	3123
   void org.eclipse.jetty.io.ManagedSelector$SelectorProducer.updateKeys()	2726
      Runnable org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce()	2726
      Runnable org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask()	2726
      boolean org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(boolean)	2726
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(boolean)	2726
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run()	2726
      void org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run()	2726
      void org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Runnable)	2726
      void org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run()	2726
      void java.lang.Thread.run()	2726

The stack traces above account for the largest share of the samples taken (excluding calls to void jdk.internal.misc.Unsafe.park(boolean, long), which appeared in 21996 samples).

Let me know if you need more information.

sbordet (Contributor) commented Jul 28, 2020

Unix Domain Socket support is experimental and not really recommended - JNR may not be that stable.

The fact that Jetty NIO continuously selects may mean that the JNR implementation incorrectly reports that there is an event for a socket when in reality there is none.

If you can reproduce this, DEBUG logs would help us understand what is happening (for example with the logging sketch below).

Unix Domain Sockets are ok if you want to experiment - be prepared to bleed.
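
For example, with Jetty 9.4's default StdErrLog backend (a sketch assuming no slf4j binding is on the classpath), the relevant packages can be raised to DEBUG like this, or with the equivalent -D JVM flags:

// Sketch: the properties must be set before any Jetty class initializes logging.
public class EnableJettyDebugLogging
{
    public static void main(String[] args)
    {
        System.setProperty("org.eclipse.jetty.io.LEVEL", "DEBUG");                    // ManagedSelector / endpoints
        System.setProperty("org.eclipse.jetty.util.thread.strategy.LEVEL", "DEBUG");  // EatWhatYouKill
        // ... then build and start the Server as usual, or pass the same values
        //     as -D JVM flags instead.
    }
}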

joakime (Contributor) commented Jul 28, 2020

See also other past issues about jnr / unixsockets.

https://github.com/eclipse/jetty.project/issues?q=is%3Aissue+jnr

sbordet (Contributor) commented Aug 4, 2021

Closing, as JNR is being phased out in favor of Java 16's Unix-Domain socket support.
See also #6043.
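
For reference, the JDK 16+ API from JEP 380 that replaces the JNR dependency looks roughly like this (plain JDK code, not Jetty's integration; the socket path is a placeholder):

import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.file.Path;

// Sketch of binding a server channel to a Unix-Domain socket with the JDK API.
public class JdkUnixSocketSketch
{
    public static void main(String[] args) throws Exception
    {
        Path socketPath = Path.of("/tmp/jetty.sock");   // placeholder
        UnixDomainSocketAddress address = UnixDomainSocketAddress.of(socketPath);
        try (ServerSocketChannel channel = ServerSocketChannel.open(StandardProtocolFamily.UNIX))
        {
            channel.bind(address);
            System.out.println("Listening on " + address);
        }
    }
}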

sbordet closed this as completed Aug 4, 2021