
Jetty CPU utilization spike using Unix Domain Sockets handling HTTP2 connections #5090

Closed
janzantinge opened this issue Jul 28, 2020 · 3 comments
Labels: Help Wanted, Third Party (Issues with third party libraries or projects)

Comments

janzantinge commented Jul 28, 2020

Jetty version
9.4.28.v20200408

dependencies:

org.eclipse.jetty:jetty-server:9.4.28.v20200408 (*)
org.eclipse.jetty:jetty-unixsocket:9.4.28.v20200408
+--- org.eclipse.jetty:jetty-server:9.4.28.v20200408 (*)
\--- com.github.jnr:jnr-unixsocket:0.24 -> 0.28
      +--- com.github.jnr:jnr-ffi:2.1.12
      |    +--- com.github.jnr:jffi:1.2.23
      |    +--- org.ow2.asm:asm:7.1
      |    +--- org.ow2.asm:asm-commons:7.1
      |    |    +--- org.ow2.asm:asm:7.1
      |    |    +--- org.ow2.asm:asm-tree:7.1
      |    |    |    \--- org.ow2.asm:asm:7.1
      |    |    \--- org.ow2.asm:asm-analysis:7.1
      |    |         \--- org.ow2.asm:asm-tree:7.1 (*)
      |    +--- org.ow2.asm:asm-analysis:7.1 (*)
      |    +--- org.ow2.asm:asm-tree:7.1 (*)
      |    +--- org.ow2.asm:asm-util:7.1
      |    |    +--- org.ow2.asm:asm:7.1
      |    |    +--- org.ow2.asm:asm-tree:7.1 (*)
      |    |    \--- org.ow2.asm:asm-analysis:7.1 (*)
      |    +--- com.github.jnr:jnr-a64asm:1.0.0
      |    \--- com.github.jnr:jnr-x86asm:1.0.2
      +--- com.github.jnr:jnr-constants:0.9.15
      +--- com.github.jnr:jnr-enxio:0.25
      |    +--- com.github.jnr:jnr-constants:0.9.15
      |    \--- com.github.jnr:jnr-ffi:2.1.12 (*)
      \--- com.github.jnr:jnr-posix:3.0.54
           +--- com.github.jnr:jnr-ffi:2.1.12 (*)
           \--- com.github.jnr:jnr-constants:0.9.15

Java version
11.0.7

OS type/version
Linux 4.14.181-142.260.amzn2.x86_64 (EKS pod container)

Description
Jetty appears to be using a large amount of CPU (150% of a 1-CPU Kubernetes allocation with a 2-CPU limit; normal CPU utilization is around 40-50%) when handling requests received over a Unix Domain Socket. It appears as if a single thread is stuck in a tight loop. The number of active connections increases very slowly (from 2 to 20 over the course of an hour), possibly suggesting that an existing connection gets into a bad state once in a while. The server is otherwise still able to handle requests just fine and appears to be operating normally.
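
For context, the connector wiring is roughly the following (an approximate sketch, not our exact code; the socket path and handler are placeholders):

import org.eclipse.jetty.http2.server.HTTP2CServerConnectionFactory;
import org.eclipse.jetty.server.HttpConfiguration;
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.DefaultHandler;
import org.eclipse.jetty.unixsocket.UnixSocketConnector;

// Approximate sketch of the affected setup: HTTP/1.1 + cleartext HTTP/2 (h2c)
// served over a Unix Domain Socket via jetty-unixsocket (which uses jnr-enxio).
public class UnixSocketH2cServer
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();
        HttpConfiguration config = new HttpConfiguration();
        UnixSocketConnector connector = new UnixSocketConnector(server,
                new HttpConnectionFactory(config),
                new HTTP2CServerConnectionFactory(config));
        connector.setUnixSocket("/tmp/jetty.sock");   // placeholder path
        server.addConnector(connector);
        server.setHandler(new DefaultHandler());      // placeholder handler
        server.start();
        server.join();
    }
}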

We've traced this down to the jnr-enxio PollSelector, where the libc poll call appears to return immediately every time. The theory is that the EatWhatYouKill producer/consumer strategy is not using the PollSelector in the way the PollSelector expects: a race condition seems to occur in which a file descriptor with pending data is never handled by Jetty, and one or more FDs that always have data ready to be handled would explain why calls to poll never block.
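
To illustrate the theory with plain java.nio (not Jetty or jnr-enxio code): with a level-triggered selector, a ready channel that is never drained makes every select() call return immediately, which matches the tight loop we see:

import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Illustration only: a level-triggered selector whose ready channel is
// never drained stops blocking in select().
public class BusySelectDemo
{
    public static void main(String[] args) throws Exception
    {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        pipe.source().register(selector, SelectionKey.OP_READ);

        // Make the source readable, then never read from it.
        pipe.sink().write(ByteBuffer.wrap(new byte[]{ 1 }));

        for (int i = 0; i < 5; i++)
        {
            long begin = System.nanoTime();
            int selected = selector.select();   // returns immediately every time
            selector.selectedKeys().clear();    // clearing the selected set does not
                                                // help: the FD is still readable
            System.out.printf("select() returned %d key(s) after %d us%n",
                    selected, (System.nanoTime() - begin) / 1_000);
        }
    }
}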

The theory comes from data observed via Java Flight Recorder (a 2-minute profiling run on an affected container). The recording contains a large number of samples like this (the number on each line is the number of samples in which that trace appears):

HashMap$Node java.util.HashMap.newNode(int, Object, Object, HashMap$Node)	32677
   Object java.util.HashMap.putVal(int, Object, Object, boolean, boolean)	32677
   Object java.util.HashMap.put(Object, Object)	32629
      boolean java.util.HashSet.add(Object)	31988
         int jnr.enxio.channels.PollSelector.poll(long)	31876
            int jnr.enxio.channels.PollSelector.select()	31876
            boolean org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select()	31876
            Runnable org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce()	31876
            Runnable org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask()	31876
            boolean org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(boolean)	31876
            void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(boolean)	31876
            void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run()	31876
            void org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run()	31876
            void org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Runnable)	31876
            void org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run()	31876
            void java.lang.Thread.run()	31876

There were also a bunch of samples that hit these stack traces:

Iterator java.util.HashMap$KeySet.iterator()	9192
   Iterator java.util.HashSet.iterator()	9192
   boolean org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select()	3193
      Runnable org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce()	3193
      Runnable org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask()	3193
      boolean org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(boolean)	3193
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(boolean)	3193
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run()	3193
      void org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run()	3193
      void org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Runnable)	3193
      void org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run()	3193
      void java.lang.Thread.run()	3193
   int jnr.enxio.channels.PollSelector.poll(long)	3123
      int jnr.enxio.channels.PollSelector.select()	3123
      boolean org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select()	3123
      Runnable org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce()	3123
      Runnable org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask()	3123
      boolean org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(boolean)	3123
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(boolean)	3123
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run()	3123
      void org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run()	3123
      void org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Runnable)	3123
      void org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run()	3123
      void java.lang.Thread.run()	3123
   void org.eclipse.jetty.io.ManagedSelector$SelectorProducer.updateKeys()	2726
      Runnable org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce()	2726
      Runnable org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask()	2726
      boolean org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(boolean)	2726
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(boolean)	2726
      void org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run()	2726
      void org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run()	2726
      void org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(Runnable)	2726
      void org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run()	2726
      void java.lang.Thread.run()	2726

The stack traces above account for the largest share of the samples taken (excluding calls to void jdk.internal.misc.Unsafe.park(boolean, long), which appeared in 21996 samples).

Let me know if you need more information.

sbordet (Contributor) commented Jul 28, 2020

Unix Domain Socket support is experimental and not really recommended - JNR may not be that stable.

The fact that Jetty NIO continuously selects may mean that the JNR implementation incorrectly reports that there is an event for a socket when in reality there is none.

If you can reproduce this, DEBUG logs would help us understand what is happening (for example with the logging sketch below).

Unix Domain Sockets are ok if you want to experiment - be prepared to bleed.
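
For example, with Jetty 9.4's default StdErrLog backend (a sketch assuming no slf4j binding is on the classpath), the relevant packages can be raised to DEBUG like this, or with the equivalent -D JVM flags:

// Sketch: the properties must be set before any Jetty class initializes logging.
public class EnableJettyDebugLogging
{
    public static void main(String[] args)
    {
        System.setProperty("org.eclipse.jetty.io.LEVEL", "DEBUG");                    // ManagedSelector / endpoints
        System.setProperty("org.eclipse.jetty.util.thread.strategy.LEVEL", "DEBUG");  // EatWhatYouKill
        // ... then build and start the Server as usual, or pass the same values
        //     as -D JVM flags instead.
    }
}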

joakime (Contributor) commented Jul 28, 2020

See also other past issues about jnr / unixsockets.

https://github.com/eclipse/jetty.project/issues?q=is%3Aissue+jnr

sbordet (Contributor) commented Aug 4, 2021

Closing, as JNR is being phased out in favor of Java 16's Unix-Domain socket support.
See also #6043.
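
For reference, the JDK 16+ API from JEP 380 that replaces the JNR dependency looks roughly like this (plain JDK code, not Jetty's integration; the socket path is a placeholder):

import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.file.Path;

// Sketch of binding a server channel to a Unix-Domain socket with the JDK API.
public class JdkUnixSocketSketch
{
    public static void main(String[] args) throws Exception
    {
        Path socketPath = Path.of("/tmp/jetty.sock");   // placeholder
        UnixDomainSocketAddress address = UnixDomainSocketAddress.of(socketPath);
        try (ServerSocketChannel channel = ServerSocketChannel.open(StandardProtocolFamily.UNIX))
        {
            channel.bind(address);
            System.out.println("Listening on " + address);
        }
    }
}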

sbordet closed this as completed Aug 4, 2021