InMemoryWebSessionStore indirectly causing infinite loop inside tomcat-native OpenSSL under load #26407

Closed
philsttr opened this issue Jan 19, 2021 · 45 comments
Labels
status: superseded An issue that has been superseded by another

Comments

@philsttr

philsttr commented Jan 19, 2021

This is a fun one....

When load testing a new app using...

  • Spring Boot 2.4.2
  • WebFlux 5.3.3
  • Spring Security 5.4.2
  • Tomcat 9.0.41
  • tomcat-native 1.2.25
  • APR 1.6.5
  • OpenSSL 1.1.1f

...CPU utilization will max out, and stay maxed out even after the load test completes.

When investigating, I found that threads in the global boundedElastic Scheduler are consuming the entire CPU, as seen in top (broken out by threads)...

top - 17:56:42 up 12 min,  0 users,  load average: 0.61, 0.24, 0.15
Threads: 112 total,   3 running, 109 sleeping,   0 stopped,   0 zombie
%Cpu(s): 53.8 us,  0.3 sy,  0.0 ni, 45.7 id,  0.0 wa,  0.0 hi,  0.1 si,  0.1 st
MiB Mem :  11852.9 total,   8478.8 free,   1267.9 used,   2106.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  10278.6 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  212 bogus     20   0 6635460 493008  18064 R  99.9   4.1   0:07.44 boundedElastic-  <--- CPU maxed out
  170 bogus     20   0 6635460 493008  18064 R  99.7   4.1   0:21.02 boundedElastic-  <--- CPU maxed out
   17 bogus     20   0 6635460 493008  18064 S   9.6   4.1   0:26.19 C2 CompilerThre
  235 bogus     20   0 6635460 493008  18064 S   1.3   4.1   0:00.13 boundedElastic-
   18 bogus     20   0 6635460 493008  18064 S   0.7   4.1   0:04.61 C1 CompilerThre
  234 bogus     20   0 6635460 493008  18064 S   0.7   4.1   0:00.79 boundedElastic-
   85 bogus     20   0    9416   2340   1468 R   0.7   0.0   0:00.43 top
   36 bogus     20   0 6635460 493008  18064 S   0.3   4.1   0:00.39 https-openssl-n
   47 bogus     20   0 6635460 493008  18064 S   0.3   4.1   0:01.70 https-openssl-n
  166 bogus     20   0 6635460 493008  18064 S   0.3   4.1   0:00.77 parallel-4
  167 bogus     20   0 6635460 493008  18064 S   0.3   4.1   0:00.15 https-openssl-n
  175 bogus     20   0 6635460 493008  18064 S   0.3   4.1   0:00.11 https-openssl-n
  179 bogus     20   0 6635460 493008  18064 S   0.3   4.1   0:00.09 https-openssl-n
  184 bogus     20   0 6635460 493008  18064 S   0.3   4.1   0:00.11 https-openssl-n
  192 bogus     20   0 6635460 493008  18064 S   0.3   4.1   0:00.09 https-openssl-n
  204 bogus     20   0 6635460 493008  18064 S   0.3   4.1   0:00.09 https-openssl-n
  210 bogus     20   0 6635460 493008  18064 S   0.3   4.1   0:00.10 https-openssl-n
...

Taking a stackdump of the process reveals the two threads are inside tomcat's OpenSSLEngine...
(note that nid=0xd4 correlates to PID 212 above)

"boundedElastic-8" #86 daemon prio=5 os_prio=0 cpu=128128.31ms elapsed=215.53s allocated=28746K defined_classes=1 tid=0x00007fee00035800 nid=0xd4 runnable  [0x00007fed91cbe000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.tomcat.util.net.openssl.OpenSSLEngine.unwrap(OpenSSLEngine.java:603)
	- locked <0x00000007576d7df8> (a org.apache.tomcat.util.net.openssl.OpenSSLEngine)
	at javax.net.ssl.SSLEngine.unwrap(java.base@11.0.9.1/SSLEngine.java:637)
	at org.apache.tomcat.util.net.SecureNioChannel.read(SecureNioChannel.java:617)
	at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.fillReadBuffer(NioEndpoint.java:1229)
	at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.read(NioEndpoint.java:1141)
	at org.apache.coyote.http11.Http11InputBuffer.fill(Http11InputBuffer.java:795)
	at org.apache.coyote.http11.Http11InputBuffer.available(Http11InputBuffer.java:675)
	at org.apache.coyote.http11.Http11Processor.available(Http11Processor.java:1201)
	at org.apache.coyote.AbstractProcessor.isReadyForRead(AbstractProcessor.java:838)
	at org.apache.coyote.AbstractProcessor.action(AbstractProcessor.java:577)
	at org.apache.coyote.Request.action(Request.java:432)
	at org.apache.catalina.connector.InputBuffer.isReady(InputBuffer.java:305)
	at org.apache.catalina.connector.CoyoteInputStream.isReady(CoyoteInputStream.java:201)
	at org.springframework.http.server.reactive.ServletServerHttpRequest$RequestBodyPublisher.checkOnDataAvailable(ServletServerHttpRequest.java:295)
	at org.springframework.http.server.reactive.AbstractListenerReadPublisher.changeToDemandState(AbstractListenerReadPublisher.java:222)
	at org.springframework.http.server.reactive.AbstractListenerReadPublisher.access$1000(AbstractListenerReadPublisher.java:48)
	at org.springframework.http.server.reactive.AbstractListenerReadPublisher$State$2.request(AbstractListenerReadPublisher.java:333)
	at org.springframework.http.server.reactive.AbstractListenerReadPublisher$ReadSubscription.request(AbstractListenerReadPublisher.java:260)
	...

In a debug session, I discovered that an infinite loop is executing in tomcat's OpenSSLEngine.unwrap where:

  • pendingApp = 2
  • idx = 1
  • endOffset = 1
  • capacity = 16384

I would have expected the OpenSSL I/O code to execute on one of the https-openssl-nio-* threads, not the boundedElastic Scheduler. Therefore, I started investigating why this code was executing on the boundedElastic Scheduler.

After more debugging I narrowed it down to InMemoryWebSessionStore.createWebSession().
This is the only location in this particular app that uses the boundedElastic Scheduler.

The WebSession is being created because Spring Security's WebSessionServerRequestCache is being used, which persists requests in the WebSession.

If I disable the request cache (which removes the usage of WebSession, which removes the call to InMemoryWebSessionStore.createWebSession(), which removes usage of boundedElastic), then all I/O is performed on the https-openssl-nio-* threads, and the infinite loop does not occur.

I haven't fully investigated why the infinite loop occurs, but I assume there is a thread-safety bug somewhere in tomcat's OpenSSLEngine. (Either that, or it was never intended to be used from multiple threads.) Having said that, I don't think that the I/O should be occurring on the boundedElastic thread, so I did not investigate further.

In other words, in my opinion, using InMemoryWebSessionStore should not cause the OpenSSL I/O to occur on a boundedElastic thread.
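For illustration, a minimal Reactor sketch of the mechanism being described; the names and structure are illustrative, not the actual InMemoryWebSessionStore source. A subscribeOn(Schedulers.boundedElastic()) hop for session-id generation means every downstream step, including the one that eventually calls into ServletInputStream and OpenSSLEngine.unwrap(), continues on that worker rather than on a Tomcat thread.

import java.util.UUID;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class SchedulerHopDemo {

    public static void main(String[] args) throws InterruptedException {
        // Roughly the session-store pattern: id generation (which can block
        // on SecureRandom) is pushed onto boundedElastic so it does not
        // block the caller's thread.
        Mono<String> session = Mono.fromCallable(() -> UUID.randomUUID().toString())
                .subscribeOn(Schedulers.boundedElastic());

        // Whatever is composed after the session step continues on the
        // boundedElastic worker that emitted the value -- in the real app,
        // that is the chain which ends in request-body reads.
        session.map(id -> "downstream step runs on " + Thread.currentThread().getName())
                .subscribe(System.out::println);

        Thread.sleep(500); // demo only: give the async pipeline time to run
    }
}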

I have attached a simple application that can be used to reproduce the problem.
After extracting, use docker-compose up to build and start a container with the Spring Boot app with the above configuration.
Sending a lot of load (>= 2000 calls per second) to the /echo endpoint will reproduce the infinite loop.
However, you can see OpenSSL I/O occurring on the boundedElastic threads with any amount of load.

@philsttr
Author

I just found spring-projects/spring-security#9200, which also discusses threading concerns when using InMemoryWebSessionStore

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Jan 19, 2021
@rstoyanchev
Contributor

rstoyanchev commented Jan 20, 2021

The use of Schedulers.boundedElastic() was introduced in #24027 to avoid blocking in UUID#randomUUID, as reported by BlockHound. We could make it conditional on Schedulers#isInNonBlockingThread(), which is false on Tomcat, so it wouldn't cause BlockHound violations; but since Tomcat uses a limited number of threads for NIO, the issue with potential blocking remains even if BlockHound no longer reports it as a violation.

In other words we have to get off Tomcat's thread and there is no way to get back on it that I can see, so one way or another this should probably be looked at in Tomcat, assuming the loop is in Tomcat code.

At best we could switch back to a thread marked as "non-blocking" via Schedulers.parallel(), which would help spring-projects/spring-security#9200 by not masking other issues with blocking, but it probably won't help with the issue here.
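A hedged sketch of the conditional idea mentioned above, evaluated at subscription time via Mono.defer(); this is an illustration of the trade-off being discussed, not the change that was actually made.

import java.util.UUID;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public final class SessionIdMonos {

    // Hypothetical helper: only hop to boundedElastic when the subscriber is
    // on a thread marked non-blocking (e.g. a Netty event loop). On Tomcat
    // NIO threads, which are not marked non-blocking, the id is generated in
    // place -- avoiding the thread switch but reintroducing potential blocking.
    static Mono<String> sessionId() {
        return Mono.defer(() -> {
            Mono<String> id = Mono.fromCallable(() -> UUID.randomUUID().toString());
            return Schedulers.isInNonBlockingThread()
                    ? id.subscribeOn(Schedulers.boundedElastic())
                    : id;
        });
    }
}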

@philsttr
Author

Yeah, it's a tricky situation since the JVM doesn't have a non-blocking way to retrieve random data.

I feel like the infinite loop is a pretty serious problem, since Spring Security + WebFlux + Tomcat + OpenSSL is a fairly common (I think?) combination, and it easily occurs "out-of-the-box".

I agree that we should probably get the tomcat folks to take a look. I just found https://bz.apache.org/bugzilla/show_bug.cgi?id=64771, which appears to be the same problem. Unfortunately it is currently closed due to inactivity. I'll comment there.

@markt-asf

I'm unable to recreate this issue with the provided sample application. I'm running the app locally rather than via Docker, I'm using Tomcat Native 1.2.26 rather than 1.2.25 (that should not impact this), and my local development build of OpenSSL 3.0.x. Load is running at > 6k req/s.
Threading issues are notoriously susceptible to small timing differences. I'll see if I can get anywhere via code inspection. I might also try with the OpenSSL 1.1.1 branch. Any suggestions on how to improve the reliability of the test case?
I do see occasional errors such as this:
2021-01-21 12:09:19.878 ERROR 25774 --- [nio-8443-exec-1] o.s.w.s.adapter.HttpWebHandlerAdapter : [63c84439] Error [java.io.IOException: The current thread was interrupted] for HTTP POST "/echo", but ServerHttpResponse already committed (200 OK)
That does make me wonder whether the application and/or the libraries it is using are following the requirements regarding thread safety set out in section 2.3.3.4 of the Servlet spec.

@markt-asf

A code review strongly suggests that the root cause is the non-container thread attempting to read from the socket after the container has completed the request and recycled the various objects. This appears to be consistent with the error message in my previous comment.
One of the effects of recycling the request is that the buffer associated with the Http11InputBuffer has its limit and position set to zero. What I think is happening is (sketched in code after the list):

  • at the start of unwrap() the single ByteBuffer is empty (pos=0, limit=16384)
  • the non-container thread records the capacity at line 517 as 16384
  • the container recycles the connection (capacity drops to zero but non-container thread is unaware of the change)
  • non-container thread enters an infinite loop around 569-606 due to the inconsistent state
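To make the race concrete, here is a deliberately simplified, self-contained illustration of the inconsistent state described in the list above. It is not Tomcat's code; the variable names and values mirror the debug observations (pendingApp = 2, capacity = 16384).

import java.nio.ByteBuffer;

public class StaleCapacityLoop {

    public static void main(String[] args) {
        ByteBuffer dst = ByteBuffer.allocate(16384);

        // Non-container thread: records the capacity once, before looping.
        int capacity = dst.capacity();  // 16384, never re-checked
        int pendingApp = 2;             // OpenSSL reports pending plaintext

        // Container thread (conceptually concurrent): recycles the buffer.
        dst.position(0).limit(0);       // usable space is now zero

        // The loop believes there is room (capacity > 0) and data pending
        // (pendingApp > 0), but every transfer moves 0 bytes, so neither
        // variable ever changes.
        while (pendingApp > 0 && capacity > 0) {
            int written = Math.min(dst.remaining(), pendingApp); // always 0
            pendingApp -= written;
            if (written == 0) {
                // Demo-only escape hatch: without re-reading the live buffer
                // state, the loop cannot make progress and would spin forever.
                System.out.println("no progress possible; loop would spin");
                break;
            }
        }
    }
}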

@markt-asf

Probably an outside chance but the root cause might be a bug in Tomcat's async error handling that has been fixed since 9.0.41. Since I can't repeat the failure with this test case, it might be worth trying to rebuild the test case with a 9.0.x snapshot from https://repository.apache.org/content/groups/snapshots/

@philsttr
Author

philsttr commented Jan 21, 2021

Thanks for investigating @markt-asf ! I'll take a look at the snapshot, and I'll experiment more to try to narrow down how to best reproduce. As you can imagine, I was experimenting with lots of different variables, so there might be some cases where it reproduces more easily. I'll get back with my results.

@philsttr
Author

philsttr commented Jan 21, 2021

I was able to reproduce this more easily with JSON payloads, whereas the original app used text/plain payloads. I don't think it is anything specific to JSON, but perhaps the serialization/deserialization adds subtle timing differences or something. Anyway, here's an updated project that handles JSON.

My load tests used the following request (note that any basic auth creds will work):

POST /echo HTTP/1.1
Host: yourhost:8443
Authorization: Basic Zm9vOmJhcg==
Accept: */*
Content-type: application/json
Content-Length: 22

{"echo":"Hello World"}

I haven't tried the latest tomcat snapshots yet. Will try that next.

@markt-asf

Thanks but no success so far here. How long does the test need to run before you see the issue?

@philsttr
Author

Usually reproduces within a few mins.

@philsttr
Author

I was able to reproduce the infinite loop with the latest tomcat 9.0-SNAPSHOT (reported as 9.0.42-dev on startup). tomcat-ssl-snaphot.zip

In addition, in each run, the app would eventually stop responding to requests during the load test. On some runs, the exception below was reported. However, this exception is not reported during every run, even though the app stopped responding on every run. I did not encounter the exception, or the app becoming unresponsive, with 9.0.41.

2021-01-21 19:34:21.615 ERROR 1 --- [nio-8443-exec-7] o.apache.coyote.http11.Http11Processor   : Error finishing request

java.lang.IllegalArgumentException: newPosition < 0: (-16362 < 0)
        at java.base/java.nio.Buffer.createPositionException(Buffer.java:318) ~[na:na]
        at java.base/java.nio.Buffer.position(Buffer.java:293) ~[na:na]
        at java.base/java.nio.ByteBuffer.position(ByteBuffer.java:1086) ~[na:na]
        at java.base/java.nio.ByteBuffer.position(ByteBuffer.java:262) ~[na:na]
        at org.apache.coyote.http11.Http11InputBuffer.endRequest(Http11InputBuffer.java:645) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.42-dev]
        at org.apache.coyote.http11.Http11Processor.endRequest(Http11Processor.java:1132) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.42-dev]
        at org.apache.coyote.http11.Http11Processor.dispatchEndRequest(Http11Processor.java:1093) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.42-dev]
        at org.apache.coyote.AbstractProcessor.dispatch(AbstractProcessor.java:263) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.42-dev]
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:59) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.42-dev]
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:887) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.42-dev]
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1691) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.42-dev]
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.42-dev]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.42-dev]
        at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]

Ideas on where to go from here? Can you try using the docker image I provided to ensure we're testing the exact same app? Any additional debugging I can do or data I can provide on my side?

@markt-asf

I'm unable to reproduce this after 20 mins. That stack trace is consistent with not following Servlet 2.3.3.4. Given the simplicity of the application, this looks like a library issue at the moment rather than a container issue.

@rstoyanchev
Contributor

rstoyanchev commented Jan 25, 2021

Thanks for the details @markt-asf.

This is a WebFlux application where the Servlet calls startAsync, registers a ReadListener/WriteListener, and then handles in a Reactive Streams pipeline in stages. It is well protected with respect to Servlet 2.3.3.4 through the use of Project Reactor and our own Servlet-to-Reactive-Streams bridge. Save for any bugs, there should be no concurrent access to the request or response, nor any access after the request completes.

As you said it is a basic scenario. Based on your analysis from Tomcat and mine at the framework level, I suspect a race between the request completing (perhaps for some exceptional reason) and an attempt to read the body as part of handling. @philsttr it could help to know what happens to the request (i.e. why it is complete when we are trying to read the body), perhaps through org.springframework.web and org.springframework.http logging. Unfortunately I have not reproduced it yet either.

From the stacktrace in AbstractListenerReadPublisher I see a Reactive Streams request signal to read from the request body. This is within a state machine that would not allow this to happen if the request had already completed. However I am now wondering what happens, or rather what should happen, if this attempt to read crosses with a notification from the Servlet container that the request has ended somehow (e.g. connection lost?) where the latter arrives second.

In other words say the application is about to read from the request body and wants to call request.getInputStream().isReady(). Meanwhile it may receive an AsyncListener callback for an error. I don't see a way we can guarantee we will not call isReady() if the request completes in parallel. A check for whether the request is complete just before the call to isReady() would not be sufficient and still leaves the possibility of a race. What do you think?

@rstoyanchev
Contributor

@philsttr note a recent change #26434 in this area of the code. It probably won't solve it but worth checking. Are you able to reproduce with more logging, e.g. org.springframework.web at DEBUG, to see what happens for such a request? I suspect a request is getting terminated from the client side and the connection getting closed races with the application trying to handle. There are also TRACE level loggers in the Servlet / Reactive Streams bridge, like the output (and log settings) under #26434.

@markt-asf I'm still interested in your feedback. In Servlet 3.1 non-blocking I/O, how can an application or framework guarantee no residual calls to ServletInputStream#isReady immediately after the request ends? We are generally protected against doing anything when the request has ended but in a race, the request could end immediately after such checks and before the call to isReady().

@markt-asf

If isReady() returns true then the request has not ended. If the client then drops the connection that will only be detected when the app tries to read the ServletInputStream. The app will see the IOException and then onError() will fire shortly afterwards.
One issue I have seen is that the app catches and swallows the exception so the framework doesn't see it. That causes the framework to carry on reading when it shouldn't. That was with Spring MVC but I'm guessing a similar thing could happen here. The fix there was getting the app to re-throw the IOException.
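As an illustration of that pitfall, here is a minimal, hypothetical ReadListener (not the Spring bridge) showing why the IOException from a read must be allowed to propagate so the container can fire onError().

import java.io.IOException;
import javax.servlet.ReadListener;
import javax.servlet.ServletInputStream;

class BodyReadListener implements ReadListener {

    private final ServletInputStream input;

    BodyReadListener(ServletInputStream input) {
        this.input = input;
    }

    @Override
    public void onDataAvailable() throws IOException {
        byte[] buffer = new byte[8192];
        while (input.isReady()) {
            int read = input.read(buffer); // may throw if the client dropped the connection
            if (read == -1) {
                return;
            }
            // process buffer[0..read) ...
            // Do NOT wrap the read in try/catch and carry on; letting the
            // IOException propagate is what allows onError() to fire.
        }
    }

    @Override
    public void onAllDataRead() {
        // no-op for this sketch
    }

    @Override
    public void onError(Throwable t) {
        // the container reports the failure here
    }
}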

@philsttr
Author

@rstoyanchev I'll retry the test with more logging and the latest snapshots next week. FWIW, I do recall the test client receiving some errors. I'll gather/provide more info on them after the next test. Sorry for the delay... pulled off onto other things. You know how it goes.

@rstoyanchev
Contributor

rstoyanchev commented Jan 29, 2021

@markt-asf the sequence I'm thinking about, and where I suspect a race, is where the request ends just before the call to isReady() so by the time we're in isReady(), the request has been recycled. Something like this in a tight sequence concurrently:

  1. Non-container thread grabs flag (internal state machine) so we know we have not received notification the request has ended, and is about to perform an isReady() check.
  2. At the same time a Container thread calls our AsyncListener#onError. So our internal state machine now advances to the final, completed state but the isReady() check has just missed that.
  3. Non-container thread is now inside isReady() where it runs into illegal state because the request has ended.

@markt-asf

That should not be possible. The container only calls AsyncListener#onError after an IOException has been thrown that was visible to the application. An IOException will only happen as a result of reading the request or writing the response. So either the application knows an error has occurred and should not have called isReady() again or the application hasn't seen an error, calls isReady(), proceeds to read/write, then gets an IOException and then receives the call to AsyncListener#onError.
In your sequence above, 2 won't happen until after 3 and in 3 the non-container thread will see an IOException when it tries to read/write.

@rstoyanchev
Contributor

Okay so in other words if the application does not attempt any I/O for a period of time, the container will not independently detect or notify of a lost connection?

One more question: what if an I/O error occurs while attempting to read or write from a non-container thread? In that case the exception will not propagate up to the container.

@markt-asf

Certainly after isReady() has returned true. Before then, the container may detect an error but in that case isReady() should return false.

In the case of a non-container thread, the container is aware of the error because it will have thrown it (from the InputStream or similar). The assumption is the non-container thread exits ASAP after the exception (without using the request, response, etc.) and the onError() event fires shortly afterwards on a container thread. You can get conflicts if the non-container thread continues to use the request, response, etc. objects after the exception.

@philsttr
Author

philsttr commented Feb 5, 2021

@rstoyanchev I didn't get a chance this week to re-run the tests. My apologies. I haven't forgotten. Hopefully next week will be better.

@rstoyanchev
Contributor

No problem. Same here, I was planning to have another look, so I'm not waiting on you.

@rstoyanchev
Contributor

rstoyanchev commented Feb 11, 2021

@philsttr one suggestion, you can enable detailed logging in the Servlet - Reactive Streams bridge:

logging:
  level:
    _org.springframework.http.server.reactive.AbstractListenerReadPublisher: TRACE
    _org.springframework.http.server.reactive.AbstractListenerWriteProcessor: TRACE
    _org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor: TRACE
    _org.springframework.http.server.reactive: TRACE
Successful Request Sample Output
2021-02-11 18:28:57.955 DEBUG 502665 --- [nio-8443-exec-5] o.s.w.s.adapter.HttpWebHandlerAdapter    : [2fdbe24a] HTTP POST "/echo"
2021-02-11 18:28:57.986 DEBUG 502665 --- [oundedElastic-1] o.s.w.s.s.DefaultWebSessionManager       : Created new WebSession.
2021-02-11 18:28:57.986 TRACE 502665 --- [nio-8443-exec-5] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] onDataAvailable
2021-02-11 18:28:57.990 DEBUG 502665 --- [oundedElastic-1] o.s.s.w.s.u.m.OrServerWebExchangeMatcher : Trying to match using PathMatcherServerWebExchangeMatcher{pattern='/logout', method=POST}
2021-02-11 18:28:57.990 DEBUG 502665 --- [oundedElastic-1] athPatternParserServerWebExchangeMatcher : Request 'POST /echo' doesn't match 'POST /logout'
2021-02-11 18:28:57.991 DEBUG 502665 --- [oundedElastic-1] o.s.s.w.s.u.m.OrServerWebExchangeMatcher : No matches found
2021-02-11 18:28:57.993 DEBUG 502665 --- [oundedElastic-1] o.s.s.w.s.u.m.OrServerWebExchangeMatcher : Trying to match using PathMatcherServerWebExchangeMatcher{pattern='/asterion/probe', method=null}
2021-02-11 18:28:57.993 DEBUG 502665 --- [oundedElastic-1] athPatternParserServerWebExchangeMatcher : Request 'POST /echo' doesn't match 'null /asterion/probe'
2021-02-11 18:28:57.993 DEBUG 502665 --- [oundedElastic-1] o.s.s.w.s.u.m.OrServerWebExchangeMatcher : No matches found
2021-02-11 18:28:57.993 DEBUG 502665 --- [oundedElastic-1] a.DelegatingReactiveAuthorizationManager : Checking authorization on '/echo' using org.springframework.security.authorization.AuthenticatedReactiveAuthorizationManager@124cf06b
2021-02-11 18:28:57.995 DEBUG 502665 --- [oundedElastic-1] o.s.s.w.s.a.AuthorizationWebFilter       : Authorization successful
2021-02-11 18:28:57.997 DEBUG 502665 --- [oundedElastic-1] s.w.r.r.m.a.RequestMappingHandlerMapping : [2fdbe24a] Mapped to com.example.demo.EchoController#echo(EchoPayload)
2021-02-11 18:28:58.005 DEBUG 502665 --- [oundedElastic-1] .r.m.a.RequestBodyMethodArgumentResolver : [2fdbe24a] Content-Type:application/json
2021-02-11 18:28:58.025 DEBUG 502665 --- [oundedElastic-1] .r.m.a.RequestBodyMethodArgumentResolver : [2fdbe24a] 0..1 [com.example.demo.EchoController$EchoPayload]
2021-02-11 18:28:58.030 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] UNSUBSCRIBED -> SUBSCRIBING
2021-02-11 18:28:58.031 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] 9223372036854775807 requested
2021-02-11 18:28:58.031 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] SUBSCRIBING -> DEMAND
2021-02-11 18:28:58.031 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] onDataAvailable
2021-02-11 18:28:58.031 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] DEMAND -> READING
2021-02-11 18:28:58.031 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] Read 13 bytes
2021-02-11 18:28:58.031 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] Publishing data read
2021-02-11 18:28:58.032 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] onAllDataRead
2021-02-11 18:28:58.032 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] READING -> COMPLETED
2021-02-11 18:28:58.040 TRACE 502665 --- [oundedElastic-1] o.s.http.codec.json.Jackson2JsonDecoder  : [2fdbe24a] Decoded [com.example.demo.EchoController$EchoPayload@320b2ea2]
2021-02-11 18:28:58.041 DEBUG 502665 --- [oundedElastic-1] com.example.demo.EchoController          : echo
2021-02-11 18:28:58.050 DEBUG 502665 --- [oundedElastic-1] o.s.w.r.r.m.a.ResponseBodyResultHandler  : [2fdbe24a] Using 'application/json' given [*/*] and supported [application/json, application/*+json, application/x-ndjson, text/event-stream]
2021-02-11 18:28:58.050 DEBUG 502665 --- [oundedElastic-1] o.s.w.r.r.m.a.ResponseBodyResultHandler  : [2fdbe24a] 0..1 [com.example.demo.EchoController$EchoPayload]
2021-02-11 18:28:58.053  INFO 502665 --- [oundedElastic-1] reactor.Mono.Delay.1                     : onSubscribe(MonoDelay.MonoDelayRunnable)
2021-02-11 18:28:58.054  INFO 502665 --- [oundedElastic-1] reactor.Mono.Delay.1                     : request(unbounded)
2021-02-11 18:28:58.055 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] No more data to read
2021-02-11 18:28:58.256  INFO 502665 --- [     parallel-1] reactor.Mono.Delay.1                     : onNext(0)
2021-02-11 18:28:58.256  INFO 502665 --- [     parallel-1] reactor.Mono.Delay.1                     : onComplete()
2021-02-11 18:28:58.258 TRACE 502665 --- [     parallel-1] o.s.http.codec.json.Jackson2JsonEncoder  : [2fdbe24a] Encoding [com.example.demo.EchoController$EchoPayload@320b2ea2]
2021-02-11 18:28:58.270 TRACE 502665 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [2fdbe24a] UNSUBSCRIBED -> REQUESTED
2021-02-11 18:28:58.271 TRACE 502665 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [2fdbe24a] Received onNext publisher
2021-02-11 18:28:58.271 TRACE 502665 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [2fdbe24a] REQUESTED -> RECEIVED
2021-02-11 18:28:58.273 TRACE 502665 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [2fdbe24a] UNSUBSCRIBED -> REQUESTED
2021-02-11 18:28:58.273 TRACE 502665 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [2fdbe24a] Item to write
2021-02-11 18:28:58.273 TRACE 502665 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [2fdbe24a] REQUESTED -> RECEIVED
2021-02-11 18:28:58.273 TRACE 502665 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [2fdbe24a] onWritePossible
2021-02-11 18:28:58.273 TRACE 502665 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [2fdbe24a] RECEIVED -> WRITING
2021-02-11 18:28:58.273 TRACE 502665 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [2fdbe24a] Wrote 13 of 13 bytes
2021-02-11 18:28:58.273 TRACE 502665 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [2fdbe24a] WRITING -> REQUESTED
2021-02-11 18:28:58.274 TRACE 502665 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [2fdbe24a] No more items to write
2021-02-11 18:28:58.274 TRACE 502665 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [2fdbe24a] REQUESTED -> RECEIVED
2021-02-11 18:28:58.274 TRACE 502665 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [2fdbe24a] onWritePossible
2021-02-11 18:28:58.274 TRACE 502665 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [2fdbe24a] RECEIVED -> COMPLETED
2021-02-11 18:28:58.274 TRACE 502665 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [2fdbe24a] UNSUBSCRIBED publishComplete
2021-02-11 18:28:58.275 TRACE 502665 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [2fdbe24a] UNSUBSCRIBED subscribe: org.springframework.http.server.reactive.AbstractListenerWriteFlushProcessor$State$WriteResultSubscriber@550faa4b
2021-02-11 18:28:58.275 TRACE 502665 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [2fdbe24a] SUBSCRIBING request: 9223372036854775807
2021-02-11 18:28:58.275 TRACE 502665 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [2fdbe24a] SUBSCRIBED publishComplete
2021-02-11 18:28:58.275 TRACE 502665 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [2fdbe24a] RECEIVED writeComplete
2021-02-11 18:28:58.275 TRACE 502665 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [2fdbe24a] Flush attempt
2021-02-11 18:28:58.277 TRACE 502665 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [2fdbe24a] RECEIVED -> REQUESTED
2021-02-11 18:28:58.277 TRACE 502665 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [2fdbe24a] Received onComplete
2021-02-11 18:28:58.277 TRACE 502665 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [2fdbe24a] REQUESTED -> COMPLETED
2021-02-11 18:28:58.277 TRACE 502665 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [2fdbe24a] UNSUBSCRIBED publishComplete
2021-02-11 18:28:58.277 TRACE 502665 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [2fdbe24a] UNSUBSCRIBED subscribe: reactor.core.publisher.MonoNext$NextSubscriber@3552501d
2021-02-11 18:28:58.277 TRACE 502665 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [2fdbe24a] SUBSCRIBING request: 9223372036854775807
2021-02-11 18:28:58.277 TRACE 502665 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [2fdbe24a] SUBSCRIBED publishComplete
2021-02-11 18:28:58.278 DEBUG 502665 --- [     parallel-1] o.s.w.s.adapter.HttpWebHandlerAdapter    : [2fdbe24a] Completed 200 OK
2021-02-11 18:28:58.278 TRACE 502665 --- [     parallel-1] o.s.h.s.r.ServletHttpHandlerAdapter      : [2fdbe24a] Handling completed
2021-02-11 18:28:58.279 TRACE 502665 --- [nio-8443-exec-6] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] onAllDataRead
2021-02-11 18:28:58.280 TRACE 502665 --- [nio-8443-exec-6] .s.r.AbstractListenerWriteFlushProcessor : [2fdbe24a] Received request to cancel
2021-02-11 18:28:58.280 TRACE 502665 --- [nio-8443-exec-6] .s.r.AbstractListenerWriteFlushProcessor : [2fdbe24a] Received onComplete

Each log message has a unique id so that, as long as you can identify one, you can grep for all of them and track what happened with the request. That way, if you capture the scenario, we can see what happens at that level.

@markt-asf, from the stack trace in the original comment, AbstractListenerReadPublisher.java:333 shows we're in the SUBSCRIBING state at the start of reading the body, right when the first Reactive Streams "request" signal is received. This is quite early on in request handling and our very first call to ServletInputStream#isReady. This makes it hard to imagine how we could have seen an I/O error at that point or done anything else to use the Servlet API incorrectly.

What I can say for the scenario in general, and you'll see this in the log output (for a successful request) that I pasted, is that we start on a Tomcat thread where we call startAsync, register read/write listeners, and then in the WebFilter chain (similar to a Servlet Filter chain but reactive style) switch to an application thread to absorb a blocking call to Random for a session id. In the meantime Tomcat calls onDataAvailable and a little later, the application thread starts reading right around here:

2021-02-11 18:28:58.030 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] UNSUBSCRIBED -> SUBSCRIBING
2021-02-11 18:28:58.031 TRACE 502665 --- [oundedElastic-1] _.s.h.s.r.AbstractListenerReadPublisher  : [2fdbe24a] 9223372036854775807 requested

In short, in this case we're likely to start reading on an application thread, if the first onDataAvailable arrives before we get around to reading.

@markt-asf

We fixed a Tomcat bug yesterday that might be contributing (or possibly causing) this. The ReadListener.onError() event was not getting triggered if the client dropped the connection in some configurations. A fixed version of Tomcat 9 embedded is available from the ASF snapshot repo. If you can test with the latest snapshot that would be helpful.

rstoyanchev added a commit that referenced this issue Feb 26, 2021
This commit ensures handling is cancelled in case of onError/Timeout
callback from the Servlet container.

Separately we detect the same in ServletServerHttpRequest and
ServletServerHttpResponse, which signal onError to the read publisher
and cancel writing, but if the onError/Timeout arrives after reading
is done and before writing has started (e.g. longer handling), then
neither will reach handling.

See gh-26434, gh-26407
@rstoyanchev
Contributor

@philsttr there has been a significant investigation under #26434 with fixes in both Tomcat and WebFlux that I believe have a very good chance of addressing the issue here. Would you be able to give it another try with Tomcat 9.0-SNAPSHOT and Spring Framework 5.3.5-SNAPSHOT?

@philsttr
Author

philsttr commented Mar 5, 2021

That's awesome news! Yes, I should be able to get back to this soon. Again sorry for delay.

@philsttr
Author

philsttr commented Mar 10, 2021

I was finally able to get some time today to work on this.

As a baseline, I first confirmed that I could reproduce what I originally reported using the original spring-boot/spring/tomcat versions.

Then, I upgraded the sample app that I included earlier to:

  • spring-boot 2.4.4-SNAPSHOT
  • spring-framework 5.3.5-SNAPSHOT
  • tomcat 9.0-SNAPSHOT

After upgrading, I was not able to reproduce the infinite loop. However, during the load tests, some new exceptions appeared on the server log, which seem related...

2021-03-10 01:53:01.617 ERROR 1 --- [undedElastic-24] reactor.core.publisher.Operators         : Operator called default onErrorDropped

java.lang.IllegalStateException: The state of the destination buffers changed concurrently while unwrapping bytes
	at org.apache.tomcat.util.net.openssl.OpenSSLEngine.unwrap(OpenSSLEngine.java:573) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at java.base/javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:637) ~[na:na]
	at org.apache.tomcat.util.net.SecureNioChannel.read(SecureNioChannel.java:642) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.fillReadBuffer(NioEndpoint.java:1343) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.read(NioEndpoint.java:1255) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at org.apache.coyote.http11.Http11InputBuffer.fill(Http11InputBuffer.java:794) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]

and

2021-03-10 01:53:01.664 ERROR 1 --- [nio-8443-exec-1] o.apache.coyote.http11.Http11Processor   : Error finishing request

java.lang.IllegalArgumentException: newPosition < 0: (-16064 < 0)
	at java.base/java.nio.Buffer.createPositionException(Buffer.java:318) ~[na:na]
	at java.base/java.nio.Buffer.position(Buffer.java:293) ~[na:na]
	at java.base/java.nio.ByteBuffer.position(ByteBuffer.java:1086) ~[na:na]
	at java.base/java.nio.ByteBuffer.position(ByteBuffer.java:262) ~[na:na]
	at org.apache.coyote.http11.Http11InputBuffer.endRequest(Http11InputBuffer.java:645) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]

Full stacks in tomcat-ssl-snapshot-log.txt.

My next step was to enable the additional logging requested by @rstoyanchev.
After enabling the logging, I unfortunately could not reproduce the first exception from above. Perhaps the additional logging affected the timings.
I could reproduce the second exception from above, as seen in tomcat-ssl-snapshot-trace-log.txt.

@rstoyanchev
Contributor

rstoyanchev commented Mar 10, 2021

The ISE appears to be the defensive code added as part of https://bz.apache.org/bugzilla/show_bug.cgi?id=64771, so it looks like the issue is still present, but at least it doesn't cause a loop.

@markt-asf what happens if there is a connection issue very early? Let's say a ReadListener is registered but has not yet called isReady. If it is possible to receive onError at that stage, then I think I see how the issue can occur.

For the IAE, in the trace log I search for the last log message on the same thread nio-8443-exec-1 and then use its request id 404c008c to find the output for the request:

2021-03-10 02:13:03.414 TRACE 1 --- [undedElastic-36] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] UNSUBSCRIBED -> SUBSCRIBING
2021-03-10 02:13:03.414 TRACE 1 --- [undedElastic-36] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] request Long.MAX_VALUE
2021-03-10 02:13:03.414 TRACE 1 --- [undedElastic-36] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] SUBSCRIBING -> DEMAND
2021-03-10 02:13:03.414 TRACE 1 --- [nio-8443-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] onDataAvailable
2021-03-10 02:13:03.414 TRACE 1 --- [nio-8443-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] DEMAND -> READING
2021-03-10 02:13:03.414 TRACE 1 --- [nio-8443-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] Read 22 bytes
2021-03-10 02:13:03.414 TRACE 1 --- [nio-8443-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] Publishing DefaultDataBuffer
2021-03-10 02:13:03.414 TRACE 1 --- [nio-8443-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] No more to read
2021-03-10 02:13:03.414 TRACE 1 --- [nio-8443-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] READING -> DEMAND
2021-03-10 02:13:03.414 TRACE 1 --- [nio-8443-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] onAllDataRead [DEMAND]
2021-03-10 02:13:03.414 TRACE 1 --- [nio-8443-exec-4] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] DEMAND -> COMPLETED
2021-03-10 02:13:03.502 TRACE 1 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [404c008c] UNSUBSCRIBED -> REQUESTED
2021-03-10 02:13:03.502 TRACE 1 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [404c008c] onNext: "write" Publisher
2021-03-10 02:13:03.502 TRACE 1 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [404c008c] REQUESTED -> RECEIVED
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [404c008c] UNSUBSCRIBED -> REQUESTED
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [404c008c] onNext: DefaultDataBuffer
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [404c008c] REQUESTED -> RECEIVED
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [404c008c] onWritePossible
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [404c008c] RECEIVED -> WRITING
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [404c008c] Wrote 22 of 22 bytes
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [404c008c] WRITING -> REQUESTED
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [404c008c] onComplete [REQUESTED]
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [404c008c] REQUESTED -> RECEIVED
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [404c008c] onWritePossible
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.r.AbstractListenerWriteProcessor : [404c008c] RECEIVED -> COMPLETED
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [404c008c] [WP] completed [UNSUBSCRIBED]
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [404c008c] [WP] got subscriber ResponseBodyFlushProcessor-WriteResultSubscriber
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [404c008c] [WP] request Long.MAX_VALUE
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [404c008c] current "write" Publisher completed
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [404c008c] flushing
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [404c008c] RECEIVED -> REQUESTED
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [404c008c] onComplete [REQUESTED]
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] .s.r.AbstractListenerWriteFlushProcessor : [404c008c] REQUESTED -> COMPLETED
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [404c008c] [WFP] completed [UNSUBSCRIBED]
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [404c008c] [WFP] got subscriber reactor.core.publisher.MonoNext$NextSubscriber@257016cf
2021-03-10 02:13:03.503 TRACE 1 --- [     parallel-1] _.s.h.s.reactive.WriteResultPublisher    : [404c008c] [WFP] request Long.MAX_VALUE
2021-03-10 02:13:03.503 TRACE 1 --- [nio-8443-exec-1] _.s.h.s.r.AbstractListenerReadPublisher  : [404c008c] onAllDataRead [COMPLETED]
2021-03-10 02:13:03.503 TRACE 1 --- [nio-8443-exec-1] .s.r.AbstractListenerWriteFlushProcessor : [404c008c] cancel [COMPLETED]
2021-03-10 02:13:03.503 TRACE 1 --- [nio-8443-exec-1] .s.r.AbstractListenerWriteFlushProcessor : [404c008c] onComplete [COMPLETED]

2021-03-10 02:13:03.504 ERROR 1 --- [nio-8443-exec-1] o.apache.coyote.http11.Http11Processor   : Error finishing request

java.lang.IllegalArgumentException: newPosition < 0: (-16362 < 0)
	at java.base/java.nio.Buffer.createPositionException(Buffer.java:318) ~[na:na]
	at java.base/java.nio.Buffer.position(Buffer.java:293) ~[na:na]
	at java.base/java.nio.ByteBuffer.position(ByteBuffer.java:1086) ~[na:na]
	at java.base/java.nio.ByteBuffer.position(ByteBuffer.java:262) ~[na:na]
	at org.apache.coyote.http11.Http11InputBuffer.endRequest(Http11InputBuffer.java:645) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at org.apache.coyote.http11.Http11Processor.endRequest(Http11Processor.java:1132) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at org.apache.coyote.http11.Http11Processor.dispatchEndRequest(Http11Processor.java:1093) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at org.apache.coyote.AbstractProcessor.dispatch(AbstractProcessor.java:263) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:59) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:893) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1707) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) ~[tomcat-embed-core-9.0-SNAPSHOT.jar!/:9.0.45-dev]
	at java.base/java.lang.Thread.run(Thread.java:834) ~[na:na]

The output shows a request that has read its input and written its output without any issues. I don't actually see the completion message, for which logging for ServletHttpHandlerAdapter needs to be enabled, but the last 3 lines show that the completion was delegated to the reading and writing side which are in COMPLETED state.

I'm not 100% sure if the exception is for the exact same request. I'm basically assuming that when the IAE happened it was a continuation of request processing on the same thread, and the log messages are only a millisecond apart, but nevertheless there is a chance I suppose they are unrelated. It would be a very useful feature if Tomcat exposed a request id of some sort that it also uses for logging and exceptions, and that an application can also use so that log messages can be correlated more reliably, especially in such async scenarios.

@philsttr
Author

I introduced an async log appender, and was able to reproduce the java.lang.IllegalStateException: The state of the destination buffers changed concurrently while unwrapping bytes.

I also enabled logging of ServletHttpHandlerAdapter.

See tomcat-ssl-snapshot-trace-log.2.txt

@rstoyanchev
Contributor

rstoyanchev commented Mar 10, 2021

Great. This is what I extracted for the failing request. I've inserted some comments:

2021-03-10 17:52:48.703 TRACE 1 --- [undedElastic-68] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] UNSUBSCRIBED -> SUBSCRIBING
2021-03-10 17:52:48.703 TRACE 1 --- [undedElastic-68] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] request Long.MAX_VALUE
2021-03-10 17:52:48.703 TRACE 1 --- [undedElastic-68] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] SUBSCRIBING -> DEMAND

# The ISE happens here next (on boundedElastic-68), but in the log it appears a millisecond later

2021-03-10 17:52:48.703 TRACE 1 --- [nio-8443-exec-2] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] onDataAvailable
2021-03-10 17:52:48.703 TRACE 1 --- [nio-8443-exec-2] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] DEMAND -> READING
2021-03-10 17:52:48.704 TRACE 1 --- [nio-8443-exec-2] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] Read 22 bytes
2021-03-10 17:52:48.704 TRACE 1 --- [nio-8443-exec-2] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] Publishing DefaultDataBuffer
2021-03-10 17:52:48.704 TRACE 1 --- [nio-8443-exec-2] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] No more to read
2021-03-10 17:52:48.704 TRACE 1 --- [nio-8443-exec-2] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] READING -> DEMAND
2021-03-10 17:52:48.704 TRACE 1 --- [nio-8443-exec-2] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] onAllDataRead [DEMAND]
2021-03-10 17:52:48.704 TRACE 1 --- [nio-8443-exec-2] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] DEMAND -> COMPLETED

# Reading is done here

2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] .s.r.AbstractListenerWriteFlushProcessor : [20c46a70] UNSUBSCRIBED -> REQUESTED
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] .s.r.AbstractListenerWriteFlushProcessor : [20c46a70] onNext: "write" Publisher
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] .s.r.AbstractListenerWriteFlushProcessor : [20c46a70] REQUESTED -> RECEIVED
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.r.AbstractListenerWriteProcessor : [20c46a70] UNSUBSCRIBED -> REQUESTED
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.r.AbstractListenerWriteProcessor : [20c46a70] onNext: DefaultDataBuffer
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.r.AbstractListenerWriteProcessor : [20c46a70] REQUESTED -> RECEIVED
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.r.AbstractListenerWriteProcessor : [20c46a70] onWritePossible
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.r.AbstractListenerWriteProcessor : [20c46a70] RECEIVED -> WRITING
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.r.AbstractListenerWriteProcessor : [20c46a70] Wrote 22 of 22 bytes
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.r.AbstractListenerWriteProcessor : [20c46a70] WRITING -> REQUESTED
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.r.AbstractListenerWriteProcessor : [20c46a70] onComplete [REQUESTED]
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.r.AbstractListenerWriteProcessor : [20c46a70] REQUESTED -> RECEIVED
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.r.AbstractListenerWriteProcessor : [20c46a70] onWritePossible
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.r.AbstractListenerWriteProcessor : [20c46a70] RECEIVED -> COMPLETED

# Writing is done

2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.reactive.WriteResultPublisher    : [20c46a70] [WP] completed [UNSUBSCRIBED]
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.reactive.WriteResultPublisher    : [20c46a70] [WP] got subscriber ResponseBodyFlushProcessor-WriteResultSubscriber
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.reactive.WriteResultPublisher    : [20c46a70] [WP] request Long.MAX_VALUE
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] .s.r.AbstractListenerWriteFlushProcessor : [20c46a70] current "write" Publisher completed
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] .s.r.AbstractListenerWriteFlushProcessor : [20c46a70] flushing
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] .s.r.AbstractListenerWriteFlushProcessor : [20c46a70] RECEIVED -> REQUESTED
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] .s.r.AbstractListenerWriteFlushProcessor : [20c46a70] onComplete [REQUESTED]
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] .s.r.AbstractListenerWriteFlushProcessor : [20c46a70] REQUESTED -> COMPLETED

# Flush after writing is done

2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.reactive.WriteResultPublisher    : [20c46a70] [WFP] completed [UNSUBSCRIBED]
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.reactive.WriteResultPublisher    : [20c46a70] [WFP] got subscriber reactor.core.publisher.MonoNext$NextSubscriber@58c96c42
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] _.s.h.s.reactive.WriteResultPublisher    : [20c46a70] [WFP] request Long.MAX_VALUE
2021-03-10 17:52:48.758 TRACE 1 --- [     parallel-2] o.s.h.s.r.ServletHttpHandlerAdapter      : [20c46a70] onComplete

# Request handling completed

2021-03-10 17:52:48.758 TRACE 1 --- [nio-8443-exec-7] _.s.h.s.r.AbstractListenerReadPublisher  : [20c46a70] onAllDataRead [COMPLETED]
2021-03-10 17:52:48.758 TRACE 1 --- [nio-8443-exec-7] .s.r.AbstractListenerWriteFlushProcessor : [20c46a70] cancel [COMPLETED]
2021-03-10 17:52:48.758 TRACE 1 --- [nio-8443-exec-7] .s.r.AbstractListenerWriteFlushProcessor : [20c46a70] onComplete [COMPLETED]

In the beginning we are on "boundedElastic-68", which was probably used to create the WebFlux session id without blocking the Tomcat thread. We've just received a demand signal to start reading, so we check isReady(), which leads to the ISE. In parallel, we get onDataAvailable from Tomcat and start reading. We complete reading and writing, and then request handling completes as a normal request. The onErrorDropped then means the exception couldn't propagate downstream, not because the downstream was cancelled, but because the downstream had finished reading the request body in parallel, and hence that publisher was complete.

This is surprising because my assumption so far was that somehow there was an attempt to access a recycled request. However the output here shows the request completing normally and we even get callbacks from Tomcat along the way.

The only catch again might be that the ISE is not related to the previous log messages on the same thread but this seems unlikely because it would have been preceded by other log messages (before the isReady call) and also the stacktrace matches the preceding log message sequence.

@markt-asf

Thanks for the additional analysis. I can see a code path that could cause difficulty. It would be helpful if someone who understands the details of WebFlux could confirm whether this is consistent with what is being seen here.

On a container thread the following happens:

  • Tomcat receives a new POST request
  • Tomcat passes the request to the appropriate Servlet
  • Servlet calls req.startAsync()
  • Servlet calls req.getInputStream().setReadListener(rl)
  • As part of the call to setReadListener() Tomcat checks to see if there is any data to be read and either calls onDataAvailable() if there is data to be read or registers interest in READ events so it is notified when data arrives.

If a separate thread calls req.getInputStream().isReady() while the final step above is in progress, it is possible that both threads will attempt to read data from the socket at the same time. Essentially, this could happen if the request is registered with some background / non-container thread before calling setReadListener() and then that background thread happens to call req.getInputStream().isReady() at just the wrong point.

I think this is a Tomcat bug but I want to double-check the language in the Servlet spec on concurrency / thread-safety around async processing in case there is anything there that would prohibit an application using the API this way.
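For illustration only, a minimal sketch of the interleaving described above; this is not how the WebFlux bridge is structured (see the next comment), it just shows the window in which two threads could reach the socket.

import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import javax.servlet.AsyncContext;
import javax.servlet.ReadListener;
import javax.servlet.ServletInputStream;
import javax.servlet.http.HttpServletRequest;

class EarlyIsReadyHazard {

    void service(HttpServletRequest request, ReadListener listener) throws IOException {
        AsyncContext asyncContext = request.startAsync();
        asyncContext.setTimeout(-1);

        ServletInputStream input = request.getInputStream();

        // Hazard: the stream is handed to a background thread before
        // setReadListener() has returned.
        CompletableFuture.runAsync(() -> {
            if (input.isReady()) {
                // This check can race with the container's own
                // "is data already available?" check inside setReadListener(),
                // and both threads may then try to read from the socket.
            }
        });

        // Inside this call the container either invokes onDataAvailable()
        // or registers read interest -- potentially concurrently with the
        // isReady() call above.
        input.setReadListener(listener);
    }
}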

@rstoyanchev
Contributor

rstoyanchev commented Mar 22, 2021

Those steps should happen before WebFlux handling. The Servlet does this in its service method:

  • AsyncContext asyncContext = request.startAsync()
    asyncContext.setTimeout(-1)
  • new ServletServerHttpRequest
    • request.getInputStream().setReadListener
  • new ServletServerHttpResponse
    • response.getOutputStream().setWriteListener
  • asyncContext.addListener
  • HttpHandler subscribe <= kick off WebFlux WebFilter-WebHandler chain, possibly and likely async

In other words WebFlux handling starts only after the read and write listeners and the async listener have been registered.
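A condensed sketch of that sequence, in the order listed above; the class name and the no-op listener bodies are stand-ins (the real class is ServletHttpHandlerAdapter).

import java.io.IOException;
import javax.servlet.AsyncContext;
import javax.servlet.AsyncEvent;
import javax.servlet.AsyncListener;
import javax.servlet.ReadListener;
import javax.servlet.WriteListener;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

class ReactiveBridgeServlet extends HttpServlet {

    @Override
    protected void service(HttpServletRequest request, HttpServletResponse response)
            throws IOException {

        // 1. Switch to async mode with no timeout.
        AsyncContext asyncContext = request.startAsync();
        asyncContext.setTimeout(-1);

        // 2. and 3. Register read and write listeners before any handling starts.
        request.getInputStream().setReadListener(new ReadListener() {
            @Override public void onDataAvailable() { /* notify the read publisher */ }
            @Override public void onAllDataRead() { /* complete the read publisher */ }
            @Override public void onError(Throwable t) { /* error the read publisher */ }
        });
        response.getOutputStream().setWriteListener(new WriteListener() {
            @Override public void onWritePossible() { /* notify the write processor */ }
            @Override public void onError(Throwable t) { /* error the write processor */ }
        });

        // 4. Register the AsyncListener for onError/onTimeout/onComplete.
        asyncContext.addListener(new AsyncListener() {
            @Override public void onComplete(AsyncEvent event) { }
            @Override public void onTimeout(AsyncEvent event) { }
            @Override public void onError(AsyncEvent event) { }
            @Override public void onStartAsync(AsyncEvent event) { }
        });

        // 5. Only now is the (possibly asynchronous) WebFilter/WebHandler chain
        //    subscribed; it may continue on non-container threads.
    }
}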

@markt-asf

Thanks. I've dug a little further and there are other ways for concurrent calls to isReady() to cause problems. I'm still investigating potential fixes.

@markt-asf

I am working on handling concurrent calls to isReady() but I have a follow-up question. In the stack trace reported by the OP, when the "boundedElastic-68" thread calls isReady(), is there any code path where that same thread could then try reading from the request? My concern is that if isReady() on that thread returns true at the same time as onDataAvailable() is fired, there could then be two threads trying to read from the request.

@rstoyanchev
Contributor

The request body is consumed through a Publisher exposed on our ServerHttpRequest#getBody and that is typically invoked once by the framework to prepare the @RequestBody argument for the controller method it is about to invoke. Or if that argument is declared as Mono<Foo> then the reading is triggered when the Mono is subscribed to. I expect there should be only a single thread reading the request body. The I/O is done with a state machine that further protects against racing between Reactive Streams (demand, cancel, onNext, etc) signals and callbacks from Tomcat. Hopefully this answers your question.
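To illustrate the two consumption styles just described, a small hypothetical controller (the payload type is made up for the example, not taken from the sample app):

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@RestController
class BodyStylesController {

    // The framework resolves the argument up front: it subscribes to the
    // body publisher once, decodes it, and then invokes the method.
    @PostMapping("/materialized")
    Mono<EchoPayload> materialized(@RequestBody EchoPayload payload) {
        return Mono.just(payload);
    }

    // Reading is deferred: it starts only when this Mono is subscribed to,
    // as part of assembling and subscribing the response pipeline.
    @PostMapping("/deferred")
    Mono<EchoPayload> deferred(@RequestBody Mono<EchoPayload> payload) {
        return payload;
    }

    // Hypothetical payload type for the example.
    static class EchoPayload {
        public String echo;
    }
}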

@markt-asf

@rstoyanchev
That helps, thanks.

@philsttr
I've made some changes to Tomcat. The snapshot should be available in an hour or so. Please retest and report back.

@rstoyanchev
Contributor

@philsttr I'm wondering if you've been able to check if this has been resolved, or did you run into #26834 while checking?

@philsttr
Author

I haven't checked this one yet, unfortunately. Hope to get to it soon. I noticed that the changes to Tomcat have been released in 9.0.45, so I'll test with that version.

I ran into #26834 in a different app.

@philsttr
Author

philsttr commented Apr 21, 2021

Good news! With tomcat 9.0.45, spring-boot 2.4.5, spring 5.3.6 I could not reproduce the original infinite loop, or the IllegalArgumentException/IllegalStateException, or any other exception during my load tests.

I also confirmed that I was still able to reproduce the original problem with the older versions, to make sure nothing had changed in my test setup.

So, that gives me confidence that this problem has been resolved in the latest versions.

Thank you so much @rstoyanchev and @markt-asf ! It's really impressive when engineers from three organizations can come together and fix what I consider to be a pretty complex problem like this. You two have been so responsive and helpful. I wish I could say the same for my responsiveness, but I hope you can forgive me.

@markt-asf

Thanks for confirming the fix. No need to apologise - we all have different priorities to work with.

@rstoyanchev
Contributor

I echo that, the time taken to report, investigate, and test is much appreciated. Thanks!

@rstoyanchev rstoyanchev added status: superseded An issue that has been superseded by another and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Apr 22, 2021
@rstoyanchev
Contributor

I'm marking this as superseded by #26434 since the investigation for both went hand in hand, with the fixes for both released in 5.3.5.

@aslamsayyed

@markt-asf
Curious to know if this change apache/tomcat@d4b340f#diff-23d7da71d93be5f16ed5c40f41f3132562ecbdfa27e60bed494e4786703c409a
fixed the issue

@markt-asf

That may have been part of it but there were multiple fixes involved in both Tomcat and WebFlux. I don't have an exact list.

@aslamsayyed

aslamsayyed commented Jul 22, 2021

I am having the same issue with high CPU; thread dump attached. The stack is Spring 4.3.0 (not using WebFlux), running on Tomcat 8.5.39. It looks like an upgrade will help based on your commit, so I'm asking to confirm.

Thanks for reply @markt-asf

"https-openssl-nio-443-exec-10" # 126 daemon prio=5 os_prio=0 cpu=67095669.07ms elapsed=846595.29s tid=0x00007f46a991d000 nid=0x226c runnable [0x00007f463419d000]
java.lang.Thread.State: RUNNABLE
at jdk.internal.misc.Unsafe.setMemory0(java.base@11.0.11/Native Method)
at jdk.internal.misc.Unsafe.setMemory(java.base@11.0.11/Unsafe.java:724)
at jdk.internal.misc.Unsafe.setMemory(java.base@11.0.11/Unsafe.java:735)
at java.nio.DirectByteBuffer.<init>(java.base@11.0.11/DirectByteBuffer.java:127)
at java.nio.ByteBuffer.allocateDirect(java.base@11.0.11/ByteBuffer.java:317)
at org.apache.tomcat.util.net.openssl.OpenSSLEngine.readPlaintextData(OpenSSLEngine.java:341)
at org.apache.tomcat.util.net.openssl.OpenSSLEngine.unwrap(OpenSSLEngine.java:594)
- locked <0x000000052a3d0680> (a org.apache.tomcat.util.net.openssl.OpenSSLEngine)
at javax.net.ssl.SSLEngine.unwrap(java.base@11.0.11/SSLEngine.java:637)
at org.apache.tomcat.util.net.SecureNioChannel.read(SecureNioChannel.java:586)
at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.fillReadBuffer(NioEndpoint.java:1287)
at org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper.read(NioEndpoint.java:1223)
at org.apache.coyote.http11.Http11InputBuffer.fill(Http11InputBuffer.java:731)
at org.apache.coyote.http11.Http11InputBuffer.parseRequestLine(Http11InputBuffer.java:368)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:684)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:806)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
- locked <0x00000005279812d0> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)

Zoran0104 pushed a commit to Zoran0104/spring-framework that referenced this issue Aug 20, 2021
This commit ensures handling is cancelled in case of onError/Timeout
callback from the Servlet container.

Separately we detect the same in ServletServerHttpRequest and
ServletServerHttpResponse, which signal onError to the read publisher
and cancel writing, but if the onError/Timeout arrives after reading
is done and before writing has started (e.g. longer handling), then
neither will reach handling.

See spring-projectsgh-26434, spring-projectsgh-26407
lxbzmy pushed a commit to lxbzmy/spring-framework that referenced this issue Mar 26, 2022
This commit ensures handling is cancelled in case of onError/Timeout
callback from the Servlet container.

Separately we detect the same in ServletServerHttpRequest and
ServletServerHttpResponse, which signal onError to the read publisher
and cancel writing, but if the onError/Timeout arrives after reading
is done and before writing has started (e.g. longer handling), then
neither will reach handling.

See spring-projectsgh-26434, spring-projectsgh-26407