stuck tomcat http threads #1328

Open
maykov opened this issue Mar 14, 2023 · 18 comments

maykov commented Mar 14, 2023

CometD version(s)
5.0.14

Java version & vendor (use: java -version)
openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment Homebrew (build 11.0.15+0)
OpenJDK 64-Bit Server VM Homebrew (build 11.0.15+0, mixed mode)

Question
Hello!

When I run the same number of WebSocket connections as there are Tomcat threads, all threads get stuck in WebSocketEndPoint::onMessage, waiting in completable.get(). Any ideas on what I could be doing wrong? When I run fewer simultaneous connections, the threads do not get stuck.

sbordet (Member) commented Mar 14, 2023

Probably Tomcat would need one more thread to run a task that would unblock a blocked connection, but there are no threads available, so it's blocked.

Jetty does not suffer from this problem; we encountered it in the past and fixed it a long time ago.

maykov (Author) commented Mar 14, 2023

What would be a good workaround for this? Somehow reserve one extra thread?

sbordet (Member) commented Mar 14, 2023

You can use Jetty.

Configuring the thread pool with more threads works until you have more connections, at which point you are locked up again.

maykov (Author) commented Mar 14, 2023

Do you mean use the Jetty server instead of the Tomcat server? That might be hard, as my organization is pretty set on using Tomcat. Is there any other option?

sbordet (Member) commented Mar 14, 2023

> Do you mean use the Jetty server instead of the Tomcat server?

Yes.

> Is there any other option?

Not that I know of to get rid of the problem, if it is what I suspect.

You may reduce the probability of occurrence by tuning the thread pool, but you can always end up in this problem: imagine all connections sending a message at the same time; then all threads will be blocked in the send, leaving no thread to wake them up.
This case may seem rare, but if you have a global chat room it may happen quite often, every time a chat message needs to be delivered to all connections.
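
To make this failure mode concrete, here is a minimal, self-contained Java sketch (illustrative only: it is not CometD or Tomcat code, and the pool size is arbitrary) of the starvation described above: every pool thread blocks on a completion that can only ever run on the same, already-exhausted pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolStarvationDemo {
    public static void main(String[] args) throws Exception {
        int poolSize = 4; // stands in for the Tomcat worker thread count
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);

        // One task per "connection": each blocks on a future whose completion
        // is itself a task that needs a thread from the same pool.
        for (int i = 0; i < poolSize; i++) {
            pool.submit(() -> {
                CompletableFuture<Void> write = new CompletableFuture<>();
                pool.submit(() -> write.complete(null)); // queued, never runs
                return write.get(); // blocks forever once the pool is full
            });
        }

        Thread.sleep(2000);
        System.out.println("All " + poolSize + " threads are blocked; the pool is starved.");
        pool.shutdownNow();
    }
}
```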

Have you asked the Tomcat project?

maykov (Author) commented Mar 14, 2023

I will ask the Tomcat project as the next step, then. Thank you so much for your help!

maykov (Author) commented Mar 15, 2023

Do you have a pointer to the Jetty fix you had to do?

Do you know if this is an issue specific to CometD 5, or was it present in CometD 3?

sbordet (Member) commented Mar 15, 2023

> Do you have a pointer to the Jetty fix you had to do?

No, it was a series of changes to the Jetty threading model that occurred over time.

> Do you know if this is an issue specific to CometD 5, or was it present in CometD 3?

CometD 3 was buggy in this respect: it exited the onMessage() method too early, possibly allowing out-of-order message processing and providing no backpressure to clients -- not good.

maykov (Author) commented Mar 16, 2023

I debugged more and now I understand what you meant. The culprit is this piece of code:

```java
_wsSession.getAsyncRemote().sendText(data, result -> {
    Throwable failure = result.getException();
    if (failure == null) {
        callback.succeeded();
    } else {
        callback.failed(failure);
    }
});
```

The result callback will never get called if all Tomcat threads are busy processing incoming messages, i.e. waiting on promise.get().

So I don't understand how Jetty addresses this problem. The only possible solution is to have sendText() call the lambda synchronously, i.e. before returning; otherwise a thread is required to call the lambda, and in a situation where all threads are busy waiting on promise.get() there are no threads available. Another possible solution might be one that never calls Promise.get() and instead implements an asynchronous endpoint interface, though I don't know whether such an interface exists.

I'm thinking about doing a private fix by replacing getAsyncRemote() with getBasicRemote(). Do you think it would work?

sbordet (Member) commented Mar 16, 2023

The analysis is correct, but it is not the only cause.
Anywhere in CometD where asynchronous processing can happen you may have the same problem (not only in async writes).

> So I don't understand how Jetty addresses this problem.

The Jetty threading model never ends up with no threads available to perform critical operations such as unblocking a write or similar.
We blogged about it in the past; see https://webtide.com/eat-what-you-kill-without-starvation/.
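
For readers who skip the blog post, the underlying idea can be sketched roughly as follows. This is a conceptual illustration only, not Jetty's actual ReservedThreadExecutor or its API: the tasks that unblock waiting writers run on threads that are never handed out for application work, so they cannot be starved.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Conceptual sketch only -- not Jetty's implementation. The point is that
// callbacks which unblock waiting writers always have a thread to run on,
// because that thread is never borrowed for application work.
public class ReservedUnblocker {
    private final ExecutorService workers = Executors.newFixedThreadPool(8);
    private final ExecutorService reserved = Executors.newSingleThreadExecutor();

    // Application work (e.g. message handling) competes for the worker pool.
    public void handleMessage(Runnable task) {
        workers.submit(task);
    }

    // Write-completion callbacks go to the reserved thread, which is kept out
    // of the worker pool and is therefore always available.
    public void completeWrite(Runnable completion) {
        reserved.submit(completion);
    }
}
```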

> I'm thinking about doing a private fix by replacing getAsyncRemote() with getBasicRemote(). Do you think it would work?

It will not work.
If the async write does not return immediately, changing it to a blocking write won't solve the issue, since the blocking write may never return either, so you are still locked up.
It may work if Tomcat uses a fully blocking Socket API, but if you switch to that configuration you lose scalability (all the CometD long polls and such will block for a long time).
You already end up blocking all your threads, and that would only get worse with blocking Socket APIs.

maykov (Author) commented Mar 17, 2023

I opened a bug for Tomcat on this: https://bz.apache.org/bugzilla/show_bug.cgi?id=66531

maykov (Author) commented Mar 17, 2023

I've been thinking more about this. What are the semantics of holding onMessage() until all tasks are completed? Why should we make the client sit around while we're fanning out 10000 responses? Why does the client care whether we wrote bytes into 10000 sockets and a few of them failed? Tomcat is a servlet container, and we should release Tomcat threads as soon as possible.

Is anything bad going to happen if we remove the .get() call from onMessage()?

sbordet (Member) commented Mar 17, 2023

The server must apply backpressure to the client because otherwise the client can easily flood the server with messages and the server will not be able to process them.

The fundamental problem is that of ordering: releasing backpressure too early will cause message processing on the server to be completely out of order, and that is really bad.

Other schemes that maintain order but not backpressure are subject to queueing, and it would be easy for any client to cause an enormous accumulation of messages that will blow up the server.
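
To illustrate the trade-off being described (a generic sketch, not CometD's implementation; the SessionInbox class and its capacity are made up): a bounded, single-consumer queue preserves arrival order and pushes back on the producer, whereas an unbounded queue would preserve order but let a fast client accumulate messages without limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class SessionInbox {
    // Bounded: when full, put() blocks the reader of the connection,
    // which is exactly the backpressure being discussed.
    private final BlockingQueue<String> inbox = new ArrayBlockingQueue<>(16);

    // Called when a message arrives from the client.
    void offerFromClient(String message) throws InterruptedException {
        inbox.put(message);
    }

    // Drained by a single consumer, so processing order matches arrival order.
    String takeForProcessing() throws InterruptedException {
        return inbox.take();
    }
}
```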

maykov (Author) commented Mar 18, 2023

This makes sense. Having out-of-order message processing on the server would not be good.

Did CometD 3 have this issue and did it get fixed in CometD 4?

My team recently upgraded from v3 to v5, and we discovered this issue. Convincing the entire company to move from Tomcat to Jetty is off the table for now. We have two options:

  • go back to v3 without a clear path forward
  • "unfix" v5 by removing backpressure, basically trading the potential for a complete server lockup for the possibility of messages being processed out of order

Neither option is good, but I would choose out-of-order messages over a server lockup that requires a container restart.

How hard would it be to go back to the v3 behavior? Does removing promise.get() achieve this?

BTW, we reproduced the issue using a Spring app. It's available here if you're interested in the repro: https://github.com/maykov/spring-cometd-websocket-test

sbordet (Member) commented Mar 19, 2023

> Did CometD 3 have this issue and did it get fixed in CometD 4?

Possibly, yes. CometD 3's APIs were synchronous (e.g. boolean return types from certain APIs), but the processing could always end up in a non-blocking write that might not complete synchronously.
In that case, a further message would have been processed and queued, so the order would have been kept, but lacking backpressure the server was exposed to unbounded queueing.

In CometD 4, the CometD APIs were made asynchronous, so processing order became an issue, and we took the chance to fix both processing order and unbounded queueing.

Other CometD users use Tomcat in general but Jetty for CometD, so a mixed setup would not be strange.

Another option would be to guarantee that you never consume all of the threads in the thread pool.
You would have to write some code to coordinate the number of incoming connections with the threads available in the pool.
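
A rough sketch of that suggestion, with made-up numbers and a made-up hook point (this is not CometD or Tomcat configuration): admit new WebSocket sessions only while their count stays safely below the worker thread count, so a few threads always remain free to run completion callbacks.

```java
import java.util.concurrent.Semaphore;

class ConnectionBudget {
    // Example only: Tomcat maxThreads = 200, keep 20 threads in reserve.
    private static final Semaphore PERMITS = new Semaphore(200 - 20);

    // Call from the endpoint's onOpen(); reject or close the session if false.
    static boolean tryAdmit() {
        return PERMITS.tryAcquire();
    }

    // Call from onClose()/onError() so the permit is returned.
    static void release() {
        PERMITS.release();
    }
}
```

Closing rejected sessions immediately is crude, but it keeps the container out of the lockup described earlier in the thread.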

boris-petrov commented Mar 21, 2023

@maykov - does this Tomcat issue seem similar to your problem? I've had that issue for a very, very long time with Tomcat + CometD. However, with the last couple of Tomcat versions (9.0.70+ let's say) it seems to have disappeared. Which Tomcat version are you using?

P.S. I opened a CometD issue before opening one for Tomcat.

maykov (Author) commented May 9, 2023

@boris-petrov yes, this looks very similar.

My issue is connected to Tomcat running out of threads. If I were able to increase the number of threads to some large number (1000), this would not be a problem. In your case, what is the number of Tomcat threads? Are you able to increase it?

boris-petrov commented

I haven't changed the default. When I saw the issue, I just downgraded CometD to version 3 and used that for a while. Now, with the problem seemingly gone, I'm using CometD v6. With the latest Tomcat version, 9.0.74, there is a new issue, though (hopefully resolved); I'm waiting for 9.0.75 to come out to try it.
