
WebSocket transport improvements #1585

Open
sbordet opened this issue Dec 20, 2023 · 6 comments

sbordet (Member) commented Dec 20, 2023

CometD version(s)
6+

Description
Jetty issue jetty/jetty.project#11081 impacts CometD as well.

Following up here on the discussion from that issue.

sbordet (Member, Author) commented Dec 20, 2023

@darnap, a few follow-up questions below.

Collect data transfer statistics: actual bytes on the wire and time from enqueuing to sending, to detect client/network congestion.

This could be interesting to add to CometD.
However, I guess this will always be the number of uncompressed bytes, so if you have the permessage-deflate extension you won't really see the wire bytes.

Prioritize Reply messages to avoid cases in which we had client-side timeouts due to large message payloads.

Can you detail this?
CometD explicitly sends the /meta/connect reply after the messages, otherwise the session may be expired at the server side.

Apply back-pressure on the server side when we detect that the client is congested, using writeComplete to keep track of messages waiting to be sent. (We find this is more efficient than the ACK extension, as it avoids an extra RTT and we keep fewer pending messages in memory).

How do you apply backpressure to the server?
I ask because there may be a risk that one slow client blocks all the others, if not done correctly.

Thanks!

darnap commented Dec 21, 2023

@darnap, a few follow-up questions below.

Collect data transfer statistics: actual bytes on the wire and time from enqueuing to sending, to detect client/network congestion.

This could be interesting to add to CometD. However, I guess this will always be the number of uncompressed bytes, so if you have the permessage-deflate extension you won't really see the wire bytes.

@sbordet
Yes, of course this would not account for compression. We found it useful nonetheless to measure how much data each client was requesting and processing.
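
For illustration, a minimal sketch of what per-session transfer statistics of this kind could look like. The class and its onEnqueued()/onWriteComplete() hooks are hypothetical: they would have to be wired into the points where the transport enqueues a message and where it receives the write completion callback. As noted above, the byte count is of uncompressed payloads, so permessage-deflate savings are not visible.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-session transfer statistics: uncompressed bytes written
// and the worst-case delay between enqueuing a message and completing its write.
public class SessionTransferStats {
    private final Map<String, Long> enqueueTimes = new ConcurrentHashMap<>();
    private final AtomicLong bytesSent = new AtomicLong();
    private final AtomicLong maxSendLatencyMillis = new AtomicLong();

    // Called when a message is queued for this session.
    public void onEnqueued(String messageId) {
        enqueueTimes.put(messageId, System.nanoTime());
    }

    // Called from the transport's write completion callback.
    public void onWriteComplete(String messageId, int payloadLength) {
        Long enqueued = enqueueTimes.remove(messageId);
        if (enqueued != null) {
            long latencyMillis = (System.nanoTime() - enqueued) / 1_000_000;
            maxSendLatencyMillis.accumulateAndGet(latencyMillis, Math::max);
        }
        bytesSent.addAndGet(payloadLength);
    }

    public long bytesSent() {
        return bytesSent.get();
    }

    public long maxSendLatencyMillis() {
        return maxSendLatencyMillis.get();
    }
}
```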

Prioritize Reply messages to avoid cases in which we had client-side timeouts due to large message payloads.

Can you detail this? CometD explicitly sends the /meta/connect reply after the messages, otherwise the session may be expired at the server side.

On the client side we have strict timeouts on service channel requests, so we need to send a service reply immediately. There may be many outgoing messages to send to the client that could introduce excessive delay before replies can be sent, so we changed the priority. Furthermore, messages can trigger heavy processing on the client, which may cause service timeouts to expire before the client can get to the replies.

Apply back-pressure on the server side when we detect that the client is congested, using writeComplete to keep track of messages waiting to be sent. (We find this is more efficient than the ACK extension, as it avoids an extra RTT and we keep fewer pending messages in memory).

How do you apply backpressure to the server? I ask because there may be a risk that one slow client blocks all the others, if not done correctly.

The back-pressure signal is sent to the application, which then enacts various strategies to reduce the message load so that fewer messages are delivered to the CometD session (for example, reducing update frequencies or notifying upstream applications of the congestion). We trigger the back-pressure condition based on the number and size of queued messages for which we have not yet received a writeComplete notification.
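
A minimal sketch of the congestion detection described above, under stated assumptions: the thresholds, the callbacks into the application, and the onQueued()/onWritten() hook names are all hypothetical; in practice onWritten() would be driven by the writeComplete notification.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-session congestion detector: tracks messages that have been
// queued but whose write has not yet completed, and signals the application
// when configurable thresholds are crossed.
public class CongestionDetector {
    private static final int MAX_PENDING_MESSAGES = 1_000;
    private static final long MAX_PENDING_BYTES = 4 * 1024 * 1024;

    private final AtomicInteger pendingMessages = new AtomicInteger();
    private final AtomicLong pendingBytes = new AtomicLong();
    private final AtomicBoolean congested = new AtomicBoolean();
    private final Runnable onCongested;
    private final Runnable onRelieved;

    public CongestionDetector(Runnable onCongested, Runnable onRelieved) {
        this.onCongested = onCongested;
        this.onRelieved = onRelieved;
    }

    // Called when a message is queued for the session.
    public void onQueued(int length) {
        int messages = pendingMessages.incrementAndGet();
        long bytes = pendingBytes.addAndGet(length);
        if ((messages > MAX_PENDING_MESSAGES || bytes > MAX_PENDING_BYTES)
                && congested.compareAndSet(false, true)) {
            onCongested.run(); // e.g. reduce update frequency, notify upstream
        }
    }

    // Called when the write for a queued message completes (writeComplete).
    public void onWritten(int length) {
        int messages = pendingMessages.decrementAndGet();
        long bytes = pendingBytes.addAndGet(-length);
        if (messages < MAX_PENDING_MESSAGES / 2 && bytes < MAX_PENDING_BYTES / 2
                && congested.compareAndSet(true, false)) {
            onRelieved.run(); // congestion resolved, resume normal message load
        }
    }
}
```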

Thanks

sbordet (Member, Author) commented Dec 21, 2023

On the client side we have strict timeouts on service channel requests, so we need to send a service reply immediately. There may be many outgoing messages to send to the client that could introduce excessive delay before replies can be sent, so we changed the priority. Furthermore, messages can trigger heavy processing on the client, which may cause service timeouts to expire before the client can get to the replies.

So these reply messages are application replies, not CometD replies, right?

Marco85Li commented

@sbordet

On the client side we have strict timeouts on service channel requests, so we need to send a service reply immediately. There may be many outgoing messages to send to the client that could introduce excessive delay before replies can be sent, so we changed the priority. Furthermore, messages can trigger heavy processing on the client, which may cause service timeouts to expire before the client can get to the replies.

So these reply messages are application replies, not CometD replies, right?

They are CometD replies. When a client publishes a message on a service channel (handled by an implementation of org.cometd.server.AbstractService), the CometD reply is enqueued when the invoked method returns.

The current org.cometd.server.websocket.common.AbstractWebSocketEndPoint.Flusher serializes the CometD replies after the messages. Our client expects to receive the publish callback within a limited timeout, so we swapped the order of the REPLIES and MESSAGES states.
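
For context, a rough sketch of the service-channel pattern described above; the channel name and class are made up. The point is that the CometD reply to the client's publish is enqueued only when processOrders() returns, after any messages the method itself delivers.

```java
import org.cometd.bayeux.Promise;
import org.cometd.bayeux.server.BayeuxServer;
import org.cometd.bayeux.server.ServerMessage;
import org.cometd.bayeux.server.ServerSession;
import org.cometd.server.AbstractService;

// Hypothetical service bound to /service/orders.
public class OrdersService extends AbstractService {
    public OrdersService(BayeuxServer bayeux) {
        super(bayeux, "orders");
        addService("/service/orders", "processOrders");
    }

    public void processOrders(ServerSession remote, ServerMessage message) {
        // Application processing; a large result delivered here is queued
        // before the CometD reply to the publish, which is enqueued only
        // when this method returns.
        Object result = handle(message.getData());
        remote.deliver(getServerSession(), "/service/orders", result, Promise.noop());
    }

    private Object handle(Object data) {
        return data; // placeholder for the real application logic
    }
}
```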

CometD explicitly sends the /meta/connect reply after the messages, otherwise the session may be expired at the server side.

Can you please clarify why the session can expire on the server if replies (including the /meta/connect reply) are sent before the messages? What's the sequence of events that leads to expiration?

Thanks!

sbordet (Member, Author) commented Dec 22, 2023

The current org.cometd.server.websocket.common.AbstractWebSocketEndPoint.Flusher serializes the CometD replies after the messages. Our client expects to receive the publish callback within a limited timeout, so we swapped the order of the REPLIES and MESSAGES states.

You should not do that, as it opens the door to horrible races.

Can you please clarify why the session can expire on the server if replies (including the /meta/connect reply) are sent before the messages? What's the sequence of events that leads to expiration?

The CometD reply and some messages may be sent over the network, but then TCP congestion or an HTTP/2 flow control stall may happen.
At that point the server stalls the processing for that session.

Meanwhile, the client has received the reply and thinks it can send further messages to the server.
However, the server won't process those messages because it has not finished the previous processing.
This means that a /meta/connect message from the client won't be processed, and the session will expire on the server side but not on the client side (or it will expire later, causing confusion about which message was not processed).

To make things worse, the new /meta/connect might be issued on a different connection, arriving at the server while a previous /meta/connect is still being processed, which causes the previous one to be cancelled, with associated loss of messages (unless the ack extension is enabled).

Not to mention that if the TCP congestion resolves at exactly the same time the new /meta/connect arrives, you now have race conditions on the server, with messages likely being delivered out of order.

In short, do not do that.

We saw these problems a long time ago and have since fixed them, carefully writing CometD exactly to avoid these nasty, horrible-to-troubleshoot problems, so you should not change the CometD protocol.

May I suggest that you open a new issue (this one is more about the Jetty issue that caused the loss of messages), detail exactly what your problem is, and start a discussion on that new issue?
We can brainstorm the problem better and perhaps find an acceptable solution that does not break the CometD protocol, from which the community will benefit.

darnap commented Dec 22, 2023

The change was initially required to deal with client requests that would immediately (i.e. in-stack with the notification) generate large response messages that took longer to send than the client-side maxNetworkDelay setting in CometD allowed, so the client considered the request failed, since the CometD 'reply' would get enqueued after the application response.
We agree that this is not the proper way to deal with the problem, so we will review the application code to handle this without messing with the CometD protocol.
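
As one application-side alternative to reordering replies, the client-side maxNetworkDelay mentioned above can be raised. A hedged sketch for the CometD Java client follows; the transport class name assumes the CometD 6 module layout, and the same option applies to the WebSocket client transport.

```java
import java.util.HashMap;
import java.util.Map;
import org.cometd.client.BayeuxClient;
import org.cometd.client.http.jetty.JettyHttpClientTransport;
import org.cometd.client.transport.ClientTransport;
import org.eclipse.jetty.client.HttpClient;

public class ClientConfig {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = new HttpClient();
        httpClient.start();

        // Allow more time for large replies before the client fails the request.
        Map<String, Object> options = new HashMap<>();
        options.put(ClientTransport.MAX_NETWORK_DELAY_OPTION, 30_000L);

        ClientTransport transport = new JettyHttpClientTransport(options, httpClient);
        BayeuxClient client = new BayeuxClient("https://host/cometd", transport);
        client.handshake();
    }
}
```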

We'll create a separate issue to discuss other possible improvements that could be generally useful.

Many thanks for discussing this issue with us.
