
WebSocket transport improvements #1585

Open
sbordet opened this issue Dec 20, 2023 · 6 comments

sbordet (Member) commented Dec 20, 2023

CometD version(s)
6+

Description
Jetty issue jetty/jetty.project#11081 impacts CometD as well.

Following up here on the discussion from that issue.

sbordet (Member, Author) commented Dec 20, 2023

@darnap, a few follow-up questions below.

Collect data transfer statistics: actual bytes on the wire and time from enqueuing to sending, to detect client/network congestion.

This could be interesting to add to CometD.
However, I guess this will always be the number of uncompressed bytes, so if you have the permessage-deflate extension you won't really see the wire bytes.

Prioritize Reply messages to avoid cases in which we had client-side timeouts due to large message payloads.

Can you detail this?
CometD explicitly sends the /meta/connect reply after the messages, otherwise the session may be expired at the server side.

Apply back-pressure on the server side when we detect that the client is congested, using writeComplete to keep track of messages waiting to be sent. (We find this is more efficient than the ACK extension, as it avoids an extra RTT and we keep fewer pending messages in memory).

How do you apply backpressure to the server?
I ask because there may be a risk that one slow client blocks all the others, if not done correctly.

Thanks!

darnap commented Dec 21, 2023

@darnap, a few follow-up questions below.

Collect data transfer statistics: actual bytes on the wire and time from enqueuing to sending, to detect client/network congestion.

This could be interesting to add to CometD. However, I guess this will always be the number of uncompressed bytes, so if you have the permessage-deflate extension you won't really see the wire bytes.

@sbordet
Yes, of course this would not account for compression. We found it useful nonetheless to measure how much data each client was requesting and processing.
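
For illustration, a minimal sketch of what per-session transfer statistics of this kind could look like. The class and its onEnqueued()/onWriteComplete() hooks are hypothetical: they would have to be wired into the points where the transport enqueues a message and where it receives the write completion callback. As noted above, the byte count is of uncompressed payloads, so permessage-deflate savings are not visible.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-session transfer statistics: uncompressed bytes written
// and the worst-case delay between enqueuing a message and completing its write.
public class SessionTransferStats {
    private final Map<String, Long> enqueueTimes = new ConcurrentHashMap<>();
    private final AtomicLong bytesSent = new AtomicLong();
    private final AtomicLong maxSendLatencyMillis = new AtomicLong();

    // Called when a message is queued for this session.
    public void onEnqueued(String messageId) {
        enqueueTimes.put(messageId, System.nanoTime());
    }

    // Called from the transport's write completion callback.
    public void onWriteComplete(String messageId, int payloadLength) {
        Long enqueued = enqueueTimes.remove(messageId);
        if (enqueued != null) {
            long latencyMillis = (System.nanoTime() - enqueued) / 1_000_000;
            maxSendLatencyMillis.accumulateAndGet(latencyMillis, Math::max);
        }
        bytesSent.addAndGet(payloadLength);
    }

    public long bytesSent() {
        return bytesSent.get();
    }

    public long maxSendLatencyMillis() {
        return maxSendLatencyMillis.get();
    }
}
```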

Prioritize Reply messages to avoid cases in which we had client-side timeouts due to large message payloads.

Can you detail this? CometD explicitly sends the /meta/connect reply after the messages, otherwise the session may be expired at the server side.

On the client side we have strict timeouts on service channel requests, so we need to send a service reply immediately. There may be many outgoing messages to send to the client that could introduce excessive delay before replies can be sent, so we changed the priority. Furthermore, messages can trigger heavy processing on the client, which may cause service timeouts to expire before the client can get to the replies.

Apply back-pressure on the server side when we detect that the client is congested, using writeComplete to keep track of messages waiting to be sent. (We find this is more efficient than the ACK extension, as it avoids an extra RTT and we keep fewer pending messages in memory).

How do you apply backpressure to the server? I ask because there may be a risk that one slow client blocks all the others, if not done correctly.

The back-pressure signal is sent to the application, which then enacts various strategies to reduce the message load so that fewer messages are delivered to the CometD session (for example, reducing update frequencies or notifying upstream applications of the congestion). We trigger the back-pressure condition based on the number and size of queued messages for which we have not yet received a writeComplete notification.
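
A minimal sketch of the congestion detection described above, under stated assumptions: the thresholds, the callbacks into the application, and the onQueued()/onWritten() hook names are all hypothetical; in practice onWritten() would be driven by the writeComplete notification.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-session congestion detector: tracks messages that have been
// queued but whose write has not yet completed, and signals the application
// when configurable thresholds are crossed.
public class CongestionDetector {
    private static final int MAX_PENDING_MESSAGES = 1_000;
    private static final long MAX_PENDING_BYTES = 4 * 1024 * 1024;

    private final AtomicInteger pendingMessages = new AtomicInteger();
    private final AtomicLong pendingBytes = new AtomicLong();
    private final AtomicBoolean congested = new AtomicBoolean();
    private final Runnable onCongested;
    private final Runnable onRelieved;

    public CongestionDetector(Runnable onCongested, Runnable onRelieved) {
        this.onCongested = onCongested;
        this.onRelieved = onRelieved;
    }

    // Called when a message is queued for the session.
    public void onQueued(int length) {
        int messages = pendingMessages.incrementAndGet();
        long bytes = pendingBytes.addAndGet(length);
        if ((messages > MAX_PENDING_MESSAGES || bytes > MAX_PENDING_BYTES)
                && congested.compareAndSet(false, true)) {
            onCongested.run(); // e.g. reduce update frequency, notify upstream
        }
    }

    // Called when the write for a queued message completes (writeComplete).
    public void onWritten(int length) {
        int messages = pendingMessages.decrementAndGet();
        long bytes = pendingBytes.addAndGet(-length);
        if (messages < MAX_PENDING_MESSAGES / 2 && bytes < MAX_PENDING_BYTES / 2
                && congested.compareAndSet(true, false)) {
            onRelieved.run(); // congestion resolved, resume normal message load
        }
    }
}
```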

Thanks

sbordet (Member, Author) commented Dec 21, 2023

On the client side we have strict timeouts on service channel requests, so we need to send a service reply immediately. There may be many outgoing messages to send to the client that could introduce excessive delay before replies can be sent, so we changed the priority. Furthermore, messages can trigger heavy processing on the client, which may cause service timeouts to expire before the client can get to the replies.

So these reply messages are application replies, not CometD replies, right?

Marco85Li commented

@sbordet

On the client side we have strict timeouts on service channel requests, so we need to send a service reply immediately. There may be many outgoing messages to send to the client that could introduce excessive delay before replies can be sent, so we changed the priority. Furthermore, messages can trigger heavy processing on the client, which may cause service timeouts to expire before the client can get to the replies.

So these reply messages are application replies, not CometD replies, right?

They are CometD replies. When a client publishes a message on a service channel (handled by an implementation of org.cometd.server.AbstractService), the CometD reply is enqueued when the invoked method returns.

The current org.cometd.server.websocket.common.AbstractWebSocketEndPoint.Flusher serializes the CometD replies after the messages. Our client expects to receive the publish callback within a limited timeout, so we swapped the order of the REPLIES and MESSAGES states.
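
For context, a rough sketch of the service-channel pattern described above; the channel name and class are made up. The point is that the CometD reply to the client's publish is enqueued only when processOrders() returns, after any messages the method itself delivers.

```java
import org.cometd.bayeux.Promise;
import org.cometd.bayeux.server.BayeuxServer;
import org.cometd.bayeux.server.ServerMessage;
import org.cometd.bayeux.server.ServerSession;
import org.cometd.server.AbstractService;

// Hypothetical service bound to /service/orders.
public class OrdersService extends AbstractService {
    public OrdersService(BayeuxServer bayeux) {
        super(bayeux, "orders");
        addService("/service/orders", "processOrders");
    }

    public void processOrders(ServerSession remote, ServerMessage message) {
        // Application processing; a large result delivered here is queued
        // before the CometD reply to the publish, which is enqueued only
        // when this method returns.
        Object result = handle(message.getData());
        remote.deliver(getServerSession(), "/service/orders", result, Promise.noop());
    }

    private Object handle(Object data) {
        return data; // placeholder for the real application logic
    }
}
```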

CometD explicitly sends the /meta/connect reply after the messages, otherwise the session may be expired at the server side.

Can you please clarify why the session can expire on the server if replies (including the /meta/connect reply) are sent before the messages? What's the sequence of events that leads to expiration?

Thanks!

sbordet (Member, Author) commented Dec 22, 2023

The current org.cometd.server.websocket.common.AbstractWebSocketEndPoint.Flusher serializes the CometD replies after the messages. Our client expects to receive the publish callback within a limited timeout, so we swapped the order of the REPLIES and MESSAGES states.

You should not do that, as it opens the door to horrible races.

Can you please clarify why the session can expire on the server if replies (including the /meta/connect reply) are sent before the messages? What's the sequence of events that leads to expiration?

The CometD reply and some messages may be sent over the network, but then TCP congestion or an HTTP/2 flow control stall may happen.
At that point the server stalls the processing for that session.

Meanwhile, the client has received the reply and thinks it can send further messages to the server.
However, the server won't process those messages because it has not finished the previous processing.
This means that a /meta/connect message from the client won't be processed, and the session will expire on the server side but not on the client side (or it will expire later, causing confusion about which message was not processed).

To make things worse, the new /meta/connect might be issued on a different connection, arriving at the server while a previous /meta/connect is still being processed, which causes the previous one to be cancelled, with associated loss of messages (unless the ack extension is enabled).

Not to mention that if the TCP congestion resolves at exactly the same time the new /meta/connect arrives, you now have race conditions on the server, with messages likely being delivered out of order.

In short, do not do that.

We saw these problems a long time ago and have since fixed them, carefully writing CometD exactly to avoid these nasty, horrible-to-troubleshoot problems, so you should not change the CometD protocol.

May I suggest that you open a new issue (this one is more about the Jetty issue that caused the loss of messages), detail exactly what your problem is, and start a discussion on that new issue?
We can brainstorm the problem better and perhaps find an acceptable solution that does not break the CometD protocol, from which the community will benefit.

darnap commented Dec 22, 2023

The change was initially required to deal with client requests that would immediately (i.e. in-stack with the notification) generate large response messages that took longer to send than the client-side maxNetworkDelay setting in CometD allowed, so the client considered the request failed, since the CometD 'reply' would get enqueued after the application response.
We agree that this is not the proper way to deal with the problem, so we will review the application code to handle this without messing with the CometD protocol.
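
As one application-side alternative to reordering replies, the client-side maxNetworkDelay mentioned above can be raised. A hedged sketch for the CometD Java client follows; the transport class name assumes the CometD 6 module layout, and the same option applies to the WebSocket client transport.

```java
import java.util.HashMap;
import java.util.Map;
import org.cometd.client.BayeuxClient;
import org.cometd.client.http.jetty.JettyHttpClientTransport;
import org.cometd.client.transport.ClientTransport;
import org.eclipse.jetty.client.HttpClient;

public class ClientConfig {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = new HttpClient();
        httpClient.start();

        // Allow more time for large replies before the client fails the request.
        Map<String, Object> options = new HashMap<>();
        options.put(ClientTransport.MAX_NETWORK_DELAY_OPTION, 30_000L);

        ClientTransport transport = new JettyHttpClientTransport(options, httpClient);
        BayeuxClient client = new BayeuxClient("https://host/cometd", transport);
        client.handshake();
    }
}
```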

We'll create a separate issue to discuss other possible improvements that could be generally useful.

Many thanks for discussing this issue with us.
