
Nginx not closing persistent WebSocket connections of unavailable/unhealthy upstream server #1063

Open
rameshmurthy opened this issue Jul 28, 2021 · 7 comments


@rameshmurthy

Hello CometD Community,
This is not a problem with CometD itself.

I am reaching out for help with a problem we are currently facing with Nginx (used as a load balancer and configured to support WebSockets).

Problem statement: we are looking for a configuration or setup that makes Nginx sever the persistent WebSocket connections to a server that has failed its health check.

In our infrastructure, Nginx is used as a load balancer, and we have configured it to support WebSockets.
We use Nginx to perform health checks on the servers; if a server fails its health check, Nginx stops routing traffic to it. This works well for HTTP traffic: new HTTP requests are routed to a healthy server, and likewise new WebSocket connections. But Nginx does not close the persistent connections it has already established for WebSocket traffic, leaving the existing WebSocket connections sending traffic to the unhealthy node. We reached out to Nginx for help and are waiting to hear back.
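For context, a typical Nginx setup for this scenario looks roughly like the sketch below; the upstream names, ports, and timings are placeholders, not our actual configuration. Once the Upgrade handshake succeeds, Nginx tunnels the WebSocket bytes and does not re-evaluate upstream health for that established connection, which is the behavior described above.

```nginx
upstream cometd_backend {
    # Placeholder hosts/ports; passive health marking via max_fails.
    server app1.example.com:8080 max_fails=3 fail_timeout=24s;
    server app2.example.com:8080 max_fails=3 fail_timeout=24s;
}

server {
    listen 80;

    location /cometd {
        proxy_pass http://cometd_backend;
        proxy_http_version 1.1;
        # Required for the WebSocket Upgrade handshake.
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # Keep long-lived WebSocket connections open.
        proxy_read_timeout 3600s;
    }
}
```

New requests that arrive after a backend is marked unhealthy are routed elsewhere, but connections already upgraded to WebSocket stay tunneled to the old backend.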

I want to ask the CometD community whether you have faced any issues with load balancers and WebSockets with respect to marking a server in the pool as unhealthy, even though the server is up, running, and responding to CometD heartbeat messages (/meta/connect).

Thanks in advance.

@sbordet
Member

sbordet commented Jul 29, 2021

I don't have experience with nginx configured for this setup; have you tried HAProxy?

@rameshmurthy
Author

Excuse me for the late response. I didn't get a chance to try HAProxy, and I didn't find any documentation stating that HAProxy closes persistent connections once a node fails its health check. What I learned from my research is that new traffic/connections will not be sent to an unhealthy node.
The blog below has some good information about it:
https://aws.amazon.com/blogs/compute/using-websockets-and-load-balancers-part-two/

At this moment, I am trying to handle this at the application level. Our application knows when we make it unavailable for traffic; at that point it deliberately fails the health check so that the load balancer marks it as unhealthy.

We were thinking of triggering a code path to drain the connections and make the clients retry, so that the load balancer routes the new requests to healthy nodes. We tried the following approaches; approach 2 is working the way I want, but I am not sure whether it is the best way to drain CometD sessions on a node.

  1. Disconnect all active sessions using the disconnect method on ServerSession. This didn't work; I think it sends a /meta/disconnect message to the client, and the client then stops retrying.

```java
@Override
public void disconnect() {
    boolean connected = _bayeux.removeServerSession(this, false);
    if (connected) {
        ServerMessage.Mutable message = _bayeux.newMessage();
        message.setSuccessful(true);
        message.setChannel(Channel.META_DISCONNECT);
        deliver(this, message);
        flush();
    }
}
```
  2. Wrote an extension at the server to suppress the response to /meta/connect, so that the client thinks the server is not available/responding to its heartbeat and retries after the timeout (in our environment the heartbeat interval is 30 seconds and the timeout is 10 seconds). When the client retries, the load balancer routes it to a healthy node. The silenced server eventually drains all its sessions, as it no longer receives /meta/connect from the clients. This seems to work, but I am not sure whether it is the best approach to drain CometD sessions on a node.
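The suppression decision in approach 2 can be isolated into a small testable class. This is only a sketch: in CometD, a server-side extension can suppress an outgoing meta message by returning false from its sendMeta(to, message) callback; the class and method names below (DrainPolicy, shouldSend) are hypothetical stand-ins so the logic runs without the CometD jars, not CometD API.

```java
// Hypothetical sketch of the drain decision from approach 2.
// In CometD, an extension returning false from sendMeta(to, message)
// drops the outgoing meta message; this class isolates that boolean.
class DrainPolicy {
    // Value of Channel.META_CONNECT in the real CometD API.
    static final String META_CONNECT = "/meta/connect";

    private volatile boolean draining;

    void setDraining(boolean draining) {
        this.draining = draining;
    }

    // The value to return from sendMeta(): false drops the /meta/connect
    // reply while draining, so the client times out on its heartbeat and
    // re-handshakes through the load balancer to a healthy node.
    boolean shouldSend(String channel) {
        return !(draining && META_CONNECT.equals(channel));
    }
}
```

Other meta messages (handshake, subscribe) keep flowing; only the heartbeat reply is withheld while draining.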

Thanks

@sbordet
Member

sbordet commented Aug 13, 2021

If you fail the health check from the application, meaning that there is no real problem (e.g. OS crash or network down), then naturally the WebSocket connections will stay open.

This means that if you fail the health check from the application, you should disconnect the clients using an application message.

What I would do would be the following:

  • Change state on the server, say "shutdown=true". This will fail the health check from the load balancer, but WebSocket connections are still open.
  • Send an application message to a broadcast non-clustered channel, say /shutdown.
  • All the clients connected to the failing server will receive the message on /shutdown and do: disconnect(r => handshake()). The disconnect() message will arrive at the failing server, which will reply; the subsequent handshake() triggered by the disconnect reply will create a new WebSocket connection, which the load balancer will route to a sane server.

That should be enough. Note that the ServerSession on the failing server will be lost, so if it stores some state in memory, it will be lost. If you want some clustered state, you have to arrange it in your application, as CometD does not do it.

I think this solution is simpler than your solution 2 above; it's much faster, as the disconnect+handshake is almost immediate (no need to wait 30+10 seconds), and it's fully under your control, because you use an application message on /shutdown in which you can even pass additional information (e.g. a message for the user).
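The client-side step of this recipe, disconnect first and handshake only after the disconnect reply, can be sketched as below. The Client interface here is a minimal stand-in used only to show the ordering; it is not the CometD client API.

```java
// Stand-in for the CometD client, reduced to the two calls the recipe needs.
interface Client {
    void disconnect(Runnable onReply); // onReply runs when the server replies
    void handshake();                  // opens a new connection via the LB
}

class ShutdownListener {
    // Invoke this from the subscription callback on the /shutdown channel:
    // the new handshake starts only after the old session is disconnected,
    // so the load balancer routes the new WebSocket to a healthy node.
    static void onShutdown(Client client) {
        client.disconnect(client::handshake);
    }
}
```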

@rameshmurthy
Author

rameshmurthy commented Aug 18, 2021

My understanding is that whenever a client stops sending heartbeats, the server cleans up the session associated with that client. I am assuming the WebSocket connection will be closed too.

With my approach I am simulating a long network failure to remove the session from the server: the client won't receive a /meta/connect response from the server, and it will wait 30 seconds plus the 10-second timeout before making a new request to the server.

Frankly, the idea of the server holding WebSocket connections for clients that are no longer connected scares me.
I was under the impression that the /meta/connect heartbeat lets the client and server determine which party is unavailable, so each can take the necessary actions to clean up the connection/session. If that is not happening, then I have to seriously consider your approach, and also worry about clients (browsers) that may keep a page open forever (I do that all the time; I don't close my browser) and then close the lid/lock the machine, which simulates a network disconnection (the client won't send /meta/disconnect to the server).

I like your approach and I did think about it.

In our environment, even though the application node knows when it should start failing health checks, the load balancer is configured to consider a server unavailable/unhealthy only after it fails three health checks. Each health check runs at an interval of 8 seconds, so the load balancer needs 24 seconds to determine that an application node is unhealthy. Even if the server sends a shutdown message over a channel immediately after the node becomes unavailable for traffic, the load balancer may still route traffic to the same application node (sticky sessions).

I made a further optimization to my code by silencing only connections on the WebSocket transport. For long-polling, the connection moves to a healthy node, as those are not persistent connections from the load balancer's point of view.

As I go through the code, I see that ServerSessionImpl.cancelSchedule seems to do the same thing: it cancels the scheduler so that the client gets no response to /meta/connect. By the way, we are using version 3.1 of CometD.

cancelSchedule is used in the destroy method of CometDServlet.

Thanks

@sbordet
Member

sbordet commented Aug 18, 2021

> My understanding is that whenever a client stops sending heartbeats, the server cleans up the session associated with that client. I am assuming the WebSocket connection will be closed too.

That's correct. However, my suggestion is that you should not piggyback the failover from one server to another on a failure-and-cleanup mechanism.
It is much better if you have full control over the failover.

> Frankly, the idea of the server holding WebSocket connections for clients that are no longer connected scares me.

The server does not do that; I don't know where you got that impression.

> In our environment, even though the application node knows when it should start failing health checks, the load balancer is configured to consider a server unavailable/unhealthy only after it fails three health checks.

Well then my solution is trivial to implement: wait for the 3 failures, then send the application shutdown message. Immediate failover, and you are done.
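Combining this with the load balancer timings mentioned earlier (3 failed checks at 8-second intervals), the trigger can be sketched as a plain counter. The class name and callback below are hypothetical; the Runnable stands in for whatever code broadcasts on /shutdown.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: count consecutive failed health-check probes and run
// a callback (e.g. broadcasting on /shutdown) once the load balancer must
// have marked the node unhealthy (here, 3 checks at 8s intervals = 24s).
class ShutdownTrigger {
    private final int threshold;
    private final Runnable onUnhealthy; // e.g. publish on /shutdown
    private final AtomicInteger failures = new AtomicInteger();
    private volatile boolean fired;

    ShutdownTrigger(int threshold, Runnable onUnhealthy) {
        this.threshold = threshold;
        this.onUnhealthy = onUnhealthy;
    }

    // Call each time the application answers a health-check probe with a
    // failure while it is draining.
    void recordFailedProbe() {
        if (!fired && failures.incrementAndGet() >= threshold) {
            fired = true;
            onUnhealthy.run();
        }
    }

    boolean hasFired() {
        return fired;
    }
}
```

This way the /shutdown broadcast goes out only after the load balancer has stopped sending new traffic to the node, avoiding the sticky-session race described above.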

> For long-polling, the connection moves to a healthy node, as those are not persistent connections from the load balancer's point of view.

I really hope this is not true, and that you have persistent connections between the load balancer and the backend server.
I have seen servers brought to their knees by this configuration; even if it's working for you, it is highly inefficient.

I also recommend that you don't dig too deep into the CometD code to hook into internal methods, because you will have a really hard time when you need to upgrade CometD.

I think it is possible for you to stick with the official, standard, CometD APIs to implement failover in a fully controlled and fast way.

> By the way, we are using version 3.1 of CometD.

Time to upgrade to CometD 5.0.x, see the versions table in this section of the documentation: https://docs.cometd.org/current5/reference/#_preface.

@rameshmurthy
Author

rameshmurthy commented Aug 19, 2021

> The server does not do that; I don't know where you got that impression.

Maybe I misunderstood your previous response. If I make the server not respond to /meta/connect, the client will wait 30 seconds (heartbeat) + the 10-second timeout, then send a /meta/connect heartbeat message on a new connection. In our environment we clean up a client session 35 seconds after the client last sent a /meta/connect message to the server. In a way, the server will think the client is not active, even though the server itself made the client inactive by not responding to /meta/connect.

> If you fail the health check from the application, meaning that there is no real problem (e.g. OS crash or network down), then naturally the WebSocket connections will stay open.
>
> This means that if you fail the health check from the application, you should disconnect the clients using an application message.

> I really hope this is not true, and that you have persistent connections between the load balancer and the backend server.

For HTTP traffic, Nginx is configured to maintain a pool of persistent connections between the LB and the backend servers; these connections are shared among the incoming requests (multiplexing). For WebSocket, there is a persistent connection from client to server. For HTTP, there is a persistent connection from the LB to the backend servers, but not from client to LB. As long-polling runs over HTTP, the client makes a request and, once there is a response, makes a new request to the server through the LB. Because there is a sequence of requests and responses over HTTP, the load balancer gets an opportunity to route to a healthy node. With WebSockets, because the connection is persistent, Nginx does not close it when an application node fails its health check; only a new connection/request can be routed to a healthy node. So this problem does not apply to HTTP traffic (long-polling).
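The LB-to-backend connection pooling described here corresponds to the standard Nginx upstream keepalive setup; a hedged sketch (placeholder names and sizes) looks like:

```nginx
upstream cometd_backend {
    server app1.example.com:8080;
    # Pool of idle keep-alive connections from Nginx to each backend,
    # reused in sequence across incoming HTTP (long-polling) requests.
    keepalive 32;
}

server {
    location /cometd {
        proxy_pass http://cometd_backend;
        proxy_http_version 1.1;
        # Clear the Connection header so pooled connections can be reused.
        proxy_set_header Connection "";
    }
}
```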

We made a lot of customizations on top of CometD, so upgrading to a newer version will come at some cost. At the moment this is not prioritized; I have to make a case to my management about cost vs. benefits. Is there a better way to figure out what new features were added in 5 vs. 3 or 4? That will be a separate ask for sure.

Ideally I want to do what you have proposed, and I will try a quick spike. I am considering the development cost, as it needs some client-side code changes (the client may not understand the new shutdown message). One more challenge is that the app node has no knowledge of whether the LB has marked it as having failed a health check, as that information is not available to the app node. I will give it a shot.

Thank you once again for the support.

@gkannan66235

@rameshmurthy did you get any solution from nginx support?
