Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

heartbeat timeout of a session will cause the links of other sessions to be disconnected #461

Closed
d9e7381f opened this issue Jan 30, 2024 · 4 comments · Fixed by #507
Closed
Assignees
Labels
bug An issue describing a bug in the code
Milestone

Comments

@d9e7381f
Copy link

Version

2.6.0

Bug description

In my case, 20 connections were established using the same mina ssh client, 10 to A-server(mina ssh server) and the others to B-server(mina ssh server). For a period of time, the network from client to B-server was not smooth. At this time, we found that the session from client to A-server was also disconnected.
i got some log from A-server:
INFO - sshd-SshServer[150d80c4](port=8822)-timer-thread-1 [] c.s.p.t.s.s.CustomServerSessionImpl []: Disconnecting(CustomServerSessionImpl[classic-tx-beijing-01:hybrid01.classic-tx-beijing-01.8330@/10.53.4.116:12033]): SSH2_DISCONNECT_PROTOCOL_ERROR - Detected IdleTimeout after PT40.068S/PT40S ms.

I checked the mina ssh code and found that the same ssh client shares a thread for sending heartbeats. This will cause one session to send heartbeats and block other sessions from sending heartbeats, causing other sessions to be disconnected.

Actual behavior

In my case, 20 connections were established using the same mina ssh client, 10 to A-server(mina ssh server) and the others to B-server(mina ssh server). For a period of time, the network from client to B-server was not smooth. At this time, we found that the session from client to A-server was also disconnected.
i got some log from A-server:
INFO - sshd-SshServer[150d80c4](port=8822)-timer-thread-1 [] c.s.p.t.s.s.CustomServerSessionImpl []: Disconnecting(CustomServerSessionImpl[classic-tx-beijing-01:hybrid01.classic-tx-beijing-01.8330@/10.53.4.116:12033]): SSH2_DISCONNECT_PROTOCOL_ERROR - Detected IdleTimeout after PT40.068S/PT40S ms.

Expected behavior

When the network from client to B-server is not smooth, the channel from client to A-server should not be disconnected.

Relevant log output

client to B-server log:
WARN - sshd-timer-thread-1 [] c.SshClientCreator$4$1 []: sendHeartBeat(ClientSessionImpl[classic-tx-beijing-01:hybrid01.classic-tx-beijing-01.8330@aa.cloud/10.235.88.64:8800]) failed (IOException) to send heartbeat #7 request=keepalive@tunnel.mock.cn: Session is closed or closing while awaiting reply for request=keepalive@mock.cn
org.apache.sshd.common.SshException: DefaultConnectFuture[classic-tx-beijing-01.cloud:hybrid01.classic-tx-beijing-01.cloud.8329@aa.cloud/10.235.88.64:8800]: Failed to get operation result within specified timeout: 20000


A-server log:
INFO - sshd-SshServer[150d80c4](port=8822)-timer-thread-1 [] c.s.p.t.s.s.CustomServerSessionImpl []: Disconnecting(CustomServerSessionImpl[classic-tx-beijing-01:hybrid01.classic-tx-beijing-01.8330@/10.53.4.116:12033]): SSH2_DISCONNECT_PROTOCOL_ERROR - Detected IdleTimeout after PT40.068S/PT40S ms.

Other information

No response

@gnodet
Copy link
Contributor

gnodet commented Jan 30, 2024

If you have a heartbeat configured, you may want to check its configuration. The properties defined in CommonModuleProperties control the heartbeat: session-connection-heartbeat-type seems to be set to 40s in your case. You may also have a connect timeout set to 20s according to the client error message. You also have a custom heartbeat sender configured (sending keepalive@tunnel.mock.cn).
I would check this custom heartbeat, as it wonder if it does wait for a response, which should not be the case for a heartbeat (they usually just send a message without waiting for a response, just to keep things alive).

@d9e7381f
Copy link
Author

heartbeat code
image
According to the default configuration, the withReply variable seems to always be True

@tomaswolf
Copy link
Member

Sorry for the long message, but when I looked at this it took me surprisingly long to find my way through the code. I must say I find the implementation of heartbeats very confusing.

TL;DR: An immediate work-around for the reporter (@d9e7381f ) might be to set CoreModuleProperties.HEARTBEAT_REPLY_WAIT to zero. But we should change the implementation, too.

Long version:

First, Apache MINA SSHD resets the session idle timeout whenever a message is received or is written. So even sending a heartbeat with wantReply == false will reset the idle timeout.

So wantReply == true makes sense only if one wants to implement the OpenSSH behaviour of ClientAliveCountMax or ServerAliveCountMax: terminate the connection once that many heartbeats did not get a reply (yet). But for that one doesn't have to wait synchronously for the reply; we have asynchronous global requests since version 2.9.0.

Second, heartbeats are implemented twice: once in AbstractConnectionService and then again in ClientConnectionService, and there are two sets of configuration properties. The server side gets the behaviour of AbstractConnectionService.

AbstractConnectionService is configured with properties CommonModuleProperties.SESSION_HEARTBEAT_TYPE (default "NONE", i.e., off) and CommonModuleProperties.SESSION_HEARTBEAT_INTERVAL (default 0, i.e. disabled). To enable, one needs to set an interval > 0, and type "IGNORE" or "RESERVED". If the heartbeat type is "IGNORE", it'll send a fire-and-forget SSH_MSG_IGNORE. If the heartbeat type is "RESERVED", it'll invoke a user handler that is supposed to send the heartbeat, and it might choose to do so with a global or channel request with want-reply == true.

Heartbeats in ClientConnectionService have additional configuration properties CoreModuleProperties.HEARTBEAT_INTERVAL (default zero, i.e. off), CoreModuleProperties.HEARTBEAT_REQUEST (default "keepalive@sshd.apache.org"), and CoreModuleProperties.HEARTBEAT_REPLY_WAIT (default 5 minutes). To enable this kind of heartbeat, set an interval > 0. It'll send a SSH_MSG_GLOBAL_REQUEST for "keepalive@sshd.apache.org", and because HEARTBEAT_REPLY_WAIT > 0, will send it with wantReply == true and will wait synchronously for five minutes for a reply to arrive. If no reply arrives in time, kill the SSH session. (If HEARTBEAT_REQUEST_WAIT is <= 0, the request will be sent with wantReply == false, i.e., fire-and-forget.) And to complicate matters: if this "global request heartbeat" is off (HEARTBEAT_INTERVAL <= 0), the above behavior from AbstractConnectionService kicks in (which is by default disabled).

Plus there's an additional quirk: AbstractConnectionService does not send heartbeats when a key exchange is ongoing. That was done for SSHD-1059, and since the "strict KEX" extension ("Terrapin" mitigation) it's crucial because there must not be any SSH_MSG_IGNORE messges during the initial key exchange. The mechanism in ClientConnectionService has no such provision, but a SSH_MSG_GLOBAL_REQUEST message will be delayed anyway until an ongoing key exchange is over.

This is a mess. There should be only one set of properties governing heartbeats.

The two separate mechanisms seem to have existed for a long time, and changing that now would be an API break.

The 5 minutes default value comes from commit 14bbd54, which was about SSHD-1020. Reading that discussion the issue seems to have been a low-level network I/O read timeout (NIO2_READ_TIMEOUT) of ~10 min, and HEARTBEAT_REPLY_WAIT = 0. So even though the client sent keep-alives every 90 sec, and the SSH idle timeout got reset, the session got killed because nothing was received for 10min because the server never sent any reply to these heartbeats because none was asked for. The solution was to set NIO2_READ_TIMEOUT = 0 (no timeout for low-level I/O reads) and that huge HEARTBEAT_REPLY_WAIT.

An immediate work-around for the reporter might be to set CoreModuleProperties.HEARTBEAT_REPLY_WAIT to zero. That should work if there is no low-level read timeout (NIO2_READ_TIMEOUT = 0).

But we should re-think this mechanism anyway. I propose to change this to follow the path taken by OpenSSH:

  • Ignore CoreModuleProperties.HEARTBEAT_REPLY_WAIT (and deprecate it now).
  • Introduce a CoreModuleProperties.HEARTBEAT_NO_REPLY_MAX: an integer, as in OpenSSH's ServerAliveCountMax. I.e., maximum number of heartbeats to send without having gotten a reply.

If CoreModuleProperties.HEARTBEAT_NO_REPLY_MAX <= 0, send heartbeats with wantReply == false.

If > 0, increment a counter whenever a heartbeat is to be sent. If the counter is then > HEARTBEAT_NO_REPLY_MAX, throw an exception (killing the session). Otherwise send the heartbeat request (as an asynchronous global request) with wantReply == true and a handler (called when the reply is received) that resets the counter to zero.

@gnodet : does that sound reasonable to you?

tomaswolf added a commit to tomaswolf/mina-sshd that referenced this issue May 25, 2024
Switch from a timeout model to the OpenSSH model: fail if there
are more than a certain number of heartbeats for which no reply
was received yet.

Bug: apache#461
tomaswolf added a commit to tomaswolf/mina-sshd that referenced this issue May 27, 2024
Switch from a timeout model to the OpenSSH model: fail if there
are more than a certain number of heartbeats for which no reply
was received yet.

Bug: apache#461
@tomaswolf
Copy link
Member

Any comments on the PR #507? If none, I'll assume it's fine and will merge tomorrow.

tomaswolf added a commit to tomaswolf/mina-sshd that referenced this issue May 28, 2024
Switch from a timeout model to the OpenSSH model: fail if there
are more than a certain number of heartbeats for which no reply
was received yet.

Bug: apache#461
tomaswolf added a commit to tomaswolf/mina-sshd that referenced this issue May 29, 2024
Switch from a timeout model to the OpenSSH model: fail if there
are more than a certain number of heartbeats for which no reply
was received yet.

Bug: apache#461
@tomaswolf tomaswolf self-assigned this May 29, 2024
@tomaswolf tomaswolf modified the milestone: 2.13.0 May 29, 2024
@tomaswolf tomaswolf added bug An issue describing a bug in the code labels May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An issue describing a bug in the code
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants