reconnectTimer is sometimes not being scheduled if heartbeat fails but the connection closes cleanly
Environment
Expected behavior
Connection is always retried after non-graceful disconnects such as these happening due to a heartbeat timeout.
Actual behavior
Sometimes when the heartbeat timeout interval is triggered (as I observe it, primarily due to poor network conditions), the connection will not be retried due to reconnectTimer not being scheduled correctly.
Full story
We run an app with a pretty data-heavy use case based on Phoenix Channels, which is often used by users with not-so-great internet connections. In these circumstances, we started to notice that sometimes the connection would be dropped and then never retried again, which sometimes led to data loss on the client's end.
After investigation, I was able to pin it down to the disconnection status code used here:
phoenix/assets/js/phoenix/socket.js, lines 334 to 340 (at a01d10b)
phoenix/assets/js/phoenix/socket.js, lines 517 to 520 (at a01d10b)
The abnormalClose method sets the closeWasClean flag to false and then initiates connection closure using the WS_CLOSE_NORMAL (1000) status code. This fragment of code clearly assumes that the moment the heartbeat fails, the network connection is no longer present/functioning, and it expects the connection to be closed non-cleanly (as described in the WebSocket protocol, RFC 6455, section 7.1.5). This assumption is visible here:
phoenix/assets/js/phoenix/socket.js, lines 392 to 401 (at a01d10b)
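To make the referenced logic easier to follow, here is a rough paraphrase of the two pieces involved (illustrative only; names and method bodies are simplified, not the exact phoenix.js source):

```js
// Rough paraphrase of the referenced socket.js logic, simplified for illustration.
const WS_CLOSE_NORMAL = 1000

class SocketSketch {
  constructor(conn, reconnectTimer){
    this.conn = conn                      // underlying WebSocket
    this.reconnectTimer = reconnectTimer  // timer exposing scheduleTimeout()
    this.closeWasClean = false
  }

  // Called when a heartbeat reply does not arrive within heartbeatInterval.
  abnormalClose(reason){
    this.closeWasClean = false
    this.conn.close(WS_CLOSE_NORMAL, reason) // still closes with 1000
  }

  // Called when the transport reports that the connection has closed.
  onConnClose(event){
    // A retry is scheduled only when the close was not clean AND the
    // negotiated close code differs from 1000. If the server completes the
    // 1000 close handshake, event.code === 1000 and no retry is scheduled.
    if(!this.closeWasClean && event.code !== WS_CLOSE_NORMAL){
      this.reconnectTimer.scheduleTimeout()
    }
  }
}
```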
The close code is expected to be different from 1000, and only then is the connection retried (reconnectTimer is set).

But the network connection might still be alive, and the 1000 closing code requested by abnormalClose might be properly negotiated with the server. When the bug I am describing happens, in this code the closeWasClean flag is false, but the code is 1000, and we are left with a Socket in a broken state: it has been closed non-cleanly (from the client's perspective, because the heartbeat failed) but also cleanly (from the server's perspective, because it received a closing frame with a 1000 status code and replied with an identical one), and Phoenix.js will never retry the connection.

Network conditions in which this becomes a problem might come from delays or from limited bandwidth in a data-heavy scenario; we experience both of these cases. The heartbeatInterval timer is started as soon as the heartbeat frame is pushed to the WebSocket buffer, but that is not the same as the frame actually being sent: other messages in front of it need to be transferred first.
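To illustrate the buffering point with the plain browser WebSocket API (not Phoenix code; the URL, message shape, and function name are placeholders):

```js
// WebSocket#send only enqueues the frame; bufferedAmount reports how many
// bytes are still waiting to leave the client. A timeout armed at send() time
// therefore measures "time since buffering", not "time since actually sent".
const ws = new WebSocket("wss://example.test/socket/websocket") // placeholder URL

function pushHeartbeat(){
  // Simplified heartbeat-like message; the real wire format differs.
  ws.send(JSON.stringify({topic: "phoenix", event: "heartbeat", payload: {}, ref: "1"}))
  console.log("bytes still queued locally:", ws.bufferedAmount)

  setTimeout(() => {
    // On a congested link the reply may simply not have had time to arrive yet.
    console.log("heartbeat timeout fired after 5000 ms")
  }, 5000)
}
```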
I am able to reproduce the issue using the macOS Network Link Conditioner tool, and I am including a preview video with a reproduction on a clean phx.new project with heartbeatInterval adjusted to 5000, since reducing it makes the issue more prevalent.

Here is how it looks:
Screen.Recording.2022-09-13.at.16.41.44-3.mov
And this is what can be seen in the network log:
Packets at times 17276 and 20279 are, respectively, the heartbeat response that we never observed on the JavaScript side (opcode 1) and the server's responding close frame.

For comparison, here is how it looks for a properly closed connection in the full-network-out scenario (>99% of cases), where the closing code becomes 1006 due to the server never responding:

Screen.Recording.2022-09-13.at.16.52.26-2.mov
Proposed solution
I'd propose adjusting the code used in abnormalClose to one in the range 4000-4999; per the RFC, these are available for application-specific use cases. This would leave the 1000 code just for the clean-close scenarios. I was able to confirm that this kind of fix performs as expected.

Another possible solution might be to have a special flag in the client's state just for the failed-heartbeat scenario, but I think reserving a special code here is appropriate.

I can happily prepare a PR if we agree this is indeed a bug in Phoenix that's worth fixing.
This is my proposed solution that I was able to confirm works as expected.
Looks like the bug was introduced here: 470337d
And this change still accounts for the behaviour that we do not want to reconnect in case the connection was closed genuinely cleanly.
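For reference, a minimal sketch of the kind of change proposed above, building on the paraphrase earlier in this issue (the constant name and value are placeholders, not the actual diff):

```js
// Any code in the application-specific 4000-4999 range would do; the name is
// a placeholder.
const WS_CLOSE_ABNORMAL = 4000

class SocketSketchFixed extends SocketSketch {
  abnormalClose(reason){
    this.closeWasClean = false
    // A 4000-range code survives even a fully negotiated close handshake, so
    // the `event.code !== 1000` check in onConnClose still schedules the
    // reconnectTimer, while a genuine clean close keeps using 1000.
    this.conn.close(WS_CLOSE_ABNORMAL, reason)
  }
}
```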