
reconnectTimer is sometimes not being scheduled if heartbeat fails but the connection closes cleanly #4944

Closed
bartlomiej-korpus opened this issue Sep 13, 2022 · 1 comment

@bartlomiej-korpus (Contributor) commented on Sep 13, 2022

Environment

  • Elixir version (elixir -v): 1.14.0
  • Phoenix version (mix deps): 1.6.12
  • Operating system: macOS, Windows

Expected behavior

The connection is always retried after non-graceful disconnects, such as those caused by a heartbeat timeout.

Actual behavior

Sometimes, when the heartbeat timeout is triggered (as I observe it, primarily due to poor network conditions), the connection will not be retried because reconnectTimer is not scheduled correctly.

Full story

We run an app with a pretty data-heavy use case based on Phoenix Channels, which is often used by users with not-so-great internet connections. In these circumstances, we started to notice that sometimes the connection would be dropped and then never retried, which occasionally led to data loss on the client's end.

After investigation, I was able to pin it down to the disconnection status code used here:

heartbeatTimeout(){
  if(this.pendingHeartbeatRef){
    this.pendingHeartbeatRef = null
    if(this.hasLogger()){ this.log("transport", "heartbeat timeout. Attempting to re-establish connection") }
    this.abnormalClose("heartbeat timeout")
  }
}

abnormalClose(reason){
  this.closeWasClean = false
  if(this.isConnected()){ this.conn.close(WS_CLOSE_NORMAL, reason) }
}

The abnormalClose method sets the closeWasClean flag to false and then initiates connection closure using the WS_CLOSE_NORMAL (1000) status code. This fragment of code clearly assumes that, the moment the heartbeat fails, the network connection is no longer present or functioning, and it expects the connection to be closed non-cleanly (as described in the WebSocket protocol, RFC 6455, section 7.1.5).

This assumption is visible here:

onConnClose(event){
  let closeCode = event && event.code
  if(this.hasLogger()) this.log("transport", "close", event)
  this.triggerChanError()
  clearTimeout(this.heartbeatTimer)
  if(!this.closeWasClean && closeCode !== 1000){
    this.reconnectTimer.scheduleTimeout()
  }
  this.stateChangeCallbacks.close.forEach(([, callback]) => callback(event))
}

The close code is expected to differ from 1000; only then is the connection retried (i.e., reconnectTimer is scheduled).

But the network connection might still be alive, and the 1000 close code requested by abnormalClose might be properly negotiated with the server. When the bug I am describing happens, the flag closeWasClean is false but the close code is 1000, and we are left with a Socket in a broken state: closed non-cleanly from the client's perspective (because the heartbeat failed), but cleanly from the server's perspective (because it received a close frame with a 1000 status code and replied with an identical one). Phoenix.js will then never retry the connection.
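To illustrate the clean-close handshake, here is a minimal sketch using the plain browser WebSocket API (the URL is hypothetical and not part of phoenix.js):

const ws = new WebSocket("wss://example.com/socket")
// The client requests a close with code 1000, just as abnormalClose does.
ws.onopen = () => ws.close(1000, "heartbeat timeout")
// If the server is still reachable, it completes the closing handshake, so
// the client observes code 1000 and wasClean === true, even though the close
// was triggered by a failed heartbeat.
ws.onclose = (event) => console.log(event.code, event.wasClean)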

Network conditions in which this becomes a problem might come from delays or from limited bandwidth in a data-heavy scenario; we experience both cases. The heartbeatInterval timer is started as soon as the heartbeat frame is pushed to the WebSocket buffer, but that is not the same as the frame actually being sent: other messages queued in front of it need to be transferred first.
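This buffering is observable via the standard bufferedAmount property of the browser WebSocket API. A minimal sketch (the URL is hypothetical, and the heartbeat message shape approximates the Phoenix v1 JSON serializer format):

const ws = new WebSocket("wss://example.com/socket")
ws.onopen = () => {
  ws.send("x".repeat(5 * 1024 * 1024)) // a large, data-heavy message
  ws.send(JSON.stringify({topic: "phoenix", event: "heartbeat", payload: {}, ref: "1"}))
  // bufferedAmount counts bytes queued but not yet transmitted, so on a slow
  // link the heartbeat can sit in the buffer long after its timer started.
  console.log(ws.bufferedAmount) // likely > 0 on a constrained connection
}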

I am able to reproduce the issue using the macOS Network Link Conditioner tool, and I am including a preview video of the reproduction on a clean phx.new project with the heartbeat interval adjusted to 5000 ms, since reducing it makes the issue more prevalent.
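For reference, adjusting the interval only requires passing the heartbeatIntervalMs option when constructing the client socket:

import {Socket} from "phoenix"

// A shorter heartbeat interval makes the race easier to hit on a throttled link.
let socket = new Socket("/socket", {heartbeatIntervalMs: 5000})
socket.connect()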

Here is how it looks:

Screen.Recording.2022-09-13.at.16.41.44-3.mov

And this is what can be seen in the network log:

[two network log screenshots]

The packets at times 17276 and 20279 are, respectively, the heartbeat response that we never observed on the JavaScript side (opcode 1) and the server's responding close frame.

For comparison, here is how it looks for a properly closed connection in the full-network-outage scenario (>99% of cases), where the close code becomes 1006 due to the server never responding:

Screen.Recording.2022-09-13.at.16.52.26-2.mov

Proposed solution

I'd propose adjusting the code used in abnormalClose to one in the range 4000-4999, which the RFC reserves for application-specific use cases. This would leave the 1000 code just for clean-close scenarios. I was able to confirm that this kind of fix performs as expected.

Another possible solution might be to have a special flag in the client's state just for the failed-heartbeat scenario, but I think reserving a special code here is appropriate.

I can happily prepare a PR if we agree this is indeed a bug in phoenix that's worth fixing.

This is my proposed solution, which I was able to confirm works as expected.
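A minimal sketch of that change, assuming a hypothetical WS_CLOSE_ABNORMAL constant (not an existing phoenix.js name):

// RFC 6455 reserves status codes 4000-4999 for private/application use.
const WS_CLOSE_ABNORMAL = 4000

abnormalClose(reason){
  this.closeWasClean = false
  // Closing with an application-specific code means onConnClose will see
  // closeCode !== 1000 and schedule the reconnectTimer, even when the server
  // is alive and completes the close handshake by echoing the same code.
  if(this.isConnected()){ this.conn.close(WS_CLOSE_ABNORMAL, reason) }
}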

Looks like the bug was introduced here: 470337d

This change still preserves the behaviour we want: no reconnect when the connection really was closed cleanly.

bartlomiej-korpus changed the title from "reconnectTimer is sometimes not being reset if heartbeat fails but the connection closes cleanly" to "reconnectTimer is sometimes not being scheduled if heartbeat fails but the connection closes cleanly" on Sep 13, 2022
@bartlomiej-korpus (Contributor, Author) commented:

Fixed by #4921, thanks to everyone involved! 😀
