Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reconnecting websockets on heartbeat timeout #4921

Merged
merged 1 commit into from Oct 14, 2022

Conversation

laurglia
Copy link
Contributor

In Google Chrome, when network connection is interrupted, the websocket
connections break but they do not properly close. The following happens:

  • The error and close events are not triggered
  • Sending data with WebSocket#send works without errors even though
    the data is not actually sent
  • When calling WebSocket#close, the websocket is not closed.
    That is expected, as closing a websocket includes sending a closing
    handshake which cannot be done when the underlying connection is broken.
    The closing handshake is documented in the WebSockets
    Standard
    .

Because of the behavior described above, Phoenix failed to reconnect
sockets with WebSocket transport. When heartbeat timed out, WebSocket#close
was used to close the connection but since the connection was broken,
sending the closing handshake failed, and the close event was not
triggered.

It seems like that after some time, after around 2--3 minutes, the
WebSocket finally closes (the close event triggers) but that is too
long period for our use case.

To demonstrate the issue I have created a toy Phoenix project which sets
up a WebSocket connection and displays Phoenix Socket logs on the page.
The project is here: https://github.com/laurglia/phoenix-reconnection-issue

To reproduce the problem:

  1. Clone that repository
  2. Run the Phoenix server with mix phx.server in somewhere other than
    your local network
  3. Open the index route in Google Chrome
  4. Interrupt the network, for example, connect to another WiFi
  5. Observe that the Phoenix heartbeat fails and says that it is going to
    reconnect but in reality it does not immediately reconnect. You can use
    the "Ping" button on top of the page to also try to push a "ping" packet
    and observe that it is not sent.
  6. Only after around 1-3 minutes, the WebSocket connection is
    reestablished

To make testing easier, I have set up a server which runs the toy
project. You can access the server at http://34.219.19.64:4000/. Here is
also a video of me reproducing the issue: https://youtu.be/BP3rK06p0Ww

This commit fixes the issue by "tearing down" the WebSocket connection
after heartbeat timeout instead of just attempting to close it. The
teardown functionality does not necessarily close the connection but it
will remove all references to it, allowing a new one to be created in
place of the current one.

I also needed to make some changes to the teardown functionality.
Previously, it was not necessary to remove listeners from the
connections because once a connection was closed, it was not able for it
to emit any events. However, now since the connection may actually stay
open for some time and emit a close event after a few minutes, then that
could cause problems if the connection was replaced with a new
connection. That is why we now we set this.conn.onopen,
this.conn.onerror and other listeners to no-ops when tearing down the
connection.

On my test server, I am running the updated version of Phoenix on my toy
project. You can access the updated version here: http://34.219.19.64:4001/

Here is a video of me showing that the issue has been fixed in the
updated version: https://youtu.be/3t4nckli07k

@laurglia laurglia force-pushed the fix_reconnect_issue branch 2 times, most recently from 69190d9 to 76a435d Compare August 22, 2022 12:16
In Google Chrome, when network connection is interrupted, the websocket
connections break but they do not properly close. The following happens:
- The `error` and `close` events are not triggered
- Sending data with `WebSocket#send` works without errors even though
the data is not actually sent
- When calling `WebSocket#close`, the websocket is not closed.
That is expected, as closing a websocket includes sending a closing
handshake which cannot be done when the underlying connection is broken.
The closing handshake is documented in the [WebSockets
Standard][websockets-standard].

Because of the behavior described above, Phoenix failed to reconnect
sockets with WebSocket transport. When heartbeat timed out, `WebSocket#close`
was used to close the connection but since the connection was broken,
sending the closing handshake failed, and the close event was not
triggered.

It seems like that after some time, after around 2--3 minutes, the
WebSocket finally closes (the `close` event triggers) but that is too
long period for our use case.

To demonstrate the issue I have created a toy Phoenix project which sets
up a WebSocket connection and displays Phoenix Socket logs on the page.
The project is here: https://github.com/laurglia/phoenix-reconnection-issue

To reproduce the problem:
1. Clone that repository
2. Run the Phoenix server with `mix phx.server` in somewhere other than
your local network
3. Open the index route in Google Chrome
4. Interrupt the network, for example, connect to another WiFi
5. Observe that the Phoenix heartbeat fails and says that it is going to
reconnect but in reality it does not immediately reconnect. You can use
the "Ping" button on top of the page to also try to push a "ping" packet
and observe that it is not sent.
6. Only after around 1-3 minutes, the WebSocket connection is
reestablished

To make testing easier, I have set up a server which runs the toy
project. You can access the server at http://34.219.19.64:4000/. Here is
also a video of me reproducing the issue: https://youtu.be/BP3rK06p0Ww

This commit fixes the issue by "tearing down" the WebSocket connection
after heartbeat timeout instead of just attempting to close it. The
teardown functionality does not necessarily close the connection but it
will remove all references to it, allowing a new one to be created in
place of the current one.

I also needed to make some changes to the teardown functionality.
Previously, it was not necessary to remove listeners from the
connections because once a connection was closed, it was not able for it
to emit any events. However, now since the connection may actually stay
open for some time and emit a close event after a few minutes, then that
could cause problems if the connection was replaced with a new
connection. That is why we now we set `this.conn.onopen`,
`this.conn.onerror` and other listeners to no-ops when tearing down the
connection.

On my test server, I am running the updated version of Phoenix on my toy
project. You can access the updated version here: http://34.219.19.64:4001/

Here is a video of me showing that the issue has been fixed in the
updated version: https://youtu.be/3t4nckli07k

[websockets-standard]: https://websockets.spec.whatwg.org/#dom-websocket-close
@laurglia laurglia marked this pull request as ready for review August 22, 2022 12:27
@chrismccord
Copy link
Member

Thank you for such a detailed report and fix! I can verify that I reproduced this on Chrome, which is interesting because Safari is not affected. Give me some time to dig into the PR, but looks good at first glance. Thank you!

@laurglia
Copy link
Contributor Author

Posting here as well for clarity but I created one more pull request to fix another, but different bug related to reconnections: #4924

@guessthepw
Copy link

guessthepw commented Aug 25, 2022

@laurglia - this is amazing. I have been noticing this in my work app.

@chrismccord chrismccord merged commit 1a0e70a into phoenixframework:master Oct 14, 2022
@chrismccord
Copy link
Member

❤️❤️❤️🐥🔥

chrismccord pushed a commit that referenced this pull request Oct 26, 2022
In Google Chrome, when network connection is interrupted, the websocket
connections break but they do not properly close. The following happens:
- The `error` and `close` events are not triggered
- Sending data with `WebSocket#send` works without errors even though
the data is not actually sent
- When calling `WebSocket#close`, the websocket is not closed.
That is expected, as closing a websocket includes sending a closing
handshake which cannot be done when the underlying connection is broken.
The closing handshake is documented in the [WebSockets
Standard][websockets-standard].

Because of the behavior described above, Phoenix failed to reconnect
sockets with WebSocket transport. When heartbeat timed out, `WebSocket#close`
was used to close the connection but since the connection was broken,
sending the closing handshake failed, and the close event was not
triggered.

It seems like that after some time, after around 2--3 minutes, the
WebSocket finally closes (the `close` event triggers) but that is too
long period for our use case.

To demonstrate the issue I have created a toy Phoenix project which sets
up a WebSocket connection and displays Phoenix Socket logs on the page.
The project is here: https://github.com/laurglia/phoenix-reconnection-issue

To reproduce the problem:
1. Clone that repository
2. Run the Phoenix server with `mix phx.server` in somewhere other than
your local network
3. Open the index route in Google Chrome
4. Interrupt the network, for example, connect to another WiFi
5. Observe that the Phoenix heartbeat fails and says that it is going to
reconnect but in reality it does not immediately reconnect. You can use
the "Ping" button on top of the page to also try to push a "ping" packet
and observe that it is not sent.
6. Only after around 1-3 minutes, the WebSocket connection is
reestablished

To make testing easier, I have set up a server which runs the toy
project. You can access the server at http://34.219.19.64:4000/. Here is
also a video of me reproducing the issue: https://youtu.be/BP3rK06p0Ww

This commit fixes the issue by "tearing down" the WebSocket connection
after heartbeat timeout instead of just attempting to close it. The
teardown functionality does not necessarily close the connection but it
will remove all references to it, allowing a new one to be created in
place of the current one.

I also needed to make some changes to the teardown functionality.
Previously, it was not necessary to remove listeners from the
connections because once a connection was closed, it was not able for it
to emit any events. However, now since the connection may actually stay
open for some time and emit a close event after a few minutes, then that
could cause problems if the connection was replaced with a new
connection. That is why we now we set `this.conn.onopen`,
`this.conn.onerror` and other listeners to no-ops when tearing down the
connection.

On my test server, I am running the updated version of Phoenix on my toy
project. You can access the updated version here: http://34.219.19.64:4001/

Here is a video of me showing that the issue has been fixed in the
updated version: https://youtu.be/3t4nckli07k

[websockets-standard]: https://websockets.spec.whatwg.org/#dom-websocket-close
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants