
reconnectTimer is sometimes not being scheduled if heartbeat fails but the connection closes cleanly #4944

Closed
bartlomiej-korpus opened this issue Sep 13, 2022 · 1 comment

@bartlomiej-korpus (Contributor) commented on Sep 13, 2022

Environment

  • Elixir version (elixir -v): 1.14.0
  • Phoenix version (mix deps): 1.6.12
  • Operating system: macOS, Windows

Expected behavior

The connection is always retried after non-graceful disconnects, such as those caused by a heartbeat timeout.

Actual behavior

Sometimes, when the heartbeat timeout is triggered (as I observe it, primarily due to poor network conditions), the connection will not be retried because reconnectTimer is not scheduled correctly.

Full story

We run an app with a pretty data-heavy use case based on Phoenix Channels, which is often used by users with not-so-great internet connections. In these circumstances, we started to notice that sometimes the connection would be dropped and then never retried, which occasionally led to data loss on the client's end.

After investigation, I was able to pin it down to the disconnection status code used here:

heartbeatTimeout(){
  if(this.pendingHeartbeatRef){
    this.pendingHeartbeatRef = null
    if(this.hasLogger()){ this.log("transport", "heartbeat timeout. Attempting to re-establish connection") }
    this.abnormalClose("heartbeat timeout")
  }
}

abnormalClose(reason){
  this.closeWasClean = false
  if(this.isConnected()){ this.conn.close(WS_CLOSE_NORMAL, reason) }
}

The abnormalClose method sets the closeWasClean flag to false and then initiates connection closure using the WS_CLOSE_NORMAL (1000) status code. This fragment of code clearly assumes that, the moment the heartbeat fails, the network connection is no longer present or functioning, and it expects the connection to be closed non-cleanly (as described in the WebSocket protocol, RFC 6455, section 7.1.5).

This assumption is visible here:

onConnClose(event){
  let closeCode = event && event.code
  if(this.hasLogger()) this.log("transport", "close", event)
  this.triggerChanError()
  clearTimeout(this.heartbeatTimer)
  if(!this.closeWasClean && closeCode !== 1000){
    this.reconnectTimer.scheduleTimeout()
  }
  this.stateChangeCallbacks.close.forEach(([, callback]) => callback(event))
}

The close code is expected to differ from 1000; only then is the connection retried (i.e., reconnectTimer is scheduled).

But the network connection might still be alive, and the 1000 close code requested by abnormalClose might be properly negotiated with the server. When the bug I am describing happens, the flag closeWasClean is false but the close code is 1000, and we are left with a Socket in a broken state: closed non-cleanly from the client's perspective (because the heartbeat failed), but cleanly from the server's perspective (because it received a close frame with a 1000 status code and replied with an identical one). Phoenix.js will then never retry the connection.
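To illustrate the clean-close handshake, here is a minimal sketch using the plain browser WebSocket API (the URL is hypothetical and not part of phoenix.js):

const ws = new WebSocket("wss://example.com/socket")
// The client requests a close with code 1000, just as abnormalClose does.
ws.onopen = () => ws.close(1000, "heartbeat timeout")
// If the server is still reachable, it completes the closing handshake, so
// the client observes code 1000 and wasClean === true, even though the close
// was triggered by a failed heartbeat.
ws.onclose = (event) => console.log(event.code, event.wasClean)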

Network conditions in which this becomes a problem might come from delays or from limited bandwidth in a data-heavy scenario; we experience both cases. The heartbeatInterval timer is started as soon as the heartbeat frame is pushed to the WebSocket buffer, but that is not the same as the frame actually being sent: other messages queued in front of it need to be transferred first.
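This buffering is observable via the standard bufferedAmount property of the browser WebSocket API. A minimal sketch (the URL is hypothetical, and the heartbeat message shape approximates the Phoenix v1 JSON serializer format):

const ws = new WebSocket("wss://example.com/socket")
ws.onopen = () => {
  ws.send("x".repeat(5 * 1024 * 1024)) // a large, data-heavy message
  ws.send(JSON.stringify({topic: "phoenix", event: "heartbeat", payload: {}, ref: "1"}))
  // bufferedAmount counts bytes queued but not yet transmitted, so on a slow
  // link the heartbeat can sit in the buffer long after its timer started.
  console.log(ws.bufferedAmount) // likely > 0 on a constrained connection
}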

I am able to reproduce the issue using the macOS Network Link Conditioner tool, and I am including a preview video of the reproduction on a clean phx.new project with the heartbeat interval adjusted to 5000 ms, since reducing it makes the issue more prevalent.
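For reference, adjusting the interval only requires passing the heartbeatIntervalMs option when constructing the client socket:

import {Socket} from "phoenix"

// A shorter heartbeat interval makes the race easier to hit on a throttled link.
let socket = new Socket("/socket", {heartbeatIntervalMs: 5000})
socket.connect()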

Here is how it looks:

Screen.Recording.2022-09-13.at.16.41.44-3.mov

And this is what can be seen in the network log:

[two network log screenshots]

The packets at times 17276 and 20279 are, respectively, the heartbeat response that we never observed on the JavaScript side (opcode 1) and the server's responding close frame.

For comparison, here is how it looks for a properly closed connection in the full-network-outage scenario (>99% of cases), where the close code becomes 1006 due to the server never responding:

Screen.Recording.2022-09-13.at.16.52.26-2.mov

Proposed solution

I'd propose adjusting the code used in abnormalClose to one in the range 4000-4999, which the RFC reserves for application-specific use cases. This would leave the 1000 code just for clean-close scenarios. I was able to confirm that this kind of fix performs as expected.

Another possible solution might be to have a special flag in the client's state just for the failed-heartbeat scenario, but I think reserving a special code here is appropriate.

I can happily prepare a PR if we agree this is indeed a bug in phoenix that's worth fixing.

This is my proposed solution, which I was able to confirm works as expected.
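A minimal sketch of that change, assuming a hypothetical WS_CLOSE_ABNORMAL constant (not an existing phoenix.js name):

// RFC 6455 reserves status codes 4000-4999 for private/application use.
const WS_CLOSE_ABNORMAL = 4000

abnormalClose(reason){
  this.closeWasClean = false
  // Closing with an application-specific code means onConnClose will see
  // closeCode !== 1000 and schedule the reconnectTimer, even when the server
  // is alive and completes the close handshake by echoing the same code.
  if(this.isConnected()){ this.conn.close(WS_CLOSE_ABNORMAL, reason) }
}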

Looks like the bug was introduced here: 470337d

This change still preserves the behaviour we want: no reconnect when the connection really was closed cleanly.

bartlomiej-korpus changed the title from "reconnectTimer is sometimes not being reset if heartbeat fails but the connection closes cleanly" to "reconnectTimer is sometimes not being scheduled if heartbeat fails but the connection closes cleanly" on Sep 13, 2022
@bartlomiej-korpus (Contributor, Author) commented:

Fixed by #4921, thanks to everyone involved! 😀
