Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to avoid timing windows in WebSocket eventstream resulting in blocked streams in edge case reconnect scenario #152

Merged
merged 3 commits into from
Sep 28, 2021

Conversation

peterbroadhurst
Copy link
Contributor

@peterbroadhurst peterbroadhurst commented Sep 28, 2021

We've seen a hard to diagnose pattern of logs, that seem to result in an event stream getting stuck:

[2021-09-28T00:55:53.412Z]  INFO WS/d82a5fe8-8356-4ce9-7b3b-c749d0498deb: Connected
[2021-09-28T00:55:53.525Z]  WARN es-45c43457-c322-4883-41ad-d256af5b9325: Is currently blocked. InFlight=808 BatchSize=50
[2021-09-28T01:07:49.695Z] ERROR WS/d82a5fe8-8356-4ce9-7b3b-c749d0498deb: Error: websocket: close 1006 (abnormal closure): unexpected EOF
[2021-09-28T01:07:49.695Z]  INFO WS/d82a5fe8-8356-4ce9-7b3b-c749d0498deb: Disconnected
[2021-09-28T01:07:49.695Z]  INFO WS/d82a5fe8-8356-4ce9-7b3b-c749d0498deb: Closing
[2021-09-28T01:07:49.695Z] ERROR es-45c43457-c322-4883-41ad-d256af5b9325: Batch 1 attempt 1 failed. ErrorHandling=block BlockedRetryDelay=30s
[2021-09-28T01:07:49.695Z]  INFO WS/d82a5fe8-8356-4ce9-7b3b-c749d0498deb: Disconnected
[2021-09-28T01:07:50.132Z]  WARN es-45c43457-c322-4883-41ad-d256af5b9325: Is currently blocked. InFlight=808 BatchSize=50
[2021-09-28T01:07:51.133Z]  WARN es-45c43457-c322-4883-41ad-d256af5b9325: Is currently blocked. InFlight=808 BatchSize=50
[2021-09-28T01:07:52.134Z]  WARN es-45c43457-c322-4883-41ad-d256af5b9325: Is currently blocked. InFlight=808 BatchSize=50
[2021-09-28T01:07:53.134Z]  WARN es-45c43457-c322-4883-41ad-d256af5b9325: Is currently blocked. InFlight=808 BatchSize=50
[2021-09-28T01:07:54.134Z]  WARN es-45c43457-c322-4883-41ad-d256af5b9325: Is currently blocked. InFlight=808 BatchSize=50
[2021-09-28T01:07:54.890Z]  INFO WS/e3dbcfa5-1e71-47d6-5719-e2b3b1e880c8: Connected
[2021-09-28T01:07:55.135Z]  WARN es-45c43457-c322-4883-41ad-d256af5b9325: Is currently blocked. InFlight=808 BatchSize=50

This PR provides a very speculative change that might fix this.

The cycleTopic code assumes it's safe to close the t.closingChannel and then create a new one, at the same time another thread is spinning around a retry loop.
I'm speculating that isn't actually safe, and it's possible the loop comes back round and ends up getting the wrong t.closingChannel when it comes back round here in some timing condition:

sender, broadcaster, receiver, closing := w.es.wsChannels.GetChannels(topic)

The result is the next disconnect doesn't actually wake the loop, because it's listening on the wrong t.closingChannel.

So this PR does two things:

  1. Stops closing the channel, instead does a non-blocking push of a bool down the channel
  • See below, removed channel completely
  1. Adds extra logging to hopefully see more information, if this problem re-occurs with the speculative fix in place

…gging

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@peterbroadhurst
Copy link
Contributor Author

Been thinking a bit more about this, and I realize there is still a window. Going to work a bit more on a fix.

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@codecov-commenter
Copy link

codecov-commenter commented Sep 28, 2021

Codecov Report

Merging #152 (9aeade1) into main (8545f1c) will increase coverage by 0.03%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #152      +/-   ##
==========================================
+ Coverage   97.43%   97.46%   +0.03%     
==========================================
  Files          56       56              
  Lines        6590     6595       +5     
==========================================
+ Hits         6421     6428       +7     
+ Misses        128      127       -1     
+ Partials       41       40       -1     
Impacted Files Coverage Δ
ethconnect/internal/ws/wsserver.go 100.00% <0.00%> (ø)
ethconnect/internal/events/websockets.go 100.00% <0.00%> (ø)
ethconnect/internal/ws/wsconn.go 98.91% <0.00%> (+0.01%) ⬆️
ethconnect/internal/events/eventstream.go 97.95% <0.00%> (+0.58%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8545f1c...9aeade1. Read the comment docs.

@peterbroadhurst peterbroadhurst changed the title Use non-blocking push of bool rather than close channel, and extra logging Updates to ack handling logic to avoid timing windows for WebSocket event stream ack mis-handling Sep 28, 2021
@peterbroadhurst peterbroadhurst changed the title Updates to ack handling logic to avoid timing windows for WebSocket event stream ack mis-handling Updates to avoid timing windows in WebSocket eventstream resulting in blocked streams in edge case reconnect scenario Sep 28, 2021
@peterbroadhurst
Copy link
Contributor Author

Ok - a further update to the fix, which makes me more confident about it.

I have removed the closing channel completely. Instead the receiver channel is used, which already accepts an error.
Now when a websocket closes that was subscribed to a topic, it pushes an error into the receiver of that topic.
The eventstreams logic drains the receiver before sending a batch, so that the risk of processing a closed error from a previous connection is narrowed to a point where it's indistinguishable between sending on a connection that immediately terminates, and sending on a connection that just terminated as were sending it.
If it does happen, all that we'll get is a single batch redelivery.

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
close(t.closingChannel)
t.closingChannel = make(chan struct{})
select {
case t.receiverChannel <- errors.Errorf(errors.WebSocketClosed, connInfo):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the meat of the change.

@gabriel-indik gabriel-indik merged commit e708f38 into hyperledger:main Sep 28, 2021
@peterbroadhurst peterbroadhurst deleted the close-revamp branch September 28, 2021 17:07
@gabriel-indik gabriel-indik linked an issue Sep 28, 2021 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Event stream intermittently blocks on reconnection
4 participants