
Honor deadman timer, fixes #1093 #1094

Merged · 3 commits merged into slack-go:master on Sep 11, 2022

Conversation

parsley42
Member

@parsley42 parsley42 commented Jul 28, 2022

NOTE: The original description of this PR is outdated, but left here for reference. See the discussion for how the PR was updated.

  • Run make pr-prep from the root of the repository to run formatting, linting and tests.

This patch modifies the main run loop for the socket mode connection. Based on reading the comments, it appears the intention was for run() to return context.Canceled in the event of a cancelled context, or nil in the event of a different error. This is based on the comment:

// wg.Wait() finishes only after any of the above go routines finishes.

... but based on the use of the wait group, it appears that ALL of the goroutines would have to finish, causing the deadman timer to be ignored. This patch removes the wait group (and wg.Wait()) and instead selects on either 1) context cancelled, or 2) timer expired.

I believe this preserves the original intent of the code - but I could be wrong! In any case, this patch fixes #1093 for me, and my robot still runs correctly AFAICS. Ping to @xNok (the original PR author) to have a look.

@parsley42 parsley42 requested a review from kanata2 July 28, 2022 16:29
@xNok
Contributor

xNok commented Jul 29, 2022

I didn't write that part, but the wait group is supposed to be the right approach.

Yes, ALL of the goroutines would have to finish, but this is normal because you want a graceful shutdown. When you want to terminate the run, you need to cancel the context, and then each goroutine should stop gracefully.

Ideally, when the deadman triggers, you should cancel the context and wait for all goroutines to terminate, then reconnect.

As to why it doesn't work properly, I don't know yet 😞

@parsley42
Member Author

Yes, ALL of the goroutines would have to finish, but this is normal because you want a graceful shutdown. When you want to terminate the run, you need to cancel the context, and then each goroutine should stop gracefully.

Ideally, when the deadman triggers, you should cancel the context and wait for all goroutines to terminate, then reconnect.

Ah, ok, thanks for the - hah! - context. Looking at the code, I see what you mean - so, the root of the problem is that one of the items being waited on is ReadJSON, which doesn't take a context. I'm going to mark the PR as draft and work on adding back the waitGroup and trying a different approach.

@parsley42 parsley42 marked this pull request as draft July 29, 2022 12:29
@parsley42 parsley42 removed the request for review from kanata2 July 29, 2022 12:29
@parsley42
Member Author

Ok @xNok and @mumoshu - this much smaller PR fixes things for me, and may be "good enough". In short, the runMessageReceiver thread eventually blocks in ReadJSON on the websocket, which doesn't take a context and so doesn't exit gracefully on cancel. All my PR does is remove that thread from the waitGroup and update some comments. How does this look?

@mumoshu
Contributor

mumoshu commented Jul 30, 2022

@parsley42 Hey! Thanks for the fix.
I was rereading the code I wrote, but there's one thing I'm missing - it would be great if you could clarify it if possible.
In L145-147 we have:

		case <-ctx.Done():
			// Detect when the connection is dead and try close connection.
			if err = conn.Close(); err != nil {

which is running in another goroutine.
I think it is supposed to chime in and close the connection on a context cancellation, which in turn should make the ReadJSON attempt on the reader (=websocket.Conn) fail ASAP.

Is it that closing the connection doesn't result in ReadJSON failing immediately, hence your fix is needed?

@parsley42
Member Author

parsley42 commented Jul 30, 2022

Is it that closing the connection doesn't result in ReadJSON failing immediately, hence your fix is needed?

@mumoshu Thanks for having a look! Yes, that's my assessment of the issue. I'll walk you through my analysis.

The core robot code (github.com/lnxjedi/gopherbot) had been running very stably with RTM for many years, but a couple of weeks after migrating to socket mode, one of our robots stopped responding to commands, and in the Slack app I got a dispatch_failed error when I used the robot's slash command - this led me to suspect the connection had gone bad. Since people were trying to get work done, I didn't take further time to troubleshoot (a netstat would have been good at this point); I just restarted the robot service. When the robot process restarted, all was well again.

This led me to look at the code for a ping timeout, since I knew Slack was sending a ping at regular intervals. To artificially suppress the pings arriving from Slack, I used the local firewall to block inbound traffic on the port, then watched the slack logs. It took several minutes for the connection to reset. I repeated the experiment, sent SIGUSR1 (which forces an abort/stack dump) during the long hang, and found a thread blocked in ReadJSON. Seeing your comments now, I recognize that a full block like this would prevent a graceful TCP connection shutdown. I'm trying to think of a way to test this better. Of course, if the vm/container/whatever on the Slack end just goes away, the result would be the same. It would probably be good to reconnect quickly on ping timeout, even if the original connection can't be closed gracefully.

@mumoshu
Contributor

mumoshu commented Jul 31, 2022

@parsley42 Thanks for your confirmation!

I read your experience and analysis and went back to the gorilla/websocket code. I now believe conn.Close does no more than close the TCP connection, with no coordination with the other side of the connection.

https://github.com/gorilla/websocket/blob/af47554f343b4675b30172ac301638d350db34a5/conn.go#L342-L346

ReadJSON can fail immediately only if the other side of the connection (=the websocket server) sends us a WebSocket "Close" message, which doesn't happen in this scenario.

This gorilla/websocket issue describes a related (though not identical) problem: gorilla/websocket#474

If we were to gracefully stop the bot while Slack's websocket server vm/container/whatever and the TCP connection in between are still alive, we might be able to start a normal close procedure by writing a control message instead of just Close-ing it, e.g. c.WriteControl(websocket.CloseMessage, websocket.FormatCloseMessage(websocket.CloseNormalClosure, "")), so that the server will hopefully send us a close message, which makes the underlying read, and therefore ReadJSON, fail immediately.

https://github.com/gorilla/websocket/blob/af47554f343b4675b30172ac301638d350db34a5/conn.go#L411-L413

But:

if the vm/container/whatever on the Slack end just goes away, the result would be the same

This is really a good point. If the Slack end disappears without notifying us, the normal close procedure will end up in a read/connection timeout. We need to handle that timeout in a graceful stop scenario anyway.

That said, I believe your fix is perfectly valid and also the best way to handle the timeout issue!

@parsley42 parsley42 marked this pull request as ready for review July 31, 2022 12:44
@kanata2 kanata2 requested review from kanata2 and removed request for ollieparsley August 1, 2022 02:13
@kanata2 kanata2 added needs review SocketMode about SocketMode bugfix [PR] bugfix labels Aug 1, 2022
@parsley42
Member Author

@kanata2 Can you have a look at this and let me know if you have any questions?

@kanata2
Member

kanata2 commented Sep 9, 2022

Sorry for the slow response. I'll confirm this weekend.

@kanata2 kanata2 left a comment

@parsley42
I have read discussions in this PR and agree with you, so will merge this.

@xNok @mumoshu
Thanks for your reviews.

@kanata2 kanata2 merged commit 6f5eda2 into slack-go:master Sep 11, 2022
Successfully merging this pull request may close these issues.

socket mode deadman timer ineffective