Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bolt + Flask + Kubernetes inevitably starts throwing WebSocketConnectionClosedException #445

Open
IanWhalen opened this issue Aug 17, 2021 · 8 comments

Comments

@IanWhalen
Copy link

Running a bolt app using socket mode inside Flask inside kubernetes works initially but eventually always loses the connection and falls back to a WebSocketConnectionClosedException error.

Given that auto_reconnect_enabled defaults to True, I would expect any failures to just result in the app reconnecting.

I'm opening this as a question as I'm highly doubtful its an actual bug, and instead just something I need to do differently/better in my own app code.

Reproducible in:

The slack_bolt version

slack-bolt = "1.6.0"
slack-sdk = "3.8.0"
websocket-client = "1.1.0"

Python runtime version

python3.7

OS info

Problem is seen running in a container.

Steps to reproduce:

I've tried to emulate the pattern in #255 for running bolt + slack, so a simplified version of my app looks like this:

# ./app.py
from flask import Flask
from slack_app.slack_service import slack

slack.connect()

app = Flask(__name__)
# ./slack_app/slack_service.py
from slack_bolt import App
from slack_bolt.error import BoltUnhandledRequestError
from slack_bolt.adapter.socket_mode.websocket_client import SocketModeHandler

SLACK_APP_TOKEN, SLACK_BOT_TOKEN = get_slack_tokens_from_env()

app = App(
    token=SLACK_BOT_TOKEN,
    raise_error_for_unhandled_request=True,
)
slack = SocketModeHandler(app, SLACK_APP_TOKEN)

@app.error
def handle_errors(error):
    if isinstance(error, BoltUnhandledRequestError):
        pass
    else:
        logger.error(error)

I doubt the BoltUnhandledRequestError is causing this but included it in my example code just in case.

Maybe of note is that i'm using websocket_client based on the suggestion in slackapi/python-slack-sdk#1024. We were seeing the same BlockingIOError logs.

Also maybe of note is that in #255 you suggest using two threads for gunicorn and we are just currently running with:

gunicorn app:app --workers=1 --bind=0.0.0.0:8080 --timeout=3600

Lastly of note is that I am unable to repro this problem locally, and I'm just seeing it inside of our kubernetes cluster. Unfortunately I'm not savvy enough to know how to debug whether the k8s infra is causing my problem (although I am simultaneous to filing this issue working with the people who maintain that infra to investigate from that side).

Expected result:

My slack connection doesn't die.

Actual result:

My app connects fine initially, but after some period of time disconnects from slack and the logs quickly degenerate into the following error every 5 seconds:

on_error invoked (error: WebSocketConnectionClosedException, message: Connection to remote host was lost.)

@seratch seratch added this to the 1.x milestone Aug 18, 2021
@seratch
Copy link
Member

seratch commented Aug 18, 2021

Hi @IanWhalen, thanks for sharing the details of the issue. All the information in the description are really helpful. But unfortunately, I'm also unsure what the root cause of the issue is at this moment.

Lastly of note is that I am unable to repro this problem locally, and I'm just seeing it inside of our kubernetes cluster.

If you don't mind my asking, would it be possible to ask you to check if the same situation can arise with a simpler app in your k8s cluster? I'm guessing that gunicorn + web framework might not be the contributing factors of the failure of reconnection in this scenario. If so, just running a simple WebSocket client process can reproduce the same situation. With the following example, you can see more detailed information about the app (debug level logging + trace_enabled option)

# app.py

# Prints debug-level logs
import logging
logging.basicConfig(level=logging.DEBUG)

import os
from threading import Event
from slack_sdk.web import WebClient
from slack_sdk.socket_mode.response import SocketModeResponse
from slack_sdk.socket_mode.request import SocketModeRequest
from slack_sdk.socket_mode.websocket_client import SocketModeClient

client = SocketModeClient(
    app_token=os.environ.get("SLACK_APP_TOKEN"),
    web_client=WebClient(token=os.environ.get("SLACK_BOT_TOKEN")),
    trace_enabled=True,  # print more details happening under the hood
)

if __name__ == "__main__":
    def process(client: SocketModeClient, req: SocketModeRequest):
        if req.type == "events_api":
            response = SocketModeResponse(envelope_id=req.envelope_id)
            client.send_socket_mode_response(response)

    client.socket_mode_request_listeners.append(process)
    client.connect()
    Event().wait()
# requirements.txt
slack-sdk
websocket-client
# Dockerfile
# docker build . -t your-repo/hello-bolt
# docker run -e SLACK_APP_TOKEN=$SLACK_APP_TOKEN -e SLACK_BOT_TOKEN=$SLACK_BOT_TOKEN -it your-repo/hello-bolt
#
FROM python:3.9.5-slim-buster
WORKDIR /app/
COPY *.py /app/
COPY requirements.txt /app/
RUN pip install -U pip && pip install -r requirements.txt
ENTRYPOINT python app.py

If your Docker containers require a healthcheck endpoint, please go with the following example:

def process(client: SocketModeClient, req: SocketModeRequest):
    if req.type == "events_api":
        response = SocketModeResponse(envelope_id=req.envelope_id)
        client.send_socket_mode_response(response)

client.socket_mode_request_listeners.append(process)
client.connect()

from flask import Flask
flask_app = Flask(__name__)

This example requires flask, gunicorn in requirements.txt. Also, the Dockerfile will be mostly the same with your current one.

If we figure out that there is an issue (or room for improvement for stability) on this library side, we will fix the issue as quickly as possible.

@IanWhalen
Copy link
Author

IanWhalen commented Aug 19, 2021

Thanks @seratch. I'm going to investigate trying to make the minimum viable reproducer with the pointers you provided. In the meantime I think I have found a local reproducer - or at least I've found a way to reproduce the symptoms - which might indicate this isn't necessarily a container issue.

This is going to sound crazy, but I can very reliably trigger this error by:

  1. start app in debug mode.
  2. put in a breakpoint right before a call to slack.chat_postMessage().
  3. wait more than 30-60 seconds on that breakpoint.
  4. Continue past breakpoint into chat_postMessage() call.

The message will successfully post, but I'll get the following when I have trace_enabled for SocketModeHandler:

2021-08-19 13:36:48.159562 [info     ] normal app logging message here
2021-08-19 13:36:48.213292 [info     ] another normal app logging message here
# see timestamps - paused on breakpoint.
2021-08-19 13:38:01.545749 [debug    ] ++Sent raw: b'\x89\x80\xe7\xd8n}'
2021-08-19 13:38:01.546778 [debug    ] ++Sent decoded: fin=1 opcode=9 data=b''
2021-08-19 13:38:01.565154 [debug    ] ++Rcv raw: b'\x89)Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.565405 [debug    ] ++Rcv decoded: fin=1 opcode=9 data=b'Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.567176 [debug    ] ++Sent raw: b'\x8a\xa9\x03;e\x91SR\x0b\xf6#]\x17\xfen\x1b\x04\xe1sW\x0c\xffh\x16\x06\xf0mZ\x17\xe8.\x0eQ\xa74X\x01\xf2;\r]\xbcd\tP\xf9p'
2021-08-19 13:38:01.567456 [debug    ] ++Sent decoded: fin=1 opcode=10 data=b'Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.572026 [debug    ] ++Rcv raw: b'\x89)Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.572865 [debug    ] ++Rcv decoded: fin=1 opcode=9 data=b'Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.574623 [debug    ] ++Sent raw: b'\x8a\xa9\x08*\x16HXCx/(Ld\'e\nw8xF\x7f&c\x07u)fKd1%\x1f"~?Ir+0\x1c.eo\x18# {'
2021-08-19 13:38:01.574990 [debug    ] ++Sent decoded: fin=1 opcode=10 data=b'Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.580777 [debug    ] ++Rcv raw: b'\x89)Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.581151 [debug    ] ++Rcv decoded: fin=1 opcode=9 data=b'Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.581619 [debug    ] ++Sent raw: b'\x8a\xa9\xa5\xc3\xc3\xf5\xf5\xaa\xad\x92\x85\xa5\xb1\x9a\xc8\xe3\xa2\x85\xd5\xaf\xaa\x9b\xce\xee\xa0\x94\xcb\xa2\xb1\x8c\x88\xf6\xf7\xc3\x92\xa0\xa7\x96\x9d\xf5\xfb\xd8\xc2\xf1\xf6\x9d\xd6'
2021-08-19 13:38:01.582048 [debug    ] ++Sent decoded: fin=1 opcode=10 data=b'Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.582816 [debug    ] ++Rcv raw: b'\x89)Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.583121 [debug    ] ++Rcv decoded: fin=1 opcode=9 data=b'Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.583711 [debug    ] ++Sent raw: b'\x8a\xa9Rt\xa6\x0f\x02\x1d\xc8hr\x12\xd4`?T\xc7\x7f"\x18\xcfa9Y\xc5n<\x15\xd4v\x7fA\x929e\x17\xc2ljB\x9e"5F\x93g!'
2021-08-19 13:38:01.584062 [debug    ] ++Sent decoded: fin=1 opcode=10 data=b'Ping from applink-canary-5467cdc868-g25hs'
2021-08-19 13:38:01.585277 [error    ] on_error invoked (error: WebSocketConnectionClosedException, message: Connection to remote host was lost.)
2021-08-19 13:38:01.935678 [error    ] Traceback (most recent call last):
  File "/Users/ian/10gen/mothra/foo.py", line 23, in bar
    slack.chat_postMessage(
  File "/Users/ian/code/10gen/mothra/.venv/lib/python3.9/site-packages/slack_sdk/web/client.py", line 1266, in chat_postMessage
    return self.api_call("chat.postMessage", json=kwargs)
  File "/Users/ian/code/10gen/mothra/.venv/lib/python3.9/site-packages/slack_sdk/web/base_client.py", line 135, in api_call
    return self._sync_send(api_url=api_url, req_args=req_args)
  File "/Users/ian/code/10gen/mothra/.venv/lib/python3.9/site-packages/slack_sdk/web/base_client.py", line 171, in _sync_send
    return self._urllib_api_call(
  File "/Users/ian/code/10gen/mothra/.venv/lib/python3.9/site-packages/slack_sdk/web/base_client.py", line 303, in _urllib_api_call
    return SlackResponse(
  File "/Users/ian/code/10gen/mothra/.venv/lib/python3.9/site-packages/slack_sdk/web/slack_response.py", line 204, in validate
    raise e.SlackApiError(message=msg, response=self)
slack_sdk.errors.SlackApiError: The request to the Slack API failed.
The server responded with: {'ok': False, 'error': 'channel_not_found'}
--- request header ---
2021-08-19 13:38:02.520684 [debug    ] --- request header ---
GET /link/?ticket=ef1b4d46-e81b-4268-8f32-a984d529ef7f&app_id=fa84b3a846b741569a6cbf95af546404fceb70f9deb34bb48e2e384da83faaa2 HTTP/1.1
Upgrade: websocket
Host: wss-primary.slack.com
Origin: http://wss-primary.slack.com
Sec-WebSocket-Key: gYz1neO156Ngg+45IVaCkQ==
Sec-WebSocket-Version: 13
Connection: Upgrade


2021-08-19 13:38:02.521086 [debug    ] GET /link/?ticket=ef1b4d46-e81b-4268-8f32-a984d529ef7f&app_id=fa84b3a846b741569a6cbf95af546404fceb70f9deb34bb48e2e384da83faaa2 HTTP/1.1
Upgrade: websocket
Host: wss-primary.slack.com
Origin: http://wss-primary.slack.com
Sec-WebSocket-Key: gYz1neO156Ngg+45IVaCkQ==
Sec-WebSocket-Version: 13
Connection: Upgrade

2021-08-19 13:38:02.521480 [debug    ] -----------------------
2021-08-19 13:38:02.521874 [debug    ] --- response header ---
2021-08-19 13:38:02.541301 [debug    ] HTTP/1.1 101 Switching Protocols
2021-08-19 13:38:02.541848 [debug    ] upgrade: websocket
2021-08-19 13:38:02.542421 [debug    ] connection: upgrade
2021-08-19 13:38:02.543265 [debug    ] sec-websocket-accept: jBgAPGve7avK99CAimVrhq4qjkk=
2021-08-19 13:38:02.543854 [debug    ] date: Thu, 19 Aug 2021 13:38:02 GMT
2021-08-19 13:38:02.544530 [debug    ] server: envoy
2021-08-19 13:38:02.545074 [debug    ] via: envoy-wss-iad-crpa
2021-08-19 13:38:02.545351 [debug    ] -----------------------
2021-08-19 13:38:02.552737 [debug    ] ++Rcv raw: b'\x81\x1cUNAUTHENTICATED: cache_error'
2021-08-19 13:38:02.553021 [debug    ] ++Rcv decoded: fin=1 opcode=1 data=b'UNAUTHENTICATED: cache_error'
2021-08-19 13:38:07.569974 [error    ] on_error invoked (error: WebSocketConnectionClosedException, message: Connection to remote host was lost.)

Interestingly the logs will continue to spam the on_error message but subsequent calls to chat_postMessage will succeed, while using shortcuts in Slack that call out to my app will fail.

Is that at all helpful?

edit: also of note is that I'm just running app.py directly rather than using gunicorn locally.

@seratch
Copy link
Member

seratch commented Aug 19, 2021

@IanWhalen Thanks for sharing this. I can check this tomorrow in my timezone.

But one thing I'm wondering is that this may not be a reproducer of the issue that you mentioned in the description. My understanding of this is that, as you paused the app for a while, the underlying WebSocket client lost the connection and it automatically reconnected after the pause. Indeed, this is not a normal situation but the Socket Mode client has successfully recovered without restarting it.

Also, I'm curious that what happens if you do the same experiment with the built-in Socket Mode client.

Given that auto_reconnect_enabled defaults to True, I would expect any failures to just result in the app reconnecting.

Can you elaborate this part in your description? Does this mean your production app failed to auto reconnect when the exception was thrown? Or, while you found the error logs, did the app continue working after that (=auto-reconnection still works)?

@seratch
Copy link
Member

seratch commented Aug 19, 2021

Also, if the only reason that you don't use the built-in client is noisy BlockingIOError error logs, we are happy to improve the logging in the next release. Getting BlockingIOError may be inevitable in some situations (such as containers or a specific operating system), but as long as the frequency of reconnection is fine, just adjusting the logs in the scenario should be reasonable.

@seratch
Copy link
Member

seratch commented Sep 24, 2021

Hi @IanWhalen, I'm not sure if you're still checking this issue but recently we've applied a few minor improvements to the websocket-client module based Socket Mode client. I don't think that those updates can eliminate the error you've faced but you may see some difference with the latest version (slack-sdk 3.11.2, slack-bolt 1.9.1 as of today).

@IanWhalen
Copy link
Author

Thanks @seratch.

I'm reading through the release notes now. Am I correct that https://github.com/slackapi/python-slack-sdk/pull/1114/files is where you made most of the improvements?

And given the flow of this conversation, I think you can feel pretty good about resolving this issue for now and we'll try out the updated versions.

@seratch
Copy link
Member

seratch commented Nov 2, 2021

@IanWhalen Sorry for my very late response here. I don't think the diff you've mentioned can eliminate your issue here but trying the latest versions of both slack-bolt and websocket-client may be worth trying. I hope the error will be resolved for you soon.

@khyatielhance
Copy link

@IanWhalen : Did upgrading the versions resolved your issue. I am facing a similar issue. When i am only running flask app on kubernetes pod, it runs fine but when i run bolt+flask app, it is throwing exception

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants