Socket Mode disconnection issue with the aiohttp-based client #1110
Hi @matthieucan, thanks for taking the time to report this issue (again)! I'm sorry to hear that the fix we had didn't help for your case. I will try to reproduce the situation and do more testing on my end but can I ask you to help us identify the cause of the issue by running a simple app with debug level logging? Having the following lines of code at the top of your source code enables the debug level logging.
import logging
logging.basicConfig(level=logging.DEBUG)
If you find any suspicious behaviors in the detailed logs, sharing the outputs with us would be greatly appreciated. |
Hi @seratch, thank you, I'm now running it with debug logs. I can't guarantee I'll let it run until the bug is triggered, as this application is critical enough to warrant reverting the Slack Bolt upgrade. I'll let you know! |
Thanks for your help and we're sorry about the disruption of your apps due to this issue. I have been running a simple Socket Mode app in the same Docker image for two days but I haven't managed to reproduce the situation yet. I'm thinking that some race conditions can arise only when an app is receiving more requests. FWIW, the following is the example app I'm using now:
main.py:
import logging
logging.basicConfig(level=logging.DEBUG)
import os
from slack_bolt.async_app import AsyncApp
app = AsyncApp(token=os.environ["SLACK_BOT_TOKEN"])

@app.command("/do-something")
async def handle_some_command(ack, body, logger):
    await ack()
    logger.info(body)

from slack_bolt.adapter.socket_mode.async_handler import AsyncSocketModeHandler

async def main():
    handler = AsyncSocketModeHandler(app, os.environ["SLACK_APP_TOKEN"])
    await handler.start_async()

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
Dockerfile:
FROM python:3.8.10-slim-buster as builder
RUN apt-get update && apt-get clean
COPY requirements.txt /build/
WORKDIR /build/
RUN pip install -U pip && pip install -r requirements.txt
FROM python:3.8.10-slim-buster as app
COPY --from=builder /build/ /app/
COPY --from=builder /usr/local/lib/ /usr/local/lib/
WORKDIR /app/
COPY *.py /app/
ENTRYPOINT python main.py
requirements.txt:
|
Hi @seratch,
This looks correct. For the record I have a copy of the app running in a development cluster, that doesn't receive nearly as many requests, and it never broke like the prod one. |
Hi @matthieucan, we've merged more fixes for this issue at #1112 and released a pre-release version - v3.11.0rc1: https://pypi.org/project/slack-sdk/3.11.0rc1/ May I ask you to try the RC1 version in your app? |
Thanks a lot @seratch! I'm now running this pre-release. I'll let you know if anything happens (or not) :) |
Hi @seratch, the issue happened again. Here are the DEBUG logs:
Let me know if there's anything else I can help you with! |
@matthieucan Thanks for sharing the logs. With the given information, your app became unable to connect to and keep a connection with the Slack server side for some reason. Then, the reconnection started failing in the environment.
These lines are unexpected. Although the reconnection succeeded 10 seconds before the If you don't mind, let me ask a few follow-up questions:
Lastly, we'll continue trying to resolve this issue but we are still trying to identify the cause. Thus, we may need more time to completely resolve this issue. As of today, we've never received similar issue reports about the built-in Socket Mode client, which is the most commonly used implementation. If you don't have a strong reason to choose an asyncio-based solution and you don't mind spending more time to change your app, switching to the built-in one may help. |
I just reopened this issue and will continue trying to resolve this in the next version. |
As a first step, I created a minor update pull request that may mitigate the incorrect behavior here: #1117 |
Hi @seratch, thanks a lot for investigating.
No I can't find anything related. Those are the logs of the app before that time:
Nothing special in there, a non-Slack aiohttp endpoint is hit and replies with HTTP 200. Worth noting that this non-Slack endpoint continues to function when the Slack app becomes non-reachable. Other logs concern K8s healthcheck probes, nothing different than usual there.
I did a search in all logs and found these occurrences:
This is very interesting:
No, nothing relevant until I manually killed the app ~1h later.
I confirm it's the same, I was informed around 1 hour later (than those logs) that the app was not responsive (so I can only assume it was not responsive in the last hour).
Thank you so much for your efforts. I can't easily move away from the aiohttp integration, as the app is integrated with a few non-Slack endpoints, but I'll keep that in mind. |
@matthieucan Thanks for sharing further info. So, you're running the If so, your app can use Although I don't think that being part of AIOHTTP web app can be the cause of the connectivity issue, I've never done any testing with such an application. I will check if there is any difference in the case by running a simple web app. |
Yes, that's correct.
This sounds like a very nice mitigation strategy, albeit one that requires an additional background task. Do you think something like checking every 10 seconds is fine?
Super, thank you! |
Yes, I do. Also, for better safety, I would recommend recreating your client if the background job sees two or more consecutive False responses. |
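For illustration, the periodic check discussed above (poll roughly every 10 seconds, recreate the client after two or more consecutive False results) could be sketched as follows. This is only a minimal sketch, not the SDK's implementation: `watch_connection`, `is_connected`, `recreate_client`, and `max_checks` are hypothetical names introduced here, and the two callables are assumed to be async functions that the app supplies to wrap whatever connectivity check and reconnection logic its Socket Mode client offers.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def watch_connection(
    is_connected: Callable[[], Awaitable[bool]],     # hypothetical: app-supplied connectivity check
    recreate_client: Callable[[], Awaitable[None]],  # hypothetical: app-supplied reconnect logic
    interval_seconds: float = 10.0,
    max_failures: int = 2,
    max_checks: Optional[int] = None,                # None means run forever; a bound helps in tests
) -> int:
    """Poll the connection state and recreate the client after
    `max_failures` consecutive failed checks. Returns the number of
    times the client was recreated."""
    failures = 0
    recreations = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            connected = await is_connected()
        except Exception:
            # Treat an error while checking as a failed check.
            connected = False
        if connected:
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                await recreate_client()
                recreations += 1
                failures = 0
        await asyncio.sleep(interval_seconds)
    return recreations
```

In an asyncio app this would typically be started once with `asyncio.create_task(watch_connection(...))` next to the existing handlers, so the watchdog runs in the background without blocking the web endpoints.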
Hi @matthieucan, I just released a new patch version, which includes many improvements on the asyncio-based Socket Mode clients: https://github.com/slackapi/python-slack-sdk/releases/tag/v3.11.1 I hope the changes eliminate the issues you're facing! But, even if they only mitigate, newly added debug-level logging will provide more useful information for further investigation. Also, as the Thanks for being patient with this issue. Please try the latest version if you have a chance 🙇 |
Hi @seratch , |
@seratch I'm encountering similar issues.
The Slack SDK version: slack-bolt==1.9.2
Python runtime version: Python 3.8.12
(Also running in a Docker container based on python:3.8-slim-buster)
I'm experiencing consistent 'missed' messages and delayed message processing (sometimes 5-6 minutes) that are predicated on websocket connection recycling related to this issue, e.g.
Implementation is pretty much spot on the documentation:
The common pattern would be, the bot processes a single message correctly. Another message executed directly after the initial one is not detected and there's no logging (indicating the message isn't received from the websocket) and then some arbitrary time later the message will get processed--this could be 30 seconds and sometimes I've seen over 5 minutes. While waiting for a reset of the websocket connection, debug lines like this appear:
I'm not sure if this is normal behavior and a red herring, or indicative of an issue with the app. Unfortunately, there aren't really any other meaningful logs in the stream, and I've verified with my networking team that there are no policies affecting connections to Slack's API. Thanks in advance. |
Hi @seratch, Thank you in advance for your help!
|
Hi @seratch, this didn't happen since my last report above. I believe we can close this issue :) Thank you again for your help! |
That's great to hear! Thanks for sharing the result 👍 Let me close this issue now. |
Hello,
I'm following up after the issue #1065 and upgrading to Slack Bolt 1.8.
Shortly after upgrading, I experienced disconnections. This happened twice in the last 3 days, whereas with the previous version the disconnection happened every few weeks.
What's visible from my side is a Slack application still running, but unreachable through Slack slash commands.
There are no logs related to the websocket, contrary to #1065. The bug symptoms are the same though, which makes me believe it's related.
Reproducible in:
The Slack SDK version
Python runtime version
OS info
(running in a Docker container based on python:3.8.10-slim-buster)
Steps to reproduce:
Run the app with socket mode for a few days.
Expected result:
The app is always reachable.
Actual result:
Slash commands or interacting with the app messages lead to the Slack error: /foo failed with the error "dispatch failed".
Thank you for your consideration!