[🐛 Bug]: Chrome nodes stuck on Termination state #2168
Comments
@Aymen-Ben-S, thank you for creating this issue. We will troubleshoot it as soon as we can.

Info for maintainers: triage this issue by using labels.
- If information is missing, add a helpful comment and then the appropriate label.
- If the issue is a question, add the corresponding label.
- If the issue is valid but there is no time to troubleshoot it, consider adding the corresponding label.
- If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable label.
- After troubleshooting the issue, please add the appropriate label.

Thank you!
Just to add some additional context about the problem above: it is the same one mentioned here: #2129 (comment). At that time @VietND96 thought preStop was stuck and suggested getting some logs from there. Are these logs the same ones we have in the chrome node (obtained via kubectl logs), or is there another log persisted inside the container and not exposed? We are using 4.18 images as mentioned above; the chrome node has no errors in its log, but the pod does not terminate even after its processes receive the SIGTERM (they stay as Running and/or Terminating, if we restart hub/keda). We do not see this behavior consistently, but once the first node hangs, all the others will hang. It seems the cluster/grid enters a state where selenium processes cannot be killed and nodes live forever.
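For reference, these generic kubectl invocations cover the usual places such logs can surface; the pod, namespace, and container names below are placeholders, not the chart's actual names:

```bash
# Logs of the node container in a running or terminating pod
kubectl logs <chrome-node-pod> -n selenium -c <node-container>

# Logs of the previous container instance, if the container was restarted
kubectl logs <chrome-node-pod> -n selenium -c <node-container> --previous

# Output of an exec-style preStop hook is not written to the container's
# stdout, so a failing hook typically only shows up as a pod event
kubectl describe pod <chrome-node-pod> -n selenium
```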
Your K8s env is using the containerd runtime, right? I am asking this since, when looking into the message
@VietND96 our runtime is really running in containers in the same cluster, which are connected with the hub via Remote WebDriver (Java client). We are not using Thanks for the link to the other issue... overall the way we reached the problem is different, though the absence of that file/dir is much the same. We will keep an eye on that as well.
Can you also try to set
Just to add two more examples... I have in my live env here one chrome node in Running state and one chrome node in Terminating state... both are stuck (usually the "running" stuck pods turn to "terminating" when we restart the hub/keda pods). I do not see events similar to the one posted above by @Aymen-Ben-S, but I see preStop failing anyway. Here is the pod describe output.

Stuck Running Node
What is interesting with this running pod is that it has no Event associated, but if we look at its logs, the selenium processes were supposed to have terminated, as they got SIGTERM:
Stuck Terminating Node
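For anyone hitting the same symptom, a few generic checks (not specific to this chart) can help distinguish a pod held by a finalizer from one whose processes are ignoring SIGTERM; names below are placeholders:

```bash
# A non-empty finalizer list explains a pod parked in Terminating
kubectl get pod <stuck-pod> -n selenium -o jsonpath='{.metadata.finalizers}'

# Events and container states show whether preStop or SIGTERM ever ran
kubectl describe pod <stuck-pod> -n selenium

# Last resort: remove the pod object from the API server. Note this does
# not guarantee the processes on the node are actually gone.
kubectl delete pod <stuck-pod> -n selenium --grace-period=0 --force
```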
@VietND96 just to give some feedback: we have not noticed any node hanging so far, after:
We will keep an eye on our next test runs anyway, but so far we have had very good results compared with previous runs.
@VietND96 @Aymen-Ben-S we noticed hanging in one of our environments. Here are some details.

List of pods in the selenium namespace, sorted by created date: only the top chrome node is really running; the one below, even though it shows as Running, hung. Here is the describe output from this pod (I noticed a startup probe warning, but that is all):
Logs from that same pod:
I have also noticed an error in keda-operator-metrics; take a look:
And here, just as a reference, is the log from keda-operator (beginning):
...and here are some of the last lines:
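As a side note, a listing like the one described above (pods sorted by created date) can be produced with a standard kubectl flag:

```bash
# Pods in the selenium namespace, oldest first; -o wide adds node and IP
kubectl get pods -n selenium --sort-by=.metadata.creationTimestamp -o wide
```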
@alequint, thanks for your continuous feedback. In chart 0.29.1, I just updated the preStop script with the expectation that it will prevent it from getting stuck somewhere.
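For context, the preStop hook runs before the container receives SIGTERM, and for a Grid node it is typically responsible for draining the node and waiting for active sessions to finish. Below is a minimal sketch of that idea, assuming Grid's documented drain endpoint and a hypothetical SE_NODE_URL variable; it is not the chart's actual script:

```bash
#!/usr/bin/env bash
# Hypothetical preStop sketch: stop accepting new sessions, then wait
# (up to a deadline) for in-flight sessions to finish.
NODE_URL="${SE_NODE_URL:-http://localhost:5555}"   # assumed node address
DEADLINE=$((SECONDS + 300))                        # assumed 5-minute cap

# Ask the node to drain; Selenium Grid exposes a drain endpoint guarded by
# the registration secret (empty header when no secret is configured).
curl -sf -X POST "${NODE_URL}/se/grid/node/drain" \
     -H 'X-REGISTRATION-SECRET;' || true

# Poll the node status until no sessions remain or the deadline passes.
while [ "$SECONDS" -lt "$DEADLINE" ]; do
  active=$(curl -sf "${NODE_URL}/status" | grep -c '"sessionId"' || true)
  [ "${active:-0}" -eq 0 ] && break
  sleep 5
done
```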
Thanks @VietND96... we plan to upgrade our images to the most recent release and try again. Considering we are using
Hi @VietND96, I'm having this problem too, but in k8s with chart 0.30.0, selenium/node-chrome:124.0 and selenium/video:ffmpeg-7.0-20240425
video container:
@Doofus100500, may I know some info:
@VietND96 Going through different versions of the images didn't lead anywhere; the issue with video continuing to be recorded after the file is sent remains, if I understood the problem correctly.
@Doofus100500, how about the K8s version that you are using? I tried v1.30.0 and saw that the video file was broken as you mentioned. |
@VietND96
@VietND96 Can you try in v1.30.0 with chart 0.29.1?
Sure, I am in progress reproducing and figuring out the issue. I will have a patch if anything can be fixed, or roll back the ffmpeg version.
@VietND96 An interesting observation: the problem does not occur if there are plenty of resources in the namespace and sessions do not wait in the queue. Maybe it will be useful for you.
@Doofus100500, looks like rolling back ffmpeg
@Doofus100500, new image tag and chart
@VietND96 With the video files everything is good now, but the pods still remain in the Terminating status.
That's wonderful, thank you, it will be helpful!
@Doofus100500, can you try to enable
And in k9s, in the video container, I see:
Looks interesting; the record container was terminated by SIGKILL. Meanwhile, the Node container continues with a few sessions. Let me check on this further.
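That SIGKILL is consistent with the kubelet's standard shutdown sequence: preStop and SIGTERM first, then SIGKILL once the pod's terminationGracePeriodSeconds expires. One quick check is whether the configured grace period is shorter than a typical video upload; the pod name below is a placeholder:

```bash
# Grace period the kubelet allows before escalating SIGTERM to SIGKILL
kubectl get pod <video-pod> -n selenium \
  -o jsonpath='{.spec.terminationGracePeriodSeconds}'
```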
I'm seeing this issue on 0.30.1. I set
Followed by:
But the pod never actually terminates.
What happened?
We have consistent behavior where Chrome nodes get stuck in the Terminating state.
I'm not sure I can provide the exact steps to reproduce but I'm happy to share logs from a system where this is happening.
Command used to start Selenium Grid with Docker (or Kubernetes)
Relevant log output
Operating System
Openshift 4.12.49
Docker Selenium version (image tag)
4.18.1
Selenium Grid chart version (chart version)
0.28.4