[🐛 Bug]: Under high load, timed out sessions are not removed from session queue #13723
Comments
@bhecquet, thank you for creating this issue. We will troubleshoot it as soon as we can.
Info for maintainers
Triage this issue by using labels.
If information is missing, add a helpful comment and then
If the issue is a question, add the
If the issue is valid but there is no time to troubleshoot it, consider adding the
If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C),
add the applicable
After troubleshooting the issue, please add the Thank you! |
I may have a clue on the root cause |
Hello, after investigating and correcting my implementation, I think there is still a weakness in the LocalDistributor implementation
Then all creation threads will be stuck, while new session requests keep arriving at regular intervals because one node can still accept sessions. The LocalDistributor.sessionCreatorExecutor will then contain session requests that may already have expired. Increasing the thread pool size could be a workaround, but would there be a means to check that the sessionCreatorExecutor has room for creating sessions? |
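One possible way to answer the "has room" question, as a minimal sketch only: it assumes the executor is backed by a `java.util.concurrent.ThreadPoolExecutor`, which is not confirmed for `sessionCreatorExecutor` here. The class `ExecutorRoomCheck` and the method `hasRoom` are hypothetical names used for illustration, not Selenium APIs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Selenium.
final class ExecutorRoomCheck {

  // True when nothing is waiting in the queue and at least one worker is idle.
  // getActiveCount() is documented as approximate, so treat the result as a hint.
  static boolean hasRoom(ThreadPoolExecutor executor) {
    return executor.getQueue().isEmpty()
        && executor.getActiveCount() < executor.getMaximumPoolSize();
  }

  public static void main(String[] args) {
    // Assumed stand-in for sessionCreatorExecutor: a pool with a single worker.
    ThreadPoolExecutor executor =
        new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    CountDownLatch release = new CountDownLatch(1);
    Runnable slowCreation = () -> {
      try {
        release.await(); // simulates a node that is slow to create a session
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };

    System.out.println("idle pool has room: " + hasRoom(executor));
    executor.execute(slowCreation); // occupies the only worker
    executor.execute(slowCreation); // has to wait in the executor queue
    System.out.println("saturated pool has room: " + hasRoom(executor));

    release.countDown();
    executor.shutdown();
  }
}
```

A caller could skip polling the new session queue while `hasRoom` returns false, so requests keep waiting in the queue (where the timeout is enforced) instead of in the executor.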
I see that you have few resources, and when a Node is slow in creating a session, it affects the rest because not many threads are being processed. You could increase the |
Hello @diemol , Thanks for your reply |
You are correct; |
Hello, I increased the thread pool size and also added a filter on the sessionCreatorExecutor queue, so that no new session requests are added while some are still waiting in the queue. |
Thank you for sharing that information. Do you think we need to do something else or we can close this issue? |
As I said, I think something could be improved in the handling of new session requests, especially when the hub or a node is slow, namely to avoid creating sessions that are already timed out. For example, one could add a job (like the one in LocalNewSessionQueue) that removes timed-out session requests before trying to create them. But if I'm the only one to have this problem, then I can live with my workaround, which does this (and a bit more) |
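The idea of removing timed-out requests before creating them could look roughly like the sketch below. It is not the Selenium implementation: `PendingRequest`, `isExpired` and `keepLive` are hypothetical names, and the 30-second timeout mirrors the `--session-request-timeout 30` setting from the report.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Selenium.
final class ExpiredRequestFilter {

  // Hypothetical stand-in for a queued new-session request.
  static final class PendingRequest {
    final String id;
    final Instant enqueuedAt;

    PendingRequest(String id, Instant enqueuedAt) {
      this.id = id;
      this.enqueuedAt = enqueuedAt;
    }
  }

  // A request is expired once "now" has reached enqueuedAt + timeout.
  static boolean isExpired(PendingRequest request, Duration timeout, Instant now) {
    return !request.enqueuedAt.plus(timeout).isAfter(now);
  }

  // Keeps only the requests that are still worth creating a session for.
  static List<PendingRequest> keepLive(List<PendingRequest> batch, Duration timeout, Instant now) {
    List<PendingRequest> live = new ArrayList<>();
    for (PendingRequest request : batch) {
      if (!isExpired(request, timeout, now)) {
        live.add(request);
      }
    }
    return live;
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    Duration timeout = Duration.ofSeconds(30);
    List<PendingRequest> batch = Arrays.asList(
        new PendingRequest("stale", now.minusSeconds(40)),
        new PendingRequest("fresh", now.minusSeconds(10)));

    for (PendingRequest request : keepLive(batch, timeout, now)) {
      System.out.println("would create a session for: " + request.id);
    }
  }
}
```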
But if a session request has timed out, the Distributor should not be able to retrieve it. Maybe I do not understand the scenario. |
In the normal process:
In case of high load (tens of sessions received by the hub and few nodes to handle them) AND session-timeout (in my case 30 secs) set on NewSessionQueue:
|
But the Distributor only takes more session requests if there is a stereotype (slot) available on the Node. That is why I am confused. |
I think you are talking about this portion of code: This only checks whether at least one slot is available. If that is the case, then (as far as I understand the code) all the slots are returned. Imagine you have a node that can handle:
|
@bhecquet starting more sessions than slots is prevented later in these lines: selenium/java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java Lines 550 to 565 in b7d831d
|
@joerg1985, yes, sure, but the problem is the delay in starting sessions when nodes / distributor are slow, not the fact that more sessions are created than expected |
@bhecquet what I meant to say: rejecting the new session requests that exceed the number of slots happens pretty fast and should not add too much processing overhead. |
@joerg1985 , you're right, this step is quick. But if a slot is free at this moment, a session will be created even though the request may already have expired, which takes time on an overloaded LocalDistributor. |
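To illustrate the concern above, here is a minimal sketch of a creation task that re-checks the request deadline at the moment it actually starts running, after any time spent waiting in the executor. `DeadlineGuard`, `guarded` and the callbacks are hypothetical names chosen for this example; nothing here is taken from LocalDistributor.

```java
import java.time.Clock;
import java.time.Instant;

// Hypothetical helper, not part of Selenium.
final class DeadlineGuard {

  // Wraps a creation task so the deadline is evaluated when the task runs,
  // not when it was submitted to the executor.
  static Runnable guarded(
      Instant deadline, Clock clock, Runnable createSession, Runnable onExpired) {
    return () -> {
      if (clock.instant().isBefore(deadline)) {
        createSession.run();
      } else {
        onExpired.run(); // skip the costly browser start for a dead request
      }
    };
  }

  public static void main(String[] args) {
    Clock clock = Clock.systemUTC();
    // Assumed deadline: the request already expired five seconds ago.
    Instant deadline = clock.instant().minusSeconds(5);

    Runnable task = guarded(
        deadline,
        clock,
        () -> System.out.println("creating session"),
        () -> System.out.println("request expired while waiting, skipping creation"));
    task.run();
  }
}
```

The check is cheap compared to starting a browser, so an overloaded executor would spend its threads only on requests whose client is still waiting.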
What happened?
Hello,
on our setup (a hub with 15 nodes), under high load, we see that the hub tries to create sessions that are already timed out.
I can reproduce this with a rather smaller setup:
--session-request-timeout 30 --session-retry-interval 1000 --reject-unsupported-caps true
Hub is running on a small server (1 CPU)
This test is very simple: it starts a browser, fills 2 text fields and clicks on 1 button. If necessary I will provide a pure Selenium script (for now, it's using our framework, but I'm pretty sure the problem is not there)
Expected results
Grid continues to create sessions according to its capacity
Current result:
No more sessions are created: this is due to #12848 (which is a good point), which kills the browser immediately if the session is timed out
At this point, client (see the logs) receives the message "New session request timed out " as expected
Reading the code in LocalNewSessionQueue, I can't see why this can happen because there are 2 guards
selenium/java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java
Line 160 in 49214cd
https://github.com/SeleniumHQ/selenium/blob/49214cd40467c2926d1b450d86dde9363bd0acd3/java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java#L235C1-L240C8
which removes session request from queue in any case
I've added some logs to grid and to capabilities to follow the session request flow
Looking at the logs, I can see 2 problems:
Here, the session request is not in the queue anymore so it cannot be timed out
But handleNewSessionRequest (17:04:18,884) is done very late, I suppose because of queueing in sessionCreatorExecutor
Can you confirm my analysis?
A workaround would be to increase the number of CPUs or the thread pool size of the executor, but this would only be a workaround
I'll try to come up with a fix, but don't hesitate to suggest ideas
Bertrand
How can we reproduce the issue?
Relevant log output
Operating System
Linux
Selenium version
4.11.0
What are the browser(s) and version(s) where you see this issue?
Chrome / not related to browser
What are the browser driver(s) and version(s) where you see this issue?
not related to browser
Are you using Selenium Grid?
4.16.1