Deadlock inside scalatest async tests #3848
I am sorry that you are experiencing this hang. I know from experience that hangs inside a large … I am off-line for most of the next two months, but let's see if we can make some progress.

I am guessing that the operating system is OpenBSD? 64bit?

You may have _way_ more experience chasing network and thread problems than I, so let's put our heads together.

At the root of a decision tree, I think there are three primary possibilities. In my guestimate order of probability:

1) `_libc_recvfrom_cancel` is legitimately waiting in the `recvfrom` portion for a message which will never come, i.e. a never-sent or dropped message. SN did the transform from the `read` call to `recvfrom`.

2) `_libc_recvfrom_cancel` has been interrupted, possibly by the indicated or another `_rthread_cond_timedwait`, and is stuck inside the `_cancel` portion of `_libc_recvfrom_cancel`.

3) The `java.lang.impl.PosixThreadD4parkjzuEO` framework is messed up.

I took a quick look at the OpenBSD `$OpenBSD: w_recvfrom.c,v 1.1 2016/05/07 19:05:22` code and, due to macros in the code, could not distinguish cases 1 & 2. If we are indeed talking about OpenBSD, I'd have to trace into the macros.

Something quick to try, worth a 1/2 hour of time, not days.

Focusing on `_rthread_cond_timedwait`:

1) If the `cond` timeout value is small, say a few seconds or less, for debugging try increasing it to minutes, say 5. The wild thrash is to give reads time to complete if they are "ever" going to do so. This trial does not entirely rule out the timer having fired and the signal(?) to `recvfrom` not arriving or causing a hang. A decrease in the apparent failure rate would give evidence that case 1 is happening.

2) If the `cond` timeout value is large, say days or months, try setting it down to, say, 5 times the time in seconds you expect your computation-under-test to take. If case 1 above is happening, this should increase the apparent failure rate and thereby give evidence that case 1 is happening.

Focusing on the `recvfrom` part of `_libc_recvfrom_cancel`:

3) Try using the javalib Socket `setSoTimeout(int timeout)` method to set the OS socket timeout to a few multiples of your expected computation-under-test time. The argument is Java-style milliseconds, so one must be careful with seconds vs milliseconds. This (debug) approach should put any 'timeout' handling down at the OS level and avoid any userspace (quasi-)signal handling and/or C signal blocking and/or C-signal insufficient strength. (I personally generally design to avoid C signals because of their ~~quirks~~ features. I am told that on some operating systems one must enable socket interruptibility-via-signals and/or avoid some other thread adversely manipulating the signal mask.)

Is the intent that this code run on FreeBSD? macOS? Linux?

I am trying to avoid suggesting code changes until we have more evidence pointing to a culprit. IIUC, your interest is in the underlying crypto code and not the socket I/O or test-framework part, and you are kindly reporting what appears to be sand in the gears of SN.

###### Complexity I am trying to avoid

In describing this, I do not mean to offend either you or the code under discussion. Having one thread cancel I/O in another is a well-known problem. Very few people get it right and, I must confess, I am not one of those people.

The usual solution is to have the `read` thread do its own timeout using `kqueue()` `EVFILT_TIMER` on *BSD & macOS and `epoll` with a `timerfd` on Linux (yes, OS-specific code). Setting SO_TIMEOUT on the OS socket, as suggested for debugging above, is a more historical approach. `kqueue()` allows the application to attempt some recovery: more flexibility; more complexity.

`javalib/src/main/scala/java/lang/process/UnixProcessGen2.scala` gives a somewhat close example, for a process instead of a timer. Sorry, the code is ugly/inefficient but effective. A `timedRead` implementation would have to `kqueue` for the OS socket id (fetched from the Java Socket) using `EVFILT_TIMER`.
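To make the `setSoTimeout` suggestion in point 3 concrete, here is a minimal sketch. The endpoint, buffer size, and the 30-second figure are illustrative placeholders, not values taken from this issue:

```scala
import java.net.{Socket, SocketTimeoutException}

object SoTimeoutProbe {
  def main(args: Array[String]): Unit = {
    // Hypothetical endpoint; substitute whatever the test suite actually talks to.
    val socket = new Socket("127.0.0.1", 9000)
    // Java-style milliseconds: 30 s as "a few multiples" of the expected
    // computation-under-test time.
    socket.setSoTimeout(30 * 1000)
    val buf = new Array[Byte](4096)
    try {
      val n = socket.getInputStream.read(buf) // now bounded at the OS level
      println(s"read $n bytes")
    } catch {
      case _: SocketTimeoutException =>
        // The read gave up cleanly instead of hanging inside recvfrom.
        println("read timed out")
    } finally socket.close()
  }
}
```

If runs that previously hung now end in a `SocketTimeoutException`, that points at a read which was never going to complete (case 1) rather than at the cancellation machinery.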
@LeeTibbert, thanks for the reply.
On Sun, 24 Mar 2024 16:04:55 +0100, LeeTibbert wrote:
> I am guessing that the operating system is OpenBSD? 64bit?

Yes, it is 64bit OpenBSD.
> You may have _way_ more experience chasing network and thread problems than I,
> so let's put our heads together.
>
> At the root of a decision tree, I think there are three primary possibilities. In my guestimate
> order of probability:
>
> 1) `_libc_recvfrom_cancel` is legitimately waiting in the `recvfrom` portion for a message which will never come, i.e. a never-sent or dropped message. SN did the transform from the `read` call to `recvfrom`.

agreed
> 2) `_libc_recvfrom_cancel` has been interrupted, possibly by the indicated or another
> `_rthread_cond_timedwait`, and is stuck inside the `_cancel` portion of `_libc_recvfrom_cancel`

I see one other possible reason for that: attaching with gdb.
> 3) The `java.lang.impl.PosixThreadD4parkjzuEO` framework is messed up.

agreed
> I took a quick look at the OpenBSD `$OpenBSD: w_recvfrom.c,v 1.1 2016/05/07 19:05:22` code
> and, due to macros in the code, could not distinguish cases 1 & 2. If we are indeed talking about
> OpenBSD, I'd have to trace into the macros.

The right move is to run the test program via ktrace and inspect the kdump output. It should show
things like network writes and signals, but I'm not sure about thread interrupts.
The simplest way is to run it with sbt, which leads to a huge dump output that
might not be easy to read and investigate.

> Something quick to try, worth a 1/2 hour of time, not days.
>
> Focusing on `_rthread_cond_timedwait`:
>
> 1) If the `cond` timeout value is small, say a few seconds or less, for debugging try increasing it to minutes, say 5. The wild thrash is to give reads time to complete if they are "ever" going to do so. This trial does not entirely rule out the timer having fired and the signal(?) to `recvfrom` not arriving or causing a hang. A decrease in the apparent failure rate would give evidence that case 1 is happening.
>
> 2) If the `cond` timeout value is large, say days or months, try setting it down to, say, 5 times the time in seconds you expect your computation-under-test to take. If case 1 above is happening, this should increase the apparent failure rate and thereby give evidence that case 1 is happening.
> Focusing on the `recvfrom` part of `_libc_recvfrom_cancel`:
>
> 3) Try using the javalib Socket `setSoTimeout(int timeout)` method to set the OS socket timeout to a few multiples of your expected computation-under-test time. The argument is Java-style milliseconds, so one must be careful with seconds vs milliseconds. This (debug) approach should put any 'timeout' handling down at the OS level and avoid any userspace (quasi-)signal handling and/or C signal blocking and/or C-signal insufficient strength. (I personally generally design to avoid C signals because of their ~~quirks~~ features. I am told that on some operating systems one must enable socket interruptibility-via-signals and/or avoid some other thread adversely manipulating the signal mask.)
> Is the intent that this code run on FreeBSD? macOS? Linux?
>
> I am trying to avoid suggesting code changes until we have more evidence pointing to a culprit. IIUC, your interest is in the underlying crypto code and not the socket I/O or test-framework part, and you are kindly reporting what appears to be sand in the gears of SN.

This code uses an in-house libuv wrapper, which leads to a segfault. I'm still
trying to catch it, and until it is stable enough I can't rule out possible
side effects.

On the one hand, I've commented out the only test which uses that part;
probably it isn't used at all.

On the other hand, if you are willing to help dig it out, I may share the code
with a promise not to share it further yet (feel free to contact me via the
email which I use for commits).

> ###### Complexity I am trying to avoid
>
> In describing this, I do not mean to offend either you or the code under discussion. Having one thread cancel I/O in another is a well-known problem. Very few people get it right and, I must confess, I am not one of those people.
>
> The usual solution is to have the `read` thread do its own timeout using `kqueue()` `EVFILT_TIMER` on *BSD & macOS and `epoll` with a `timerfd` on Linux (yes, OS-specific code). Setting SO_TIMEOUT on the OS socket, as suggested for debugging above, is a more historical approach. `kqueue()` allows the application to attempt some recovery: more flexibility; more complexity.
>
> `javalib/src/main/scala/java/lang/process/UnixProcessGen2.scala` gives a somewhat close example, for a process instead of a timer. Sorry, the code is ugly/inefficient but effective. A `timedRead` implementation would have to `kqueue` for the OS socket id (fetched from the Java Socket) using `EVFILT_TIMER`.

Yeah, I saw that code when I was porting SN to OpenBSD; switching to UPG2
led to broken tests.
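For concreteness, here is a rough Scala Native sketch of the kqueue-based timed-read idea quoted above. Everything in it is an assumption rather than an existing SN API: the `struct kevent` layout is the common 64-bit *BSD one, the constants should be checked against the target's `<sys/event.h>`, and for simplicity it waits on `EVFILT_READ` using `kevent`'s timeout argument instead of registering a separate `EVFILT_TIMER` event:

```scala
import scala.scalanative.unsafe._
import scala.scalanative.unsigned._
import scala.scalanative.posix.unistd.close

object KqTypes {
  // Assumed 64-bit *BSD layout of struct kevent (verify against <sys/event.h>):
  //   uintptr_t ident; short filter; u_short flags; u_int fflags;
  //   int64_t data; void *udata;
  type kevent_t = CStruct6[CLongLong, CShort, CUnsignedShort, CUnsignedInt, CLongLong, Ptr[Byte]]
  type timespec = CStruct2[CLong, CLong] // tv_sec, tv_nsec on 64-bit

  final val EVFILT_READ: CShort = -1 // common *BSD value; verify per OS
  final val EV_ADD = 0x0001
  final val EV_ONESHOT = 0x0010
}

@extern
object KqOps {
  def kqueue(): CInt = extern
  def kevent(
      kq: CInt,
      changelist: Ptr[KqTypes.kevent_t],
      nchanges: CInt,
      eventlist: Ptr[KqTypes.kevent_t],
      nevents: CInt,
      timeout: Ptr[KqTypes.timespec]
  ): CInt = extern
}

object TimedRead {
  import KqTypes._, KqOps._

  /** Block until `fd` is readable or `timeoutMs` elapses.
   *  Returns true if readable, false on timeout or error.
   */
  def awaitReadable(fd: Int, timeoutMs: Long): Boolean = {
    val q = kqueue()
    if (q < 0) false
    else
      try {
        val change = stackalloc[kevent_t]()
        change._1 = fd.toLong                      // ident: the OS socket fd
        change._2 = EVFILT_READ                    // filter
        change._3 = (EV_ADD | EV_ONESHOT).toUShort // flags
        change._4 = 0.toUInt                       // fflags
        change._5 = 0L                             // data
        change._6 = null                           // udata

        val ts = stackalloc[timespec]()
        ts._1 = timeoutMs / 1000              // tv_sec
        ts._2 = (timeoutMs % 1000) * 1000000L // tv_nsec

        val event = stackalloc[kevent_t]()
        kevent(q, change, 1, event, 1, ts) > 0 // 0 => timed out, -1 => error
      } finally close(q)
  }
}
```

The point of this design is that the reading thread owns its own timeout, so no other thread ever has to cancel its I/O.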
I've made a trivial shell to dig into it: https://github.com/catap/shell,
which I plan to reuse to add CI to SN.

Frankly speaking, I really think that adding BSD to CI is a really good move now.
--
wbr, Kirill
Have you maybe observed it also on 0.5.0-RC1? In the RC2 we've added #3827, which retries the `read` upon interruption. It was added because attaching to the test-runner was stopping its execution - it was throwing an unhandled SocketException - but I don't see how we might have received an interrupt when running tests.

> The `java.lang.impl.PosixThreadD4parkjzuEO` framework is messed up.

I don't think that might be the case. `ForkJoinPool` WorkerThreads are daemons and can scale down if unused for a longer time. The one thread we've seen in the stack dump is probably just the last alive worker waiting for tasks.

I'd probably bet on a missing message on either the JVM or the test-runner side.

I'm not sure if it might be related, but just right now Windows CI has failed due to a timeout, probably by being stuck in the same way: https://github.com/scala-native/scala-native/actions/runs/8410284069/job/23028532861?pr=3849
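As a quick sanity check of the daemon-worker point, here is a tiny plain-Scala (JVM) sketch; the assumption is that SN's `ForkJoinPool` port keeps the same semantics:

```scala
import java.util.concurrent.ForkJoinPool

object WorkerDaemonCheck {
  def main(args: Array[String]): Unit = {
    ForkJoinPool
      .commonPool()
      .submit(new Runnable {
        def run(): Unit = {
          val t = Thread.currentThread()
          // ForkJoinPool workers are daemon threads, so a lone parked worker
          // cannot keep the process alive by itself.
          println(s"${t.getName}: daemon=${t.isDaemon}")
        }
      })
      .join()
  }
}
```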
On Sun, 24 Mar 2024 17:34:49 +0100, Wojciech Mazur wrote:
> Have you maybe observed it also on 0.5.0-RC1? In the RC2 we've added #3827, which retries the `read` upon interruption. It was added because attaching to the test-runner was stopping its execution - it was throwing an unhandled SocketException - but I don't see how we might have received an interrupt when running tests.

Nope, but I'd like to take your idea of stabilising the scalatest tests further.
It seems the right move, and I'll dig into it this week.

> > The `java.lang.impl.PosixThreadD4parkjzuEO` framework is messed up.
>
> I don't think that might be the case. `ForkJoinPool` WorkerThreads are daemons and can scale down if unused for a longer time. The one thread we've seen in the stack dump is probably just the last alive worker waiting for tasks.
>
> I'd probably bet on a missing message on either the JVM or the test-runner side.

Which seems quite unlikely, but who knows.
--
wbr, Kirill
And I have a piece of strange news: I have reproduced the same deadlock without scalatest:
@WojciechMazur I can reproduce the deadlock using scalanative from a06f46d, but now I have only one thread left:
I've tried reverting #3827 and the deadlock is still here.
Interesting observation with:
the deadlock is quite difficult to reproduce in my small test case, but without this block, each attempt leads to the deadlock.
Extracted a somewhat related issue into a dedicated ticket: #3859
Further investigation. From:

where the 3rd column is the relative time since the previous event. The last signal was from my …
I use scalatest/scalatest#2318 to run my test suite of ~1.5k tests, the majority of which are async.
If I run it as:
it usually passes, with the deadlock reproduced in about one out of ten runs. Anyway, without it, almost every run stopped with this deadlock.
It doesn't consume CPU, and when I attached gdb to the test program, it was stuck like this: