Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock inside scalatest async tests #3848

Open
catap opened this issue Mar 22, 2024 · 10 comments
Open

Deadlock inside scalatest async tests #3848

catap opened this issue Mar 22, 2024 · 10 comments
Labels

Comments

@catap
Copy link
Contributor

catap commented Mar 22, 2024

I use scalatest/scalatest#2318 to run my test suite for ~1.5k tests with majority of them are async.

If I run it as:

Global / concurrentRestrictions := Seq(
  Tags.limit(Tags.Test, 1)
)

it usually pass, and deadlock is reproduced in one out of ten runs. Anyway, without it, almost each run stoped with this deadlock.

It doesn't consume CPU and when I've attached by gdb to test program it stuck like this:

(gdb) thread apply all bt

Thread 2 (thread 117065):
#0  _thread_sys_recvfrom () at /tmp/-:2
#1  0x352c99796082b033 in ?? ()
#2  0x00000064e1096172 in _libc_recvfrom_cancel (fd=Unhandled dwarf expression opcode 0xa3
) at /usr/src/lib/libc/sys/w_recvfrom.c:28
#3  0x00000062a1e2a786 in _SM32java.net.AbstractPlainSocketImplD4readLAb_iiiEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#4  0x00000062a1e31620 in _SM26java.net.SocketInputStreamD4readLAb_iiiEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#5  0x00000062a20b0e8f in _SM23java.io.DataInputStreamD14rebufferImpl$1iiiEPT23java.io.DataInputStream ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#6  0x00000062a20b1034 in _SM23java.io.DataInputStreamD7readIntiEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#7  0x00000062a20f4bac in _SM41scala.scalanative.testinterface.NativeRPCD4loopiEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#8  0x00000062a20f6c2f in _SM41scala.scalanative.testinterface.TestMain$D4mainLAL16java.lang.String_uEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#9  0x00000062a20f6825 in _SM40scala.scalanative.testinterface.TestMainD4mainLAL16java.lang.String_uEo ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#10 0x00000062a1e5ed66 in main () from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test

Thread 1 (thread 301394):
#0  futex () at /tmp/-:2
#1  0x757a9499a3f94aa3 in ?? ()
#2  0x00000064e10dc42a in _rthread_cond_timedwait (cond=0x6525557af0, mutexp=0x65d9375338, abs=0x64f20ddda0) at /usr/src/lib/libc/thread/rthread_cond.c:102
#3  0x00000062a211db98 in _SM26java.lang.impl.PosixThreadD4parkjzuEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#4  0x00000062a1e458d7 in _SM33java.util.concurrent.ForkJoinPoolD44java$util$concurrent$ForkJoinPool$$awaitWorkL43java.util.concurrent.ForkJoinPool$WorkQueueiEO () from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#5  0x00000062a1e47f15 in _SM41java.util.concurrent.ForkJoinWorkerThreadD3runuEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#6  0x00000062a1eb5839 in _SM39scala.scalanative.runtime.NativeThread$D16threadEntryPointL38scala.scalanative.runtime.NativeThreaduEPT39scala.scalanative.runtime.NativeThread$ () from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#7  0x00000062a1eb5b2b in _SM39scala.scalanative.runtime.NativeThread$D24$anonfun$threadRoutine$1L28scala.scalanative.unsafe.PtrL28scala.scalanative.unsafe.PtrEPT39scala.scalanative.runtime.NativeThread$ () from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#8  0x00000062a1eb5b98 in _SM49scala.scalanative.runtime.NativeThread$$$Lambda$1G17$extern$forwarder ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#9  0x00000062a222e1c8 in ProxyThreadStartRoutine () from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#10 0x0000006511df7f42 in _rthread_start (v=Unhandled dwarf expression opcode 0xa3
) at /usr/src/lib/librthread/rthread.c:96
#11 0x00000064e10aff4a in __tfork_thread () at /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:87
#12 0x00000064e10aff4a in __tfork_thread () at /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:87
Previous frame identical to this frame (corrupt stack?)
(gdb)
@LeeTibbert
Copy link
Contributor

@catap,

I am sorry that you are experiencing this hang. I know from experience that hangs inside a large
test set can be frustrating.

I am off-line for most of the next two months, but let's see if we can make some progress.

I am guessing that the operating system is OpenBSD? 64bit?

You may have way more experience chasing network and thread problems than I,
so let`s put our heads together.

At the root of a decision tree, I think there are three primary possibilities. In my guestimate
order of probability:

  1. _libc_recvfrom_cancel is legitimately waiting in the recvfrom portion for a message which will never come: i.e.
    never sent or dropped message. SN did the transform from the read call to recvfrom.

  2. _libc_recvfrom_cancel has been interrupted, possibly by the indicated or another
    _rthread_cond_timedwait and is stuck inside the _cancel portion of _libc_recvfrom_cancel

  3. The java.lang.impl.PosixThreadD4parkjzuEO framework is messed up.

I took a quick look at the OpenBSD $OpenBSD: w_recvfrom.c,v 1.1 2016/05/07 19:05:22 code.
and, due to macros in the code, could not distinguish cases 1 & 2. If we are indeed talking about
OpenBSD, I'd have to trace into the macros.


Something quick to try, worth a 1/2 hour time, not days.

Focusing on _rthread_cond_timedwait:

  1. If the cond timeout value is small, say a few seconds or less, for debugging try increasing to minutes, say 5.
    The wild thrash is to give reads time to complete if they are "ever" going to do so. This trial does
    not entirely rule out the timer having fired and the signal(?) to recvfrom not arriving or causing a hang.
    A decrease in apparent failure rate would give evidence that case 1 is happening.

  2. If the cond timeout value is large, say days or months, try setting it down to, say, 5 times the time
    in seconds you expect your computation-under-test to take. If case 1 above is happening, this should
    increase the apparent failure rate but also give evidence that case 1 is happening.

Focusing on 'recvfrompart_libc_recvfrom_cancel`:

  1. Try using the javalib Socket setSoTimeout(int timeout) method to set the os fd socket timeout to
    a few multiples of your expected computation-under-test time. The argument is Java-style milliseconds
    so one must be careful with seconds vs milliseconds.

    This (debug) approach should put any 'timeout' handling down at the OS level and avoid any
    userspace (quasi-) signal handling and/or C signal blocking and/or C signal insufficient-strength.
    (I personally generally design to avoid C signals because of their quirks features. I am told
    that on some operating systems, one must enable socket interruptibility-via-signals and/or
    avoid some other thread adversely manipulating the signal mask).


Is the intent that this code run on FreeBSD? macOS? Linux?

I am trying to avoid suggesting code changes until we have more evidence pointing to a culprit.
IIUC, your interest is in the underlying crypto code and not the socket I/O and or test framework
part and that you are kindly reporting what appears to be sand in the gears of SN.

Complexity I am trying to avoid

In describing this, I do not mean to offend either you or the code under discussion.

Having one thread cancel I/O in another is a well know problem. Very few people get it right and,
I must confess, I am not one of those people.

The usual solution is to have the read thread do its own timeout using kqueue() EVFILT_TIMER on
*BSD & macOS and epoll with a timerfd on Linux (yes os specific code). Setting SO_Timeout on the
os socket, as suggested for debug above, is a more historical approach. kqueue() allows the
application to attempt some recovery: more flexibility; more complexity.

javalib/src/main/scala/java/lang/process/UnixProcessGen2.scala gives a somewhat close
example for process, instead of timer. Sorry the code is ugly/inefficient but effective.
A timedRead implementation would have to kqueue for the os socket id (fetched from the Java Socket)
using EVFILT_TIMER.

@catap
Copy link
Contributor Author

catap commented Mar 24, 2024 via email

@WojciechMazur
Copy link
Contributor

WojciechMazur commented Mar 24, 2024

Have you maybe observed it also on 0.5.0-RC1? In the RC2 we've added #3827 which retries the read upon interruption. It was added, becouse attaching to the test-runner was leading the stopping it's execution - it was throwing unhandled SocketException, but I don't see how we might have received an interrupt when running tests.

The java.lang.impl.PosixThreadD4parkjzuEO framework is messed up.

I don't think that might be the case. ForkJoinPool WorkerThreads are daemons and can scale down to if unused for longer time. The 1 thread we've seen in stack dump is probably just the last alive worker waiting for tasks

I'd probably bet on the missing message on either JVM or test-runner side

I'm not sure if might be related, but just right now Windows CI has failed due to timeout out on Windows, probably by being stuck in the same way. https://github.com/scala-native/scala-native/actions/runs/8410284069/job/23028532861?pr=3849
What's really interesting is this line with java.io.IOException: Could not send the packet to the client. You can check if maybe there is some similiar exception in the outputs of your tests.

@catap
Copy link
Contributor Author

catap commented Mar 25, 2024 via email

@catap
Copy link
Contributor Author

catap commented Mar 28, 2024

And I have a pice of strange news. I have achieved the same deadlock without scaltests:

Program received signal SIGINT, Interrupt.
[Switching to thread 602537]
kevent () at /tmp/-:2
2       /tmp/-: No such file or directory.
        in /tmp/-
(gdb) thread apply all bt

Thread 5 (thread 238385):
#0  futex () at /tmp/-:2
#1  0x343eaad106b9bfd2 in ?? ()
#2  0x00000898c047151a in _rthread_cond_timedwait (cond=0x898ab5b36c0, mutexp=0x899e2c55eb8, abs=0x0) at /usr/src/lib/libc/thread/rthread_cond.c:102
#3  0x00000896a511a896 in _SM26java.lang.impl.PosixThreadD4parkjzuEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/sandbox/native/target/scala-3.4.0/sandbox
#4  0x00000896a5000b27 in _SM33java.util.concurrent.ForkJoinPoolD9awaitWorkL43java.util.concurrent.ForkJoinPool$WorkQueueiEPT33java.util.concurrent.ForkJoinPool () from /home/catap/Projects/Kcrypt/cryptosystem/sandbox/native/target/scala-3.4.0/sandbox
#5  0x00000896a5002845 in _SM41java.util.concurrent.ForkJoinWorkerThreadD3runuEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/sandbox/native/target/scala-3.4.0/sandbox
#6  0x00000896a5071809 in _SM39scala.scalanative.runtime.NativeThread$D16threadEntryPointL38scala.scalanative.runtime.NativeThreaduEPT39scala.scalanative.runtime.NativeThread$ () from /home/catap/Projects/Kcrypt/cryptosystem/sandbox/native/target/scala-3.4.0/sandbox
#7  0x00000896a5071abb in _SM39scala.scalanative.runtime.NativeThread$D24threadRoutine$$anonfun$1L28scala.scalanative.unsafe.PtrL28scala.scalanative.unsafe.PtrEPT39scala.scalanative.runtime.NativeThread$ () from /home/catap/Projects/Kcrypt/cryptosystem/sandbox/native/target/scala-3.4.0/sandbox
#8  0x00000896a5071b28 in _SM49scala.scalanative.runtime.NativeThread$$$Lambda$1G17$extern$forwarder ()
   from /home/catap/Projects/Kcrypt/cryptosystem/sandbox/native/target/scala-3.4.0/sandbox
#9  0x00000896a51b6028 in ProxyThreadStartRoutine () from /home/catap/Projects/Kcrypt/cryptosystem/sandbox/native/target/scala-3.4.0/sandbox
#10 0x00000899a3997142 in _rthread_start (v=Unhandled dwarf expression opcode 0xa3
) at /usr/src/lib/librthread/rthread.c:96
#11 0x00000898c043082a in __tfork_thread () at /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:87
#12 0x00000898c043082a in __tfork_thread () at /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:87
Previous frame identical to this frame (corrupt stack?)
2       in /tmp/-
(gdb) 

@catap
Copy link
Contributor Author

catap commented Mar 31, 2024

@WojciechMazur I can reproduce the deadlock by using scalanative from a06f46d, but now I do have only one thread left:

Thread 1 (thread 483794):
#0  _thread_sys_recvfrom () at /tmp/-:2
#1  0x6aef61975ac7756d in ?? ()
#2  0x00000637bdb71872 in _libc_recvfrom_cancel (fd=Unhandled dwarf expression opcode 0xa3
) at /usr/src/lib/libc/sys/w_recvfrom.c:28
#3  0x00000635b6ff9336 in _SM32java.net.AbstractPlainSocketImplD4readLAb_iiiEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#4  0x00000635b70001d0 in _SM26java.net.SocketInputStreamD4readLAb_iiiEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#5  0x00000635b727f4cf in _SM23java.io.DataInputStreamD14rebufferImpl$1iiiEPT23java.io.DataInputStream ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#6  0x00000635b727f674 in _SM23java.io.DataInputStreamD7readIntiEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#7  0x00000635b72c31bb in _SM41scala.scalanative.testinterface.NativeRPCD4loopiEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#8  0x00000635b72c52bf in _SM41scala.scalanative.testinterface.TestMain$D4mainLAL16java.lang.String_uEO ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#9  0x00000635b72c4eb5 in _SM40scala.scalanative.testinterface.TestMainD4mainLAL16java.lang.String_uEo ()
   from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
#10 0x00000635b702e1ab in main () from /home/catap/Projects/Kcrypt/cryptosystem/native/target/scala-2.12/cryptosystem-test
(gdb)

@catap
Copy link
Contributor Author

catap commented Mar 31, 2024

I've tried to revert #3827 and deadlock still here.

@catap
Copy link
Contributor Author

catap commented Apr 1, 2024

Interesting observation with:

  .nativeSettings(nativeConfig ~= { c =>
    import scala.scalanative.build._
    c.withMode(Mode.releaseFast)
  })

deadlock is quite difficult to reproduce in my small test case, but without this block, each attempt leads to deadlock.

@catap
Copy link
Contributor Author

catap commented Apr 1, 2024

Extracted somehow related issue to dedicated ticket: #3859

@catap
Copy link
Contributor Author

catap commented Apr 3, 2024

Future investigation.

From ktrace/kdump point of view I see the last syscalls:

 97560 cryptosystem-test 0.000026 RET   thrkill 0
 97560 cryptosystem-test 0.000006 CALL  sigprocmask(SIG_UNBLOCK,0<>)
 97560 cryptosystem-test 0.000003 RET   sigprocmask 0x40004<SIGQUIT|SIGCONT>
 97560 cryptosystem-test 0.001741 CALL  sendto(3,0x36288a61848,0x4,0x400<MSG_NOSIGNAL>,0,0)
 97560 cryptosystem-test 0.000038 GIO   fd 3 wrote 4 bytes
       "\0\0\0""
 97560 cryptosystem-test 0.000009 RET   sendto 4
 97560 cryptosystem-test 0.000016 CALL  sendto(3,0x36288bf94f8,0x44,0x400<MSG_NOSIGNAL>,0,0)
 97560 cryptosystem-test 0.000021 GIO   fd 3 wrote 68 bytes
       "\0\a\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\^W\0\^[\0[\0003\0002\0m\0-\0 \0w\0h\0e\0n\0 \0o\0b\0j\0e\0c\0t\0s\0\^[\0[\0000\0m"
 97560 cryptosystem-test 0.000005 RET   sendto 68/0x44
 97560 cryptosystem-test 0.000035 CALL  sendto(3,0x36288a61848,0x4,0x400<MSG_NOSIGNAL>,0,0)
 97560 cryptosystem-test 0.000015 GIO   fd 3 wrote 4 bytes
       "\0\0\0n"
 97560 cryptosystem-test 0.000010 RET   sendto 4
 97560 cryptosystem-test 0.000010 CALL  sendto(3,0x36288bfa428,0xdc,0x400<MSG_NOSIGNAL>,0,0)
 97560 cryptosystem-test 0.000040 GIO   fd 3 wrote 220 bytes
       "\0\^D\0\0\0\0\0\0\0\0\0\0\0&\0p\0t\0.\0k\0c\0r\0y\0.\0c\0r\0y\0p\0t\0o\0s\0y\0s\0t\0e\0m\0.\0f\0e\0a\0t\0u\0r\0e\0s\0.\0D\0a\0t\0a\0T\0e\0s\0t\0\^B\0\0\0\0\0\^S\0o\0r\0g\0.\0s\0c\0a\0l\0a\0t\0e\0s\0t\0.\0S\0u\0i\0t\0e\0\^A\0\^B\0\0\0\^Y\0T\0o\0o\0 \0b\0i\0g\
        \0 \0d\0a\0t\0a\0 \0w\0h\0e\0n\0 \0o\0b\0j\0e\0c\0t\0s\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0k"
 97560 cryptosystem-test 0.000009 RET   sendto 220/0xdc
 97560 cryptosystem-test 0.000146 CALL  futex(0x36236219010,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
 97560 cryptosystem-test 0.000013 RET   futex 1
 97560 cryptosystem-test 0.000023 RET   futex 0
 97560 cryptosystem-test 0.000011 CALL  sendto(3,0x36288a61848,0x4,0x400<MSG_NOSIGNAL>,0,0)
 97560 cryptosystem-test 0.000017 CALL  futex(0x36236219010,0x81<FUTEX_WAIT|FUTEX_PRIVATE_FLAG>,3,0,0)
 97560 cryptosystem-test 0.000012 GIO   fd 3 wrote 4 bytes
       "\0\0\0\r"
 97560 cryptosystem-test 0.000009 RET   sendto 4
 97560 cryptosystem-test 0.000011 CALL  sendto(3,0x36288bfd458,0x1a,0x400<MSG_NOSIGNAL>,0,0)
 97560 cryptosystem-test 0.000014 GIO   fd 3 wrote 26 bytes
       "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\^D\0\0\0\0\0\0\0\0"
 97560 cryptosystem-test 0.000009 RET   sendto 26/0x1a
 97560 cryptosystem-test 0.000015 CALL  futex(0x3619a93d2e0,0x81<FUTEX_WAIT|FUTEX_PRIVATE_FLAG>,3,0x3618282e4f0,0)
 97560 cryptosystem-test 0.000009 STRU  struct timespec { 59.999714661 }
 97560 cryptosystem-test 60.006360 RET   futex -1 errno 60 Operation timed out
 97560 cryptosystem-test 0.000144 CALL  futex(0x36236219010,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
 97560 cryptosystem-test 0.000010 RET   futex 1
 97560 cryptosystem-test 0.000004 RET   futex 0
 97560 cryptosystem-test 0.000018 CALL  kbind(0x3618282e558,24,0x30072d99c45383c)
 97560 cryptosystem-test 0.000040 CALL  futex(0x36236219d20,0x81<FUTEX_WAIT|FUTEX_PRIVATE_FLAG>,0,0x361a80cedf0,0)
 97560 cryptosystem-test 0.000006 RET   kbind 0
 97560 cryptosystem-test 0.000004 STRU  struct timespec { 59.999147605 }
 97560 cryptosystem-test 0.000008 CALL  kbind(0x3618282e558,24,0x30072d99c45383c)
 97560 cryptosystem-test 0.000061 RET   kbind 0
 97560 cryptosystem-test 0.000012 CALL  kbind(0x3618282e608,24,0x30072d99c45383c)
 97560 cryptosystem-test 0.000017 RET   kbind 0
 97560 cryptosystem-test 0.000013 CALL  __threxit(0x36193602830)
 97560 cryptosystem-test 60.009837 RET   futex -1 errno 60 Operation timed out
 97560 cryptosystem-test 0.000048 CALL  __threxit(0x36193602030)
 97560 cryptosystem-test 188.891690 PSIG  SIGINT SIG_DFL

where 3rd column is releative time since the previus event.

The last signal was from my C-c to abort the test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants