Porting `SuspendAllThreads` from the NativeAOT to CoreCLR. #101782

VSadov · 2024-05-01T23:49:52Z

A step towards making EE suspension support similar between NativeAOT and CoreCLR, and possibly sharing it eventually.

The main goal of this change is to port SuspendAllThreads and ResumeAllThreads from NativeAOT. More steps will follow on this path, but this looks like a good state to commit the changes.

The CoreCLR implementation of SuspendAllThreads is now roughly equivalent to the one in NativeAOT, modulo different helpers to iterate threads or to figure out whether they are in coop mode.

This also introduces a Thread::Hijack entry point through which SuspendAllThreads nudges threads into preempt mode. The implementation of Thread::Hijack is very different between the runtimes right now. Unifying the design of Thread::Hijack will be the goal of further changes.

Introducing SuspendAllThreads pulled a thread of other required or good-to-have changes:

- g_pGCSuspendEvent is now gone together with PING_JIT_TIMEOUT.
  The timeout for the event was 1 millisecond, which is too long. What's worse is that it could still take up to 16 milliseconds, depending on the OS, to time out if threads need to be re-hijacked. We are looking for sub-millisecond timings for suspension. Spending 16 milliseconds per hijack iteration could introduce really bad outliers.
- The TS_GCSuspendPending thread state is removed.
  It was a redundant way to specify whether a coop thread needs to be stopped for GC. Having more than one way only brings confusion and concerns about ordering (what needs to be set/unset/checked, in what order).
- We now trap threads on the preempt->coop transition only.
  In addition to suspension trapping, CoreCLR has other reasons to trap threads (ThreadAbort, Debugger...). The checks for those conditions were done either on transition to coop or on transition from coop - for no good reason. A program will see the same number of forward and reverse coop transitions, so trapping at either edge would work, except that preempt->coop must have a trap for GC purposes and coop->preempt does not.
  In short - this removes RareEnablePreemptiveGC, because having RareDisablePreemptiveGC is enough. Exit to preemptive mode is now an unconditional set of m_fPreemptiveGCDisabled.
- The trap flag for EE suspension is now a single bit that is set/unset atomically and is also the "source of truth" on whether coop threads need to be suspended.
  We still have the trap counter for scenarios like ThreadAbort, as there could be multiple at a time (NativeAOT may need something similar if ThreadAbort is implemented). There can be only one suspension at a time though, thus it is possible and useful to have a single dedicated bit for that.
- Avoid posting new suspension signals when a previous one is still in progress.
  This was done on Windows, but not on Unix. Redundant signals can be harmful on Unix too.
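
To make the trap-flag and transition points above concrete, here is a minimal sketch, not the actual CoreCLR code. It assumes a single atomic word in which bit 0 is the dedicated suspension bit and the remaining bits count other trap reasons. All names (`g_trapState`, `SketchThread`, `trapVisits`, and the helper functions) are hypothetical and chosen only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical layout: bit 0 = "EE suspension pending",
// the higher bits = count of other trap reasons (e.g. ThreadAbort).
constexpr uint32_t kSuspendBit = 1u;
constexpr uint32_t kOtherTrapUnit = 2u;

std::atomic<uint32_t> g_trapState{0};

// Only one suspension can be in flight, so a single bit is enough.
inline void SetSuspensionTrap()   { g_trapState.fetch_or(kSuspendBit); }
inline void ClearSuspensionTrap() { g_trapState.fetch_and(~kSuspendBit); }
inline bool SuspensionPending()   { return (g_trapState.load() & kSuspendBit) != 0; }

// Other trap reasons may overlap, so they are counted.
inline void AddOtherTrap() { g_trapState.fetch_add(kOtherTrapUnit); }
inline void RemoveOtherTrap() {
    assert((g_trapState.load() & ~kSuspendBit) != 0); // must pair with AddOtherTrap
    g_trapState.fetch_sub(kOtherTrapUnit);
}

struct SketchThread {
    std::atomic<int> preemptiveGCDisabled{0}; // 1 = coop mode, 0 = preempt mode
    int trapVisits = 0;                       // how often the slow path was taken

    // coop -> preempt: an unconditional store, no trap check.
    void EnablePreemptiveGC() { preemptiveGCDisabled.store(0); }

    // preempt -> coop: the only edge where the thread checks for traps.
    void DisablePreemptiveGC() {
        preemptiveGCDisabled.store(1);
        if (g_trapState.load() != 0) {
            RareDisablePreemptiveGC();
        }
    }

    // A real runtime would block here until the suspension ends or would
    // handle the other trap reasons; the sketch only records the visit.
    void RareDisablePreemptiveGC() { ++trapVisits; }
};
```

The point of the layout is that one load answers "is there any reason to take the slow path?", while the suspension bit alone answers "does a coop thread need to stop for the EE?".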

VSadov · 2024-05-01T23:51:40Z

CC @tommcdon - I may need to run this through debugger tests. I'll contact you separately for that.

mangod9 · 2024-05-02T01:08:18Z

src/coreclr/vm/threadsuspend.cpp

+
+// exponential spinwait with an approximate time limit for waiting in microsecond range.
+// when iteration == -1, only usecLimit is used
+void SpinWait(int iteration, int usecLimit)


Is this existing logic or a new mechanism? Changing the spinwait duration has been shown to impact startup when DATAS is enabled.

This is not new, it is copied from NativeAOT. It is not related to spinwaits used by GC in any way. Here we just need to keep the thread that performs the suspension busy for a few microseconds. System timers do not offer pauses with such granularity, not in a portable way, anyways.
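
For readers unfamiliar with this kind of helper, a rough sketch of an exponential spin-wait with a microsecond cap follows. It is not the NativeAOT or CoreCLR implementation: the names `SpinBudgetUsec` and `SpinWaitSketch` are hypothetical, and the assumption that the budget doubles as 2^iteration microseconds is mine; only the "iteration == -1 means use usecLimit alone" rule comes from the comment in the diff.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: how long to spin for a given iteration.
// Assumed policy: 2^iteration microseconds, capped at usecLimit;
// iteration == -1 means "use usecLimit as is".
inline int64_t SpinBudgetUsec(int iteration, int usecLimit) {
    assert(usecLimit >= 0);
    if (iteration < 0) {
        return usecLimit;
    }
    // Avoid shifting past the width that an int limit could ever need.
    int64_t budget = (iteration >= 30) ? usecLimit : (int64_t{1} << iteration);
    return budget < usecLimit ? budget : usecLimit;
}

// Keeps the calling thread busy for roughly the computed budget.
// A steady clock is polled because OS sleep primitives generally
// cannot pause for only a few microseconds in a portable way.
inline void SpinWaitSketch(int iteration, int usecLimit) {
    using Clock = std::chrono::steady_clock;
    const auto deadline =
        Clock::now() + std::chrono::microseconds(SpinBudgetUsec(iteration, usecLimit));
    while (Clock::now() < deadline) {
        // Busy loop; a production version would likely add a CPU pause/yield hint.
    }
}
```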

What is going on here:

- We set a flag telling threads to suspend themselves (post a frame suitable for a stackwalk start and block on an event).
- Most threads do some kind of allocations or calls into the runtime, so they will notice the flag and suspend.
- Some threads may not be doing that (they could be computing something in a loop), so they need to be hijacked.
- Hijack may either catch a thread in interruptible code (then we are done with the thread) or hijack the return address, so that when the thread returns from the current call, it will suspend itself.
- Hijack typically leads to thread suspension, but not always (a thread may go deeper into the call tree), so we might need to redo the hijack a few times. We should eventually "corner" the thread, as the hijacked return will be moving only lower in the call tree with every try.

Here is a conundrum though - after hijacking we must let the thread run for a while, so it has a chance to observe what we did and suspend itself. We can't be too aggressive with this. If we keep interrupting the thread to check what is happening, nothing will happen. So after a hijack cycle we back off for a few microseconds and then check if we need to hijack again.

We will also increase the time we give to the thread with every iteration (to make sure we are not starving it with our interruptions), but up to a limit.

Timings here are derived from the general pause expectations - 1/60 second (16 msec) pause could be perceptible in 60fps animation, 1/15 second is certainly noticeable in interactive apps. But what we do here is just the suspension part, we need to leave most of that time for the GC to run. While we can't guarantee the upper bound, we strive for suspension to happen in sub-millisecond time.
Thus the pauses between retries are measured in microseconds.
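
The retry loop described in this comment can be modelled roughly as follows. This is a toy simulation under stated assumptions, not the runtime code: `FakeThread`, `hijacksUntilStop`, `SuspendAllSketch`, and the doubling back-off with a `maxBackoffUsec` cap are all hypothetical, and the back-off is injected as a callback so the model can run without real threads or timers.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for a managed thread.
struct FakeThread {
    bool running = true;      // still executing managed code
    int hijacksUntilStop = 0; // 0 = notices the trap flag on its own
};

// Returns the number of hijack rounds that were needed.
inline int SuspendAllSketch(std::vector<FakeThread>& threads,
                            const std::function<void(int)>& backoffUsec,
                            int maxBackoffUsec = 100) {
    assert(maxBackoffUsec > 0);

    // Setting the trap flag is modelled implicitly: threads that poll
    // (allocations, runtime calls) stop without any hijacking.
    for (auto& t : threads) {
        if (t.hijacksUntilStop == 0) t.running = false;
    }

    int rounds = 0;
    int delay = 1;
    for (;;) {
        bool anyRunning = false;
        for (auto& t : threads) {
            if (!t.running) continue;
            anyRunning = true;
            // "Hijack" the thread; it may take several attempts to corner it.
            if (--t.hijacksUntilStop == 0) t.running = false;
        }
        if (!anyRunning) break;

        ++rounds;
        // Back off so hijacked threads get a chance to run and park themselves,
        // giving them more time on every round, up to a limit.
        backoffUsec(delay);
        delay = std::min(delay * 2, maxBackoffUsec);
    }
    return rounds;
}
```

Passing the back-off as a callback is only a convenience for the sketch; in a real runtime that role would be played by a microsecond-scale spin-wait on the suspending thread.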

VSadov · 2024-05-02T16:18:09Z

For the debugger interaction this should work the same as before, unless I messed up something while moving code around.

tommcdon · 2024-05-02T17:29:16Z

Adding @kouvel to review IsInForbidSuspendForDebuggerRegion refactoring changes

VSadov · 2024-05-02T18:40:24Z

src/coreclr/nativeaot/Runtime/thread.cpp

@@ -328,14 +328,6 @@ bool Thread::IsGCSpecial()
    return IsStateSet(TSF_IsGcSpecialThread);
 }

-bool Thread::CatchAtSafePoint()


This was dead code already. No one called this.

VSadov · 2024-05-02T18:46:01Z

src/coreclr/vm/i386/asmhelpers.asm

@@ -394,29 +393,6 @@ endif
        retn    8
 _CallJitEHFinallyHelper@8 ENDP

-;-----------------------------------------------------------------------


We stopped doing thread trapping on transitions from coop in JIT helpers a while ago.
It looks like x86 stubs were left behind and were still doing it.

noahfalk

I think this looks OK but given the complexity of thread suspension I wouldn't feel particularly confident that I would catch issues :)

It's possible that your testing will uncover subtle dependencies the debugger had on suspending during the coop->preempt transition, but I am not aware of any explicit dependency.

noahfalk · 2024-05-03T07:19:39Z

src/coreclr/vm/threadsuspend.cpp

-    if (ThreadStore::HoldingThreadStore(this))
+    // A thread that performs GC may switch modes inside GC and come here.
+    // We will not try suspending a thread that is responsible for the suspension.
+    if (this == ThreadSuspend::GetSuspensionThread())


Is there some case where a thread is set as the suspension thread and it doesn't hold the TSL? I am wondering if we need this check when we've got the TSL check below.

> Is there some case where a thread is set as the suspension thread and it doesn't hold the TSL? I am wondering if we need this check when we've got the TSL check below.

The difference between these two checks is that this one is quite understandable - a thread that performs suspension certainly does not want to block itself. Places that perform mode switches (like inside GC) could special-case the suspending thread, but it is easier to just handle the scenario here.

The one below seems a bit more dangerous. That is a random thread holding the TSL and trying to get into coop mode. If such a scenario happens by accident and then the thread allocates, causes a GC... etc, I am not sure what would happen.
I initially had an assert instead of a check, but the assert was hit in tests, so I changed it to a condition. I am not happy about that. Maybe there is no way around this and some scenarios must do it (very carefully), but it seems fragile.

You are right though - the next check, which tests for TSL ownership, subsumes the check for the thread driving the suspension. I guess I can remove the this == ThreadSuspend::GetSuspensionThread() check for now.

I will try to follow up separately from this PR and see if we can avoid holding TSL in coop mode.

I'm fine if you want to keep the check or convert it to an assert inside the TSL check. Either way comments might be nice to preserve the context you just described here in the code. Thanks!

kouvel · 2024-05-06T18:45:43Z

> Adding @kouvel to review IsInForbidSuspendForDebuggerRegion refactoring changes

The refactoring looks fine to me.

kouvel

LGTM, and looks mostly similar to before, though I may have missed some subtleties.

VSadov · 2024-05-10T04:12:21Z

Thanks!!

VSadov added the area-VM-coreclr label May 1, 2024

VSadov requested a review from MichalStrehovsky as a code owner May 1, 2024 23:49

dotnet-policy-service bot assigned VSadov May 1, 2024

mangod9 reviewed May 2, 2024

build-analysis bot mentioned this pull request May 2, 2024

System.Numerics.Tensors.Tests.SingleGenericTensorPrimitives.SpanScalarDestination_SpecialValues fails #101721

Closed

tommcdon requested review from noahfalk and a team May 2, 2024 12:48

VSadov commented May 2, 2024

noahfalk approved these changes May 3, 2024

VSadov added 17 commits May 4, 2024 14:58

port suspension algo from NativeAOT

abddacb

PING_JIT_TIMEOUT gone

b5091cc

CatchAtSafePoint is always opportunistic

afa8fb6

current

08a90e8

removed RareEnablePreemptiveGC

bb4be45

cleanup RareDisablePreemptiveGC

ec11b18

fix for x86

17a83c6

factored out Thread::Hijack

e9be64b

fix build for non-x64 windows

4ca217d

assert noone holds TSL into coop mode

fb2c654

activation safety check is always for the current thread

1fb68ea

undo comment

e5ce033

PulseGCMode should not check for CatchAtSafePointOpportunistic

c26364a

not disabling preempt while holding TSL

1dbd95b

tweak

cf19edd

dead assert

ca6e83e

tweak RareDisablePreemptiveGC

15acfe2

VSadov added 8 commits May 4, 2024 14:58

RareDisablePreemptiveGC avoid GetSuspensionThread()

c1c20f7

updated Thread::Hijack

5c02723

fix typo

827d4de

allow coop mode while holding TS lock

8b9b7a3

Refactored into SuspendAllThreads/ResumeAllThreads

bad07f4

SetThreadTrapForSuspension

c9ad7ff

deleted TS_GCSuspendPending

96ab9f5

tweaks

48e1c78

VSadov force-pushed the suspAllThrCore branch from a38b54a to 48e1c78 Compare May 5, 2024 01:42

PR feedback

1688024

kouvel approved these changes May 6, 2024

build-analysis bot mentioned this pull request May 7, 2024

Test failure in System.Numerics.Tensors.Tests.SingleGenericTensorPrimitives.SpanDestinationFunctions_SpecialValues #101731

Closed

tommcdon approved these changes May 10, 2024

VSadov merged commit 00a8973 into dotnet:main May 10, 2024
88 of 90 checks passed

VSadov deleted the suspAllThrCore branch May 10, 2024 04:12

VSadov mentioned this pull request May 11, 2024

Allow async interruptions on safepoints #95565

Merged
