
[wasm-mt] Sampling thread requests resume for a thread with !(suspend_count > 0) #72857

Closed
lambdageek opened this issue Jul 26, 2022 · 3 comments · Fixed by #73305
@lambdageek
Member

Seen here:

#72275 (comment)

[image]

Repro (probably):

Build the runtime with /p:WasmEnableThreads=true; build and run the browser-mt-eventpipe sample.

@ghost ghost added the untriaged label Jul 26, 2022
@lambdageek lambdageek added the area-Diagnostics-mono and arch-wasm labels and removed the area-Tracing-coreclr and untriaged labels Jul 26, 2022
@lambdageek lambdageek added this to Incoming in wasm-mt via automation Jul 26, 2022
@ghost

ghost commented Jul 26, 2022

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.


@lambdageek lambdageek modified the milestones: 8.0.0, 7.0.0 Jul 26, 2022
@lambdageek lambdageek moved this from Incoming to In progress in wasm-mt Jul 30, 2022
@lambdageek lambdageek self-assigned this Jul 30, 2022
@lambdageek
Member Author

lambdageek commented Jul 30, 2022

OK, this is actually due to the thread suspend changes in 6726fae

In that PR, we added logic to run a second phase of STW on full coop suspend if mono_threads_platform_stw_defer_initial_suspend returns TRUE for some thread.

The problem is that on hybrid suspend (where two-phase STW was developed), the assumptions were:

  1. The first phase always decides to return MONO_THREAD_BEGIN_SUSPEND_NEXT_PHASE after first doing a "cordial" suspension request. When the cordial suspend request happens, the thread is put into a "suspend requested" state, or if it was already suspended its suspend count is incremented. Only after that do we try to decide if the thread is sufficiently suspended or if we need to run the second phase and preempt the thread.
  2. When we run the second phase, we don't do the "cordial" suspend - we do "peek and preempt". The key thing about "peek and preempt" is that if the thread is already suspended by the time we "peek", we don't increment the suspend count, and we don't try to force it to suspend.

In contrast, when we added the second suspend phase for platform_stw_defer_initial_suspend, the goal was to avoid putting the main browser thread into the "suspend requested" state until after we attempted to suspend all the other threads (so that the main thread could continue running queued syscalls on their behalf).

The problem is that when we decide to defer the main thread and then run the second phase, we again run the "cordial" suspend, which again requests suspension of every thread. For self-suspended threads (i.e. well-behaved coop threads) this ends up incrementing their suspend count a second time.

Then when we run "resume", it only decrements the suspend count once, and so:

  1. the victim thread is stuck in suspend limbo (since its suspend count never returns to zero)
  2. after enough STWs (due to EventPipe sampling, or due to GCs) we end up wrapping the suspend count (it's an 8-bit signed int), which goes 126 -> 127 -> -128, and we end up asserting !(suspend_count > 0) (see the sketch below)
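
To make the counter arithmetic concrete, here's a tiny standalone C model of that drift (not the Mono code, just the bookkeeping described above: two cordial requests per STW, one decrement per restart, on a signed 8-bit counter):

```c
#include <stdint.h>
#include <stdio.h>

int main (void)
{
    int8_t suspend_count = 0;   /* same 8-bit signed width as the runtime counter */

    for (int stw = 1; stw <= 200; stw++) {
        suspend_count++;    /* phase 1: cordial suspend request */
        suspend_count++;    /* phase 2: cordial suspend runs again (the bug) */

        /* restarting the world asserts the thread is actually suspended,
         * i.e. suspend_count > 0, before resuming it */
        if (!(suspend_count > 0)) {
            printf ("STW #%d: suspend_count wrapped to %d\n", stw, suspend_count);
            return 1;
        }
        suspend_count--;    /* but the thread is only resumed once per STW */
    }
    return 0;
}
```

Assuming two's-complement wraparound (which is what we get on the wasm targets), this trips on the 127th STW, exactly the 126 -> 127 -> -128 walk above.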

I'm not sure what the solution is yet. Probably we need some version of "peek and preempt" for full coop suspend (so... without the "and preempt"). But "peek, no preempt" won't help us here - that logic depends on the first phase always requesting a suspend of every thread. In contrast, here we need to partition the threads: those where we request suspension in the first phase and those where we request suspension in the second phase, and neither partition should be acted on in the "other" phase. It would also be nice if we could avoid adding more state to MonoThreadInfo. Maybe we change mono_threads_platform_stw_defer_initial_suspend to return some kind of suspend priority instead of a boolean, and run k iterations from 0 to the max seen (logically lowest) priority, where in each iteration i we only suspend threads with priority i. (For WASM the priorities would be 0 (not the main thread) and 1 (main thread), and we can compute them cheaply on the fly by calling mono_threads_wasm_is_browser_thread.)
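
Very roughly, that priority loop could look something like this (illustrative sketch only - these are not the real mono-threads structures or entry points, just the shape of the idea):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-ins for MonoThreadInfo and the suspend machinery. */
typedef struct {
    int id;
    bool is_browser_thread;    /* what mono_threads_wasm_is_browser_thread would tell us */
    bool suspend_requested;
} ToyThread;

/* Computed on the fly, so no new state has to live in MonoThreadInfo:
 * priority 0 = everything except the main browser thread, 1 = the main thread. */
static int
suspend_priority (const ToyThread *t)
{
    return t->is_browser_thread ? 1 : 0;
}

static void
request_suspension_cordially (ToyThread *t)
{
    t->suspend_requested = true;
    printf ("requested suspend of thread %d\n", t->id);
}

int
main (void)
{
    ToyThread threads[] = {
        { .id = 1, .is_browser_thread = true },
        { .id = 2, .is_browser_thread = false },
        { .id = 3, .is_browser_thread = false },
    };
    const int max_priority = 1;

    /* One pass per priority level: each thread gets exactly one cordial
     * request per STW, so no suspend count can be bumped twice. */
    for (int prio = 0; prio <= max_priority; prio++)
        for (size_t i = 0; i < sizeof (threads) / sizeof (threads[0]); i++)
            if (suspend_priority (&threads[i]) == prio)
                request_suspension_cordially (&threads[i]);

    return 0;
}
```

The point is that each thread receives exactly one cordial suspend request per STW, and the priority is recomputed on the fly instead of being stored in MonoThreadInfo.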


I'm pretty certain this is not a bug in the hybrid suspend idea of two-phase suspend. That is, we haven't been sitting on a bug for 4 years since 2f6379d. This is purely a bug in the recent code that attempted to add a second phase to full coop on WASM.

@lambdageek lambdageek changed the title [wasm-ep] Sampling thread requests resume for a thread with !(suspend_count > 0) [wasm-mt] Sampling thread requests resume for a thread with !(suspend_count > 0) Jul 30, 2022
@lambdageek
Member Author

Not sure what I was thinking in 6726fae. Coop suspend is really mono_os_sem_wait, which boils down to __timedwait_cp, which processes queued calls. So it should be OK to suspend the main thread in the first phase together with the other threads.

@ghost ghost added the in-pr label Aug 3, 2022
lambdageek added a commit to lambdageek/runtime that referenced this issue Aug 4, 2022
[coop] Don't call begin_suspend_request_suspension_cordially twice …
lambdageek added a commit that referenced this issue Aug 5, 2022
…r; fix merge (#73305)

Grab bag of threading fixes:

1. Remove the coop two-phase transition (Partially revert 6726fae).  This was based on a misunderstanding of how Emscripten works: when the main thread is blocked in a concurrency primitive like `sem_wait`, it is still processing queued calls from other threads.  So there is no need to first suspend the worker threads and then suspend the main thread.  The implementation of two-phase suspend had a bug where it would suspend worker threads twice, making the suspend count increase by 2. Since resume only decremented the count by 1, this led to a suspend count overflow.  Fixes #72857
2. Once the diagnostic server attaches to the runtime, switch it to GC Safe mode when it returns to JavaScript.  That is, while the diagnostic server is reacting to messages in the JS event loop, it is considered suspended by the runtime.  When it calls into C, switch to GC Unsafe (which may block if there's a STW happening).  Add thread state transitions when we come back to C, and when we wait.
3. Mark the wasm diagnostic server thread as "no sample; no gc" which means that we don't consider it for STW when there's a GC or a sample profiler active.  This is how we treat utility threads (including the non-wasm diagnostic server thread) on other platforms.
4. Fix a bad signature for `cwraps.mono_wasm_event_pipe_enable` due to a mistake in a previous merge
5. Added a new `browser-threads` sample

---

* [coop] Don't call begin_suspend_request_suspension_cordially twice

Each call to begin_suspend_request_suspension_cordially may increment
the suspend count.  But in STW we only resume each thread once.

In 6726fae we added a second phase of
STW to full coop on WebAssembly in order to suspend the browser thread
after all the worker threads have been suspended, so as to avoid
some deadlocks that rely on the main thread continuing to process
async work on behalf of the workers before they reach a safepoint.

The problem is that for worker threads we could end up calling
begin_suspend_request_suspension_cordially twice. If the thread
self-suspends after the first call, the second call will increment the
suspend count.  As a result, when we restart the world, the thread
will decrement its suspend count, but still stay suspended.  Worse, on
the _next_ stw, we will increment the suspend count two more times and
decrement once on the next restart, etc.

Eventually the thread will overflow the suspend counter and we will
assert `!(suspend_count > 0)`.

Also change `THREAD_SUSPEND_COUNT_MAX` to `0x7F` (from `0xFF`) - the
suspend count is signed, so the roll-over from 127 to -128 is where we
should assert

Fixes #72857

* improve thread state machine assertion messages

   include thread states and thread ids where available

* fix typo

* Revert "[coop] Don't call begin_suspend_request_suspension_cordially twice"

This reverts commit 92f52ab7ed1cfaa1a4f66e869a8d9404e066f1b2.

* [threads] Revert coop two-phase STW

Remove mono_threads_platform_stw_defer_initial_suspend

The motivation for it in 6726fae
was unfounded.  There is no need to suspend the main browser thread
after the other threads: suspension on wasm uses `sem_wait`, which on
the Emscripten main thread is implemented using a busy wait
(`__timedwait_cp`) that processes queued calls.  So even if we suspend
the main thread first, it will still allow other threads in GC Safe to
make progress if they're using syscalls.

* Switch the diagnostics server to GC Safe when returning to JS; set NO_GC flag

The diagnostic server worker spends most of its time in the JS event
loop waiting for messages.  After we attach to the runtime, we need to
switch to GC Safe mode because the diagnostic server may not ever
reach a safepoint (for example if no more DS events arrive).

Conversely, when we call from JS into the C diagnostic server, we need
to enter GC Unsafe mode (and potentially safepoint).

Also mark the diagnostic server thread with the NO_GC flag - this
thread does not manipulate managed objects, so it doesn't need to stop
for GC STW.
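
Schematically, a DS entry point called from JS ends up shaped roughly like this (a sketch of the pattern only, not the actual diagnostics_server.c change; `ds_server_handle_message` and `handle_message` are placeholder names, and it assumes the thread was already left in GC Safe mode when it attached and returned to the event loop):

```c
/* Sketch only: a diagnostic server entry point invoked from the JS event loop. */
EMSCRIPTEN_KEEPALIVE void
ds_server_handle_message (void *msg)
{
    /* While running C code we must be in GC Unsafe mode; entering it here
     * may block (safepoint) if a STW is in progress. */
    MONO_ENTER_GC_UNSAFE;
    handle_message (msg);   /* placeholder for the real DS message handling */
    MONO_EXIT_GC_UNSAFE;
    /* On return the thread is back in GC Safe mode, so the runtime can
     * finish a STW even if no further DS events ever arrive. */
}
```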

* cwraps: fix bad signature for mono_wasm_event_pipe_enable

   Mistake from a previous merge

* Add new browser-threads sample

* exclude the browser-threads sample unless wasm threads are enabled

* Update browser-threads sample

* Update src/mono/mono/component/diagnostics_server.c

* Update src/mono/mono/utils/mono-threads-state-machine.c

Co-authored-by: Katelyn Gadd <kg@luminance.org>
wasm-mt automation moved this from In progress to Done Aug 5, 2022
@ghost ghost removed the in-pr label Aug 5, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Sep 5, 2022
Labels: arch-wasm, area-VM-threading-mono