Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wasmtime hangs when doing file I/O in a multi threaded app #8392

Closed
dbaeumer opened this issue Apr 17, 2024 · 5 comments
Closed

Wasmtime hangs when doing file I/O in a multi threaded app #8392

dbaeumer opened this issue Apr 17, 2024 · 5 comments
Labels
bug Incorrect behavior in the current implementation that needs fixing

Comments

@dbaeumer
Copy link

dbaeumer commented Apr 17, 2024

Executing a wasm application on wasmtime that has multiple threads where on thread does file I/O results in a hang on the first file I/O operation.

Disclaimer: I am not a Rust expert. But since the application runs correctly as a OS process I assume that the Rust code is OK. I am also not sure if this is a Rust toolchain problem or a wasmtime problem.

Test Case

Clone https://github.com/dbaeumer/wasm-threads.git

Steps to Reproduce

  • npm run build:wasm to build the wasm file
  • npm run run:wasmtime to execute the wasm file in wasmtime

Expected Results

The application doesn't hang on the fs::File::open call

I expect the same result as executing:

npm run build:os
npm run run:os

Actual Results

The application does hang.

Versions and Environment

Rust:

Default host: x86_64-unknown-linux-gnu
rustup home:  /xxxx/.rustup

installed targets for active toolchain
--------------------------------------

wasm32-unknown-unknown
wasm32-wasi
wasm32-wasi-preview1-threads
x86_64-unknown-linux-gnu

active toolchain
----------------

stable-x86_64-unknown-linux-gnu (default)
rustc 1.77.2 (25ef9e3d8 2024-04-09)

Wasmtime: wasmtime-cli 19.0.2 (040ec92 2024-04-11)

Operating system:

Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy

@dbaeumer dbaeumer added the bug Incorrect behavior in the current implementation that needs fixing label Apr 17, 2024
@alexcrichton
Copy link
Member

Using gdb/logging it looks like the main thread is blocked in memory.atomic.wait32(addr=0x104198, expected=0x80000002, timeout=None). Given that I'm not sure that the issue is in Wasmtime and this looks like a wasi-libc issue perhaps? (e.g. something perhaps related to locking in the initialization of files and such.

@tschneidereit
Copy link
Member

@abrown this seems relevant to your interests :)

@abrown
Copy link
Collaborator

abrown commented Apr 26, 2024

I ran @dbaeumer's code today on the latest version of Wasmtime (@dbaeumer, thanks for the great reproducer). I believe the hang is due to incorrect locking in wasi-libc. This is not a Wasmtime issue, but I'll explain what I figured out here:

When running with WASMTIME_LOG=wasmtime_runtime::threads=trace, I see one memory.atomic.wait that is not corresponded elsewhere:

2024-04-26T19:11:26.438492Z TRACE wasmtime_runtime::threads::shared_memory: memory.atomic.wait32(addr=0x103f2c, expected=2147483650, timeout=None)

The stack trace at this instruction looks like:

__wait
__lock
internal_register_preopened_fd_unlocked
__wasilibc_populate_preopens
__wasilibc_find_abspath
__wasilibc_find_relpath_alloc
__wasilibc_find_relpath
std::sys::pal::wasi::fs::open_parent::{{closure}}::h2a891633185b03b0
std::sys::pal::wasi::fs::File::open::h57333908bc41373d
std::fs::OpenOptions::_open::h0ba6508e7f3dab94
std::fs::OpenOptions::open::h917f7ccf95a22000
std::fs::File::open::h2bd1932e28feed10
wasm_threads::main::hd67815b1e283a1b0
core::ops::function::FnOnce::call_once::h0dba6a9e804e86ee
std::sys_common::backtrace::__rust_begin_short_backtrace::h884974562e00ebc8
std::rt::lang_start::{{closure}}::h79de5d68ac140761
std::rt::lang_start_internal::h301b28581f6245a7
std::rt::lang_start::h8217158c4515bb59
__main_void
_start

This points to a problem in internal_register_preopened_fd_unlocked. From the file name, it would appear that no locking should happen here because its parent, __wasilibc_populate_preopens, has already acquired the lock. But, in fact, resize attempts to reacquire it despite it being held in the parent.

Looking at Git blame, I added the additional locking very early on in the wasi-threads implementation effort (#311) and it looks like the mistake comes from not understanding how these preopens work correctly. I believe we can safely remove the lock in resize since it is not used elsewhere but I would like to reproduce end-to-end with the Rust toolchain; @alexcrichton, can you remind me how to use cargo with a modified version of wasi-libc?


EDIT: scratch the Git blame analysis; looking at this more, I think this was actually introduced in a refactoring in #408. @sunfishcode, it probably wasn't clear that resize retakes the lock? But it should be fine to remove it, right?

abrown added a commit to abrown/wasi-libc that referenced this issue Apr 26, 2024
Issue [#8392] in Wasmtime highlights how preopen registration can result
in a hang when compiled for a threaded environment. The problem is that
`internal_register_preopened_fd_unlocked` assumes it will not touch the
global lock because its caller, `__wasilibc_populate_preopens`, already
has taken the lock. Unfortunately, a refactoring in WebAssembly#408 (which
introduces `internal_register_preopened_fd_unlocked`) did not catch that
the `resize` function called internally also takes the global lock. This
change removes that locking in `resize` under the assumption that it
will only be called when the lock is already taken.

[#8392]: bytecodealliance/wasmtime#8392
@alexcrichton
Copy link
Member

@alexcrichton, can you remind me how to use cargo with a modified version of wasi-libc?

The "easiest" way is perhaps to:

  • Check-out wasi-sdk
  • Update/edit the wasi-libc submodule with your changes
  • Build wasi-sdk
  • Set WASI_SDK_PATH and build the Rust target

You can probably also finagle this with various sysroot flags and Rust configuration as well and get away with only building wasi-libc if you'd like to.

@abrown
Copy link
Collaborator

abrown commented May 2, 2024

Ok, here's how I verified that WebAssembly/wasi-libc#491 actually fixes this issue:

  • built wasi-libc with threads support: make THREAD_MODEL=posix
  • in each of the sysroot/* directories, I had to "rename" the target to what Rust believes it to be: ln -s wasm32-wasi-threads wasm32-wasip1-threads
  • with the rust repository cloned, configure config.toml to point to the modified wasi-libc:
    lld = true
    [target.wasm32-wasip1-threads]
    wasi-root = ".../wasi-libc/sysroot"
  • build rustc: ./x.py build --stage 1 --target wasm32-wasip1-threads
  • inform rustup of the new compiler: rustup toolchain link stage1 build/x86_64-unknown-linux-gnu/stage1
  • with @dbaeumer's repository cloned, build the reproducer with the modified rustc and wasi-libc: rustc +stage1 -v --target=wasm32-wasip1-threads src/main.rs
  • run the compiled reproducer with any recent Wasmtime:
    $ wasmtime --dir .::. --wasi threads=y --wasm threads=y main.wasm
    Before file open
    After file open
    hi number 1 from the spawned thread!
    hi number 2 from the spawned thread!
    hi number 3 from the spawned thread!
    hi number 4 from the spawned thread!
    hi number 5 from the spawned thread!
    hi number 6 from the spawned thread!
    hi number 7 from the spawned thread!
    hi number 8 from the spawned thread!
    hi number 9 from the spawned thread!
    Thread finished

Having run these steps, I'm a bit more confident that WebAssembly/wasi-libc#491 does not miss anything and, once that merges, I propose we close this.

abrown added a commit to WebAssembly/wasi-libc that referenced this issue May 2, 2024
Issue [#8392] in Wasmtime highlights how preopen registration can result
in a hang when compiled for a threaded environment. The problem is that
`internal_register_preopened_fd_unlocked` assumes it will not touch the
global lock because its caller, `__wasilibc_populate_preopens`, already
has taken the lock. Unfortunately, a refactoring in #408 (which
introduces `internal_register_preopened_fd_unlocked`) did not catch that
the `resize` function called internally also takes the global lock. This
change removes that locking in `resize` under the assumption that it
will only be called when the lock is already taken.

[#8392]: bytecodealliance/wasmtime#8392
@abrown abrown closed this as completed May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior in the current implementation that needs fixing
Projects
None yet
Development

No branches or pull requests

4 participants