Investigate WASM "calling conventions" and passing non-scalar datatypes like strings #106

Open
mildbyte opened this issue Sep 12, 2022 · 1 comment

Comments

@mildbyte
Contributor

Currently, our WASM functions only support passing basic types like ints and floats. In order to be able to pass something more complex like strings or datetimes, we want to put them in the WASM memory and point the UDF to it.

We need to figure out the most ergonomic way for the function writer to do this. For reference, something like this:

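#include <emscripten.h>
#include <stdlib.h>
#include <string.h>

// Returns a newly allocated copy of `input` without its last two characters.
// The caller owns the returned buffer and must free() it.
EMSCRIPTEN_KEEPALIVE char* test_string(char* input) {
    size_t len = strlen(input);
    size_t out_len = len >= 2 ? len - 2 : 0;  // avoid underflow on short inputs

    char *out = malloc(out_len + 1);
    if (out == NULL) return NULL;

    memcpy(out, input, out_len);
    out[out_len] = '\0';  // strncpy would leave the truncated copy unterminated
    return out;
}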

compiles to:

(type (;0;) (func (param i32) (result i32)))
...
  (func (;3;) (type 0) (param i32) (result i32)
    (local i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)
    block  ;; label = @1
      local.get 0
      local.tee 9
      i32.const 3
      i32.and
      if  ;; label = @2
        loop  ;; label = @3
          local.get 0
...

This should work out of the box, without having to write a wrapper that converts some binary representation into a C string.

@neumark
Collaborator

neumark commented Oct 3, 2022

wasm UDF communication

Passing native primitive datatypes (i32, i64, f32, f64) from the host (wasmtime for seafowl) and receiving a native primitive result is straightforward. More complex data types such as strings and structs, not so much.

WIT

In the long run, WebAssembly Interface Types (WIT) promise to provide an elegant solution to the problem of passing complex data between the host and WebAssembly functions written in various high-level languages. WIT includes an IDL, also called "wit", which can be used for code generation.
For example, below is the WIT description of a function that converts an input string to uppercase and returns the result:

upper: func(s: string) -> string

import code

WIT-generated calling code, in our case run by the seafowl process.

#[allow(clippy::all)]
mod input {
  pub fn upper(s: & str,) -> String{
    unsafe {
      let vec0 = s;
      let ptr0 = vec0.as_ptr() as i32;
      let len0 = vec0.len() as i32;
      
      #[repr(align(4))]
      struct __InputRetArea([u8; 8]);
      let mut __input_ret_area: __InputRetArea = __InputRetArea([0; 8]);
      let ptr1 = __input_ret_area.0.as_mut_ptr() as i32;
      #[link(wasm_import_module = "input")]
      extern "C" {
        #[cfg_attr(target_arch = "wasm32", link_name = "upper: func(s: string) -> string")]
        #[cfg_attr(not(target_arch = "wasm32"), link_name = "input_upper: func(s: string) -> string")]
        fn wit_import(_: i32, _: i32, _: i32, );
      }
      wit_import(ptr0, len0, ptr1);
      let len2 = *((ptr1 + 4) as *const i32) as usize;
      String::from_utf8(Vec::from_raw_parts(*((ptr1 + 0) as *const i32) as *mut _, len2, len2)).unwrap()
    }
  }
}

export code

WIT-generated wrapper around guest code (in our case the UDF).

#[allow(clippy::all)]
mod input {
  #[export_name = "upper: func(s: string) -> string"]
  unsafe extern "C" fn __wit_bindgen_input_upper(arg0: i32, arg1: i32, ) -> i32{
    let len0 = arg1 as usize;
    let result1 = <super::Input as Input>::upper(String::from_utf8(Vec::from_raw_parts(arg0 as *mut _, len0, len0)).unwrap());
    let ptr2 = __INPUT_RET_AREA.0.as_mut_ptr() as i32;
    let vec3 = (result1.into_bytes()).into_boxed_slice();
    let ptr3 = vec3.as_ptr() as i32;
    let len3 = vec3.len() as i32;
    core::mem::forget(vec3);
    *((ptr2 + 4) as *mut i32) = len3;
    *((ptr2 + 0) as *mut i32) = ptr3;
    ptr2
  }
  #[export_name = "cabi_post_upper"]
  unsafe extern "C" fn __wit_bindgen_input_upper_post_return(arg0: i32, ) {
    wit_bindgen_guest_rust::rt::dealloc(*((arg0 + 0) as *const i32), (*((arg0 + 4) as *const i32)) as usize, 1);
  }
  
  #[repr(align(4))]
  struct __InputRetArea([u8; 8]);
  static mut __INPUT_RET_AREA: __InputRetArea = __InputRetArea([0; 8]);
  pub trait Input {
    fn upper(s: String,) -> String;
  }
}

There exists a very early pre-alpha WIT implementation for Rust supporting both Rust hosts and WASM guests. The developers urge everyone interested in using this in production to hold their horses and look for alternatives while the WIT standard is finalized, which I'd guess will take somewhere between 12 and 18 months from now.

Alternatives until WIT can be used

Passing raw strings

The least ambitious, but by no means easiest, approach is to extend the integer and float types currently supported in seafowl UDFs with strings. Not only would this provide support for using CHAR, TEXT and VARCHAR types in UDFs, but more complex data structures could also be submitted as strings serialized with JSON, MessagePack, CBOR, etc.

I wrote an example proof-of-concept upper() function based on this excellent blogpost. Both the code invoking the WASM function and the upper() function itself are fairly complex.

The complexity stems from the following:

  • WASM functions cannot access the host's memory. Any input or output passed via pointers must point to the module's memory. This places the burden on the host of copying input from the host's memory to the guest's, malloc()-ing guest memory, copying the results back to host memory, and free()-ing input and output buffers. The result buffer must be allocated by the guest (since the size of the response isn't necessarily known in advance), but must be freed by the host (since it must read the result before the buffer is deallocated).
  • Pascal vs C-style strings. C strings are just raw pointers to bytes terminated with \0. Pascal-style strings are prepended with their length in bytes, which is generally considered a better design these days. Naively returning a (length, pointer) pair would require returning multiple values, which isn't possible, but receiving and passing a pointer to the i32-encoded string length followed by the string itself is possible (the WIT-generated code above does something similar: it returns a single pointer to an 8-byte return area holding the string's pointer and length).
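To make the difference between the two string layouts concrete, here is a minimal sketch in plain Rust that reads both from a byte slice standing in for the module's linear memory. The function names, the little-endian 4-byte length prefix, and the use of a `&[u8]` instead of a real wasmtime memory are all assumptions made for illustration.

```rust
use std::convert::TryFrom;

/// Reads a C-style string: the bytes from `ptr` up to (not including) the first `\0`.
/// Returns None if `ptr` is out of bounds or no terminator is found.
fn read_c_string(memory: &[u8], ptr: usize) -> Option<&[u8]> {
    let tail = memory.get(ptr..)?;
    let end = tail.iter().position(|&b| b == 0)?;
    Some(&tail[..end])
}

/// Reads a Pascal-style string: a little-endian i32 length at `ptr`,
/// followed by that many bytes of payload.
/// Returns None if the header or payload runs past the end of `memory`.
fn read_pascal_string(memory: &[u8], ptr: usize) -> Option<&[u8]> {
    let header = memory.get(ptr..ptr.checked_add(4)?)?;
    let len = i32::from_le_bytes(<[u8; 4]>::try_from(header).ok()?);
    let len = usize::try_from(len).ok()?; // reject negative lengths
    let start = ptr + 4;
    memory.get(start..start.checked_add(len)?)
}
```

The length-prefixed reader never has to scan for a terminator, and its payload may contain `\0` bytes, which matters once the "string" is really a serialized binary blob.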

If the strings aren't necessarily UTF-8 strings, but rather MessagePack-encoded streams of values, then all of the function arguments could be encoded in a single string, resulting in a simplified UDF WASM function signature:

fn(len: u32, ptr: u32) -> u32

Here the result is a pointer to a Pascal-style (length-prefixed) string, returned through a single i32 in the same way the WIT-generated code returns a pointer to its result area.
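The copy-in, call, copy-out choreography behind that signature can be sketched without a WASM runtime. In the sketch below a `Vec<u8>` plays the role of the guest's linear memory, and `GuestMemory`, `alloc`, `guest_upper` and `host_call_upper` are hypothetical names, not seafowl or wasmtime APIs. The length prefix is assumed to be a little-endian u32.

```rust
/// Stand-in for a WASM module's linear memory with a trivial bump allocator.
struct GuestMemory {
    bytes: Vec<u8>,
}

impl GuestMemory {
    fn new() -> Self {
        // Offset 0 is reserved so that an allocation never returns a null pointer.
        GuestMemory { bytes: vec![0] }
    }

    /// Hypothetical guest-exported malloc: returns the offset of `len` fresh bytes.
    fn alloc(&mut self, len: u32) -> u32 {
        let ptr = self.bytes.len() as u32;
        self.bytes.resize(self.bytes.len() + len as usize, 0);
        ptr
    }

    fn slice(&self, ptr: u32, len: u32) -> &[u8] {
        &self.bytes[ptr as usize..(ptr + len) as usize]
    }
}

/// Guest side, shaped like `fn(len: u32, ptr: u32) -> u32`.
/// Uppercases the input and returns a pointer to `[len: u32 LE][payload]`.
fn guest_upper(mem: &mut GuestMemory, len: u32, ptr: u32) -> u32 {
    let output = mem.slice(ptr, len).to_ascii_uppercase();
    let out_ptr = mem.alloc(4 + output.len() as u32);
    let start = out_ptr as usize;
    mem.bytes[start..start + 4].copy_from_slice(&(output.len() as u32).to_le_bytes());
    mem.bytes[start + 4..start + 4 + output.len()].copy_from_slice(&output);
    out_ptr
}

/// Host side: copy the input into guest memory, call the guest, copy the result out.
/// A real host would also call the guest's free on both buffers afterwards.
fn host_call_upper(mem: &mut GuestMemory, input: &str) -> String {
    let in_len = input.len() as u32;
    let in_ptr = mem.alloc(in_len);
    let start = in_ptr as usize;
    mem.bytes[start..start + input.len()].copy_from_slice(input.as_bytes());

    let out_ptr = guest_upper(mem, in_len, in_ptr);

    let mut header = [0u8; 4];
    header.copy_from_slice(mem.slice(out_ptr, 4));
    let out_len = u32::from_le_bytes(header);
    String::from_utf8_lossy(mem.slice(out_ptr + 4, out_len)).into_owned()
}
```

Even in this toy form the host needs two copies and two allocations per call, which is the overhead the alternatives below try to hide from the UDF author.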

WaPC

The waPC project attempts to simplify wasm host-guest RPC. They provide a Rust host and a number of supported guest languages. WaPC has its own GraphQL-inspired IDL (WIDL). Based on GitHub activity, it seems to be an active project but lacks significant backing (written mostly by 3 guys at a startup called Vino until recently). Links to step-by-step tutorials are all broken. WaPC uses MessagePack to serialize data by default.

WASM-bindgen

As a name that kept coming up during my research, wasm-bindgen deserves a mention. It's a mature solution for WASM RPC, but unfortunately limited to JavaScript host -> Rust WASM module guest calls. There was experimental support for WIT, but it's no longer supported. In a future where WIT support returns, wasm-bindgen could be an ergonomic route to UDFs with complex inputs / outputs. Currently the guide on using it with Rust hosts does not work as advertised.

WASI-based communication

The WebAssembly System Interface (WASI) is an extension to WASM that gives module functions an interface for interacting with the host filesystem, command line arguments, environment variables, etc.
Like most things WASM-related, WASI itself is still in its infancy and subject to change (the compiled wasm links to wasi_snapshot_preview1). Still, unlike WIT, WASI is already used in production and using it doesn't require a PhD in compiler design. Based on this blog post I implemented a version of upper() which gets its input from environment variables and prints the result to stdout. The env vars and standard output aren't the actual env vars and stdout of the host process; they're what seafowl passes as such to wasmtime. In other words, it's a convenient way to pass state to the WASM module function without having to deal with all the malloc and free choreography of the first solution. How much overhead this solution incurs compared to the first solution, I don't know yet.

Recommendation

Everyone, including myself, looks upon WIT as the "ultimate" solution to WASM RPC. Unfortunately, when WIT stabilizes is anyone's guess. The good news is that we don't have to commit to a single UDF interface for all time.

Seafowl already expects a language field in its UDF function creation statement, which could be used to distinguish between calling conventions.
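As a rough illustration of that dispatch, the sketch below maps a language string to a calling convention. The enum, the function name and both accepted strings are placeholders chosen for this example; they are assumptions, not values that seafowl is known to define.

```rust
/// Hypothetical calling conventions a UDF could declare.
#[derive(Debug, PartialEq)]
enum CallingConvention {
    /// Arguments and result are passed directly as i32/i64/f32/f64.
    Scalar,
    /// Arguments and result are serialized and exchanged over WASI stdin/stdout.
    WasiStdio,
}

/// Maps the UDF's `language` field to a calling convention.
/// Both accepted strings are illustrative placeholders.
fn calling_convention(language: &str) -> Result<CallingConvention, String> {
    match language {
        "wasm" => Ok(CallingConvention::Scalar),
        "wasm-wasi-stdio" => Ok(CallingConvention::WasiStdio),
        other => Err(format!("unsupported UDF language: {}", other)),
    }
}
```

A new convention (for example a WIT-based one, once that stabilizes) would then be one more enum variant and match arm, without touching existing UDFs.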

If the overhead of using WASI is acceptable, reading serialized input from stdin and writing serialized output to stdout seems like a more ergonomic approach than requiring UDF authors to hand-write code similar to what WIT generates. We could even allow error messages to be sent to stderr.
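From the UDF author's point of view, the stdin/stdout convention could look roughly like the following guest-side sketch. It is generic over readers and writers so it can be exercised without a WASM runtime; `run_upper` is a hypothetical name, and plain UTF-8 text stands in for whatever serialization format is eventually chosen.

```rust
use std::io::{self, Read, Write};

/// Reads all of `input`, uppercases it, and writes the result to `output`.
/// On invalid UTF-8, writes a message to `errors` and returns an error instead.
fn run_upper<R: Read, W: Write, E: Write>(
    mut input: R,
    mut output: W,
    mut errors: E,
) -> io::Result<()> {
    let mut raw = Vec::new();
    input.read_to_end(&mut raw)?;
    match String::from_utf8(raw) {
        Ok(text) => output.write_all(text.to_uppercase().as_bytes()),
        Err(err) => {
            writeln!(errors, "input is not valid UTF-8: {}", err)?;
            Err(io::Error::new(io::ErrorKind::InvalidData, "invalid UTF-8"))
        }
    }
}
```

In a guest compiled for WASI, `main` would call `run_upper(io::stdin(), io::stdout(), io::stderr())`. There is no malloc or free in sight, since the host and the WASI layer do the copying.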

For "normal" UDFs, the input consists of a tuple of supported arrow types, so the serialized input could look something like this:

| i32: total bytes | MessagePack-encoded vector of arrow types | MessagePack stream of serialized values |
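A minimal sketch of writing and reading that frame is below. It treats the two MessagePack sections as already-encoded opaque bytes, so no MessagePack library is involved. Two details are not specified above and are assumed here: the prefix is little-endian, and "total bytes" counts only the bytes that follow the prefix. `build_frame` and `read_frame` are hypothetical names.

```rust
use std::convert::TryFrom;

/// Builds `| i32 total bytes | types | values |`.
/// `types` and `values` are treated as already-encoded MessagePack bytes.
/// Assumption: the prefix is little-endian and counts only the bytes after it.
fn build_frame(types: &[u8], values: &[u8]) -> Option<Vec<u8>> {
    let payload_len = types.len().checked_add(values.len())?;
    let total = i32::try_from(payload_len).ok()?; // fails if the payload exceeds i32::MAX
    let mut frame = Vec::with_capacity(4 + payload_len);
    frame.extend_from_slice(&total.to_le_bytes());
    frame.extend_from_slice(types);
    frame.extend_from_slice(values);
    Some(frame)
}

/// Returns the payload declared by the prefix, or None if the frame is
/// truncated or the declared length is negative.
fn read_frame(frame: &[u8]) -> Option<&[u8]> {
    let header = <[u8; 4]>::try_from(frame.get(..4)?).ok()?;
    let total = usize::try_from(i32::from_le_bytes(header)).ok()?;
    let end = 4usize.checked_add(total)?;
    frame.get(4..end)
}
```

Because MessagePack values are self-delimiting, a decoder reading the payload can find where the type vector ends and the value stream begins without a second length field.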
