
Why is encoding not in the type? #3

Open
skuzmich opened this issue Nov 5, 2020 · 4 comments

Comments

@skuzmich

skuzmich commented Nov 5, 2020

Wasm engines that know the available encoding statically would sometimes be able to generate more efficient code, especially when compiling without dynamic profile information.

Then we could have instructions to convert between different encodings. A conversion result could also be cached by the engine in an external hash table (or in an object slot) to avoid repeating the same conversion multiple times.
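For illustration, a minimal C sketch of what such slot-based caching could look like on the engine side (all names are hypothetical, and the transcoder is assumed to exist):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical engine-internal string: one slot per encoding, filled lazily. */
typedef struct {
    uint8_t  *wtf8;     /* NULL until a WTF-8 view is first requested */
    size_t    wtf8_len;
    uint16_t *wtf16;    /* NULL until a WTF-16 view is first requested */
    size_t    wtf16_len;
} EngineString;

/* Assumed transcoder; a real engine would provide this. */
uint8_t *transcode_wtf16_to_wtf8(const uint16_t *src, size_t len, size_t *out_len);

/* A "convert" instruction could consult the slot first, so repeated
   conversions of the same string become cache hits. */
const uint8_t *string_as_wtf8(EngineString *s) {
    if (s->wtf8 == NULL)
        s->wtf8 = transcode_wtf16_to_wtf8(s->wtf16, s->wtf16_len, &s->wtf8_len);
    return s->wtf8;
}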

Would JS embeddings require wrapping a JS string with an extra Wasm string object in order to provide an extra WTF-8 slot? Knowing the WTF-16 encoding statically would help to avoid that.

This could also scale better when adding more encodings in the future. More than two encoding slots per object would mandate indirect access, right?

@dcodeIO
Member

dcodeIO commented Nov 5, 2020

Yeah, more statically available information is always better. I'm wondering how we can provide it, though, given that Wasm modules and Wasm hosts are composable in many ways: a module may run in a browser, or, say, in Wasmtime, and an import may be a JS module or a Rust module. As such, this proposal mostly focuses on caching so far, but I am certainly interested in your ideas.

Regarding additional encodings, I currently would not expect that we'd need more anytime soon, but I may be wrong. Afaict, the bulk of languages use either W/UTF-8 or W/UTF-16, and supporting anything else (in browsers) may turn out to be impractical. Do you have any specific encoding in mind?

@skuzmich
Author

skuzmich commented Nov 5, 2020

I don't have a concrete additional encoding in mind. But now I think you would want indirection in the string object for both slots, because the alternative is preallocating space for a potentially unneeded encoding, which is ~2x memory consumption in the worst case, which is really bad.

Based on what I understood from your proposal, a universal string would look like:

struct WasmString {
   WTF8String  *slot1; /* lazily filled WTF-8 representation, may be null */
   WTF16String *slot2; /* lazily filled WTF-16 representation, may be null */
};

Which means:

  • extra allocation for each wasm string
  • extra indirection for char access (could be lifted and amortized in hot loops, but not always)

I'm assuming that when you pass a JS string to Wasm, you would allocate a WasmString and attach the JS string to slot2. You would need to cache the JS string -> Wasm string mapping in order to avoid allocating a different WasmString for the same JS string (assuming you don't want to force JS to use this 2-slot structure). It turns out there is a cost attached to JS->Wasm string interop, which is unfortunate.
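As a sketch of the boundary cache this implies (names hypothetical; a real embedder would likely use a weak map keyed on string identity):

#include <stddef.h>
#include <stdint.h>

typedef struct JSString JSString;       /* opaque host string */
typedef struct WasmString WasmString;   /* the 2-slot wrapper above */

WasmString *wrap_js_string(JSString *js);  /* assumed: allocates a wrapper */

#define CACHE_SIZE 1024

static struct { JSString *key; WasmString *value; } cache[CACHE_SIZE];

/* Same JS string -> same wrapper; collisions simply overwrite the entry. */
WasmString *to_wasm_string(JSString *js) {
    size_t i = ((uintptr_t)js >> 4) % CACHE_SIZE;  /* identity hash */
    if (cache[i].key == js)
        return cache[i].value;       /* hit: reuse the existing wrapper */
    WasmString *w = wrap_js_string(js);
    cache[i].key = js;
    cache[i].value = w;
    return w;
}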

Languages where String is not a primitive type but a subtype of other classes, like Object, would need custom fields in String for "type info" or a v-table. They would have to choose between:

  1. Universal strings wrapped in a GC struct

            (struct $String 
                    (field $vtable) 
                    (field $s stringref))
    
    • 2 indirections 👎🏻 👎🏻
    • Needs no copy when passing to JS 👍🏻
    • Needs two levels of caches to avoid useless box allocations 👎🏻 👎🏻
    • Universal?
  2. Wasm's flexible GC structs (arrays with custom fields)

           (struct $String 
                   (field $vtable) 
                   (field $s (array i16)))
    
    • 0 indirections 🚀
    • Needs a copy to work with JS or browser API 👎🏻
    • Needs a cache to avoid subsequent copy when interacting with JS 👎🏻
    • Not universal 👎🏻

So, if universal strings turn out to be inefficient, languages might choose to use a Wasm array when they need fast strings within a Wasm module, or they might keep compiling to JS for fast JS and DOM interop. Universal strings might become less than universal.

There might be a third compelling alternative with great JS interop and less overhead compared to "universal" strings:

  3. Use JS strings (e.g. a concrete WTF-16 string ref type)

            (struct $String 
                    (field $vtable) 
                    (field $s (ref $js_string)))
    
    • 1 indirection 👎🏻
    • No-copy JS interop 👍🏻
    • Needs 1 level of cache 👎🏻
    • Universal for JS embeddings across all WTF-16 languages 👍🏻

Alternatively, you could attach the runtime $vtable to a String only when it loses its static type (e.g. when cast to Object). But this approach has trade-offs and is still worse than attaching the v-table directly to the char array.
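A rough C sketch of that alternative (layout hypothetical): the string stays a bare character array, and a v-table box is only allocated at the point of the upcast.

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct VTable VTable;
extern const VTable string_vtable;   /* assumed runtime type info */

/* Bare string used wherever the static type is known: no v-table field. */
typedef struct {
    size_t   len;
    uint16_t chars[];   /* inline WTF-16 code units */
} BareString;

/* Box allocated only when a string is upcast to Object. */
typedef struct {
    const VTable *vtable;
    BareString   *payload;   /* the extra indirection appears here */
} ObjectBox;

/* The cost moves from every string to every upcast. */
ObjectBox *upcast_to_object(BareString *s) {
    ObjectBox *box = malloc(sizeof *box);
    box->vtable  = &string_vtable;
    box->payload = s;
    return box;
}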


If you have an API that wants to be compatible across embeddings, it may choose a concrete encoding (or provide multiple overloads for different string encodings). Languages that want to use these APIs would adapt their strings at the boundaries if needed.

I'm afraid I don't see a way to make a universal string type that supports multiple encodings without sacrificing use-cases within a single known encoding.

@dcodeIO
Member

dcodeIO commented Nov 5, 2020

Languages where String is not a primitive type but a subtype of other classes, like Object, would need custom fields in String for "type info" or a v-table.

Do you know how the JVM solves this?

Wasm's flexible GC structs (arrays with custom fields)

I guess, analogous to arrays with custom fields, there may as well be strings with custom fields, with the encoding slots being implicit.

Use JS strings (e.g. a concrete WTF-16 string ref type)

Yeah, that's one of the alternative ideas brought up in related discussion, using type pre-imports. The hope expressed was that if enough languages (running in the browser) go for it, these languages will be interoperable (in the browser), but I worry that non-WTF-16 languages will not adopt it, or non-WTF-16 hosts won't provide it, again leading to copying overhead, ecosystem fragmentation or the like. If everything else fails, I do consider it the most viable alternative, though.

Regarding efficiency in general, my mental model is that dynamic checks are only necessary where an encoding cannot be statically determined: for example, only the first instruction in a code path needs to do a dynamic check, and subsequent code becomes just an indirection. It becomes even easier when a module uses one encoding scheme exclusively (the common case), where the check can be performed at the boundary, so engines may be able to avoid the indirection. There's more written down in this paragraph. Do you see problems with the assumptions made there?
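As a C sketch of that mental model (names illustrative): the dynamic check runs once before the hot loop, and the loop body only sees the concrete representation.

#include <stddef.h>
#include <stdint.h>

typedef enum { ENC_WTF8, ENC_WTF16 } Encoding;

typedef struct {
    Encoding  enc;
    void     *data;
    size_t    len;
} UniversalString;

/* One dynamic check at the top; each branch then runs check-free. */
uint32_t sum_code_units(const UniversalString *s) {
    uint32_t sum = 0;
    if (s->enc == ENC_WTF16) {
        const uint16_t *p = s->data;
        for (size_t i = 0; i < s->len; i++) sum += p[i];
    } else {
        const uint8_t *p = s->data;
        for (size_t i = 0; i < s->len; i++) sum += p[i];
    }
    return sum;
}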

@skuzmich
Author

skuzmich commented Nov 6, 2020

Do you know how the JVM solves this?

Major JVM implementations use an array with a custom field prefix inherited from Object, without indirection.
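Roughly, as a C sketch (illustrative, not actual JVM internals): the Object header fields and the characters share a single allocation, so element access is one load with no pointer chase.

#include <stdint.h>

/* Single-allocation layout: Object header prefix, then inline characters. */
typedef struct {
    void    *klass;     /* type info / v-table, inherited from Object */
    uint32_t hash;      /* cached identity hash, also from Object */
    uint32_t len;
    uint16_t chars[];   /* UTF-16 code units, no separate array object */
} JavaStringLike;

uint16_t char_at(const JavaStringLike *s, uint32_t i) {
    return s->chars[i];  /* one load, no indirection */
}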

JVM languages have to use this String class and the Java object model if they want to be efficient and compatible with the enormous number of Java libraries and legacy codebases.

Sadly, nobody uses this new Wasm GC string type we are designing yet, so it would be hard to convince languages to adopt it in cases where an alternative is better for their non-universal use-cases, for example targeting browsers and interacting with JS and the DOM, where the only currently relevant language (ecosystem-wise) is JavaScript.

I guess, analogous to arrays with custom fields, there may as well be strings with custom fields, with the encoding slots being implicit.

Custom fields would make strings incompatible with JS, the DOM, and other languages, requiring copies, am I right?

There's more written down in this paragraph. Do you see problems with the assumptions made there?

Yes, there are a lot of optimization opportunities, but it is hard to predict how well they compare to encoding-specific strings.
Some potential problems off the top of my head (the first one is sketched below):
- The first access in a (non-inlined) function would require an encoding check and unwrapping
- Unwrapping would increase register pressure
- An extra 64+ bits of memory per string
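
For the first point, a sketch (same illustrative UniversalString shape as before) of what every non-inlined callee would repeat on entry:

#include <stddef.h>
#include <stdint.h>

typedef enum { ENC_WTF8, ENC_WTF16 } Encoding;
typedef struct { Encoding enc; void *data; size_t len; } UniversalString;

/* Without inlining, each callee repeats the check and the unwrap, and the
   unwrapped (encoding, pointer, length) triple ties up extra registers. */
uint16_t first_code_unit(const UniversalString *s) {
    if (s->enc == ENC_WTF16)
        return ((const uint16_t *)s->data)[0];
    return ((const uint8_t *)s->data)[0];
}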


Thinking about it a little more, it feels like your universal Wasm type can be represented as a Wasm GC struct of two concrete string types, and check-avoiding optimizations can be done by language toolchains. So languages can use concrete types internally where possible, and the hard problem would be convincing languages to agree on using this GC struct at interface boundaries. But I might be missing something here.
