Serialize a compiled module #618

Closed
clarkmcc opened this issue Jun 3, 2022 · 12 comments · Fixed by #747
Labels
enhancement New feature or request

Comments

@clarkmcc

clarkmcc commented Jun 3, 2022

Describe the solution you'd like
Is it possible to replicate this feature from wasmer-go? Compiling a module results in a significant number of allocations, so I'd like to compile a batch of modules ahead of time to reduce the runtime memory overhead.

@clarkmcc clarkmcc added the enhancement New feature or request label Jun 3, 2022
@codefromthecrypt
Collaborator

I guess what we are really talking about here is externalizing the compilation cache, plus possibly some guard to make sure the cache is invalidated if the structure of the code changes.

		codes           map[wasm.ModuleID][]*code // guarded by mutex.

I think we can look into this after SIMD is done, wdyt @mathetake?

PS, related, but not exactly the same as this: #179

@F21
Contributor

F21 commented Jun 3, 2022

I think this will also help us improve mjml-go performance: https://github.com/Boostport/mjml-go#benchmarks

Currently, spinning up a new worker using InstantiateModule() is quite expensive, so if we can clone a module, it would drive a lot of performance improvements.

@codefromthecrypt
Collaborator

@F21 I think you already have an issue about InstantiateModule(), #602, which is unrelated to this one because Compile happens before instantiation. Just setting expectations.

@F21
Contributor

F21 commented Jun 3, 2022

Oops, I must have misread. This feature would still potentially be useful for us. We're currently shipping a .wasm compressed with brotli, which is decompressed and compiled in init(). If the file size from serializing a compiled module is acceptable, we can just ship that directly in our library and remove the .wasm completely.

@mathetake
Member

mathetake commented Jun 3, 2022

Note that this comes with severe security concerns, as it means we would allow users to directly execute arbitrary native code without the validation passes (which are applied during wazero.CompileBinary). That means we should also provide some way to do binary signing or something similar, e.g. https://github.com/wasm-signatures/design

Regardless, I will work on this after finishing SIMD instructions (== completion of Wasm 2.0 Draft). Stay tuned!

@codefromthecrypt
Collaborator

thanks for the feedback folks, this is great. don't worry too much about issue classification as we can sort it out.

@mathetake
Member

I think all is set to implement the externalization of compiled native code...

@mathetake
Member

mathetake commented Jun 30, 2022

One thing we have to figure out is how to do versioning/invalidation: if the internals of the compilers change, a previously compiled module won't work with the latest version of wazero. Therefore, we have to know whether the compiled module binary's version (the version of the wazero that compiled it) matches the version of the running wazero. Embedding the commit hash into a global variable (via Go's linker flag) would help in general if wazero were the executable, but here wazero is a library.
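A minimal sketch of one way a library can discover its own version without linker flags, assuming the embedding binary was built in module mode: the standard `runtime/debug.ReadBuildInfo` lists every dependency with the version the Go toolchain resolved (a pseudo-version containing a commit hash prefix when an untagged commit is used). The helper name `dependencyVersion` is hypothetical, and this is not necessarily what wazero ended up doing.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// dependencyVersion reports the version of the module at path as recorded in
// the running binary's build information. It returns false when build info is
// unavailable, when the module is not a dependency, or when the module was
// replaced (a replaced module's recorded version is not trustworthy).
func dependencyVersion(path string) (string, bool) {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		return "", false
	}
	for _, dep := range info.Deps {
		if dep.Path != path {
			continue
		}
		if dep.Replace != nil {
			return "", false
		}
		return dep.Version, true
	}
	return "", false
}

func main() {
	// When wazero is not a dependency of this binary, ok is false and a
	// cache layer built on this helper would simply have to skip caching.
	v, ok := dependencyVersion("github.com/tetratelabs/wazero")
	fmt.Println(v, ok)
}
```

A lookup like this only answers "which version am I", so it inherits the drawback of any version-based key: a new version invalidates the cache even when the compiler did not change.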

@codefromthecrypt
Collaborator

ack we need to look at existing VMs and how they do bytecode validation and to what degree this applies to us. Especially we shouldn't commit to a serialization mechanic prior to aug-31 which is our first beta, but that doesn't prevent experimentation before that.

@clarkmcc
Author

clarkmcc commented Jul 1, 2022

@mathetake the challenge with the commit hash approach is that it will change, requiring re-serialization even if the compiler internals did not change. This project is currently pushing several commits a day. Changes to the ARM compiler for example would not need to impact modules compiled for x86 (not sure if that's even a legitimate example, but you get the idea).

Obviously, the harder problem is how do you actually version optimistically. I briefly looked at Wasmer and it looks like they have a dedicated serialization format that is manually adjusted whenever breaking changes are made. See wasmerio/wasmer#2747, CosmWasm/cosmwasm#1223. The downside of this approach is that it requires someone who is involved enough in the project to bump that version if a change to the project would break the serialization format.

The other option (and I don't know what this would actually look like in practice yet) is to hand-craft a wasm module that is able to get as close to 100% test coverage of the underlying interpreter as possible, and then the CI/CD indicates that we need to bump the serialization format if running that module fails.
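As a rough illustration of the manually bumped format version described above, the sketch below keeps one version number per architecture, so a change to the ARM compiler would not invalidate artifacts compiled for x86. All names and version numbers here (`formatVersion`, `artifactHeader`, `usable`) are hypothetical and are not taken from Wasmer or wazero.

```go
package main

import "fmt"

// formatVersion holds a hand-maintained serialization format version per
// architecture. A maintainer bumps only the entry whose compiler output
// changed incompatibly. The values are made up for illustration.
var formatVersion = map[string]uint32{
	"amd64": 3,
	"arm64": 5,
}

// artifactHeader is the metadata stored in front of a serialized module.
type artifactHeader struct {
	Arch   string // architecture the module was compiled for
	Format uint32 // format version in effect when it was compiled
}

// usable reports whether an artifact can be loaded by this build: its
// architecture must be known and its format version must match exactly.
func usable(h artifactHeader) bool {
	current, ok := formatVersion[h.Arch]
	return ok && current == h.Format
}

func main() {
	fmt.Println(usable(artifactHeader{Arch: "amd64", Format: 3})) // true
	fmt.Println(usable(artifactHeader{Arch: "arm64", Format: 4})) // false
}
```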

@codefromthecrypt
Collaborator

good points, @clarkmcc. I also don't believe a commit hash approach would work unless stable tags are in use (ex monthly tags starting end of august), and even then I think we may end up needing something more fine tuned than that

I plan to add some research here as well as the problem isn't unlike other tools with compilation cache regardless of webassembly. I'll also have a look at the links you mentioned while takeshi is out.

@codefromthecrypt
Collaborator

To be transparent about current thinking and to not block others doing it :)

First, it would help if someone could profile what's taking the longest for their code, as there are multiple stages. For example, the code is parsed and converted to wazeroir (which should become more stable since we have all 2.0 features now). That is an easier thing to cache if it turns out to be a problem, though it varies with the feature flags used.

Next, the code isn't organized for cache externalization right now. For example, there is some organization done to support concurrent compilation and some caching aspects (ex compiler.engine.codes). This doesn't really imply we can externalize yet, because the inputs aren't explicitly organized in a way that lets us invalidate heuristically. The inputs are the module and the feature flags; specifically, the table and global elements of the module can affect how functions are compiled. In other words, I expect there is more organization work to do before attempting to externalize, because the focus so far has been on completing compiler features.
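To make the idea of explicit inputs concrete, here is a minimal sketch of a cache key derived from everything named above: the module bytes (which include its table and global sections), the enabled feature flags, and the target architecture. The function name `cacheKey` and the representation of feature flags as a `uint64` bit set are assumptions for illustration, not wazero's internal layout.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"runtime"
)

// cacheKey hashes every input that can change the compiled output, so any
// difference in flags, architecture, or module contents yields a new key.
func cacheKey(wasm []byte, features uint64, arch string) string {
	h := sha256.New()

	// Fixed-width encoding keeps the flag bits unambiguous in the stream.
	var flags [8]byte
	binary.LittleEndian.PutUint64(flags[:], features)
	h.Write(flags[:])

	// A zero byte separates the architecture name from the module bytes.
	h.Write([]byte(arch))
	h.Write([]byte{0})

	h.Write(wasm)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Magic number and version of an empty WebAssembly binary module.
	module := []byte("\x00asm\x01\x00\x00\x00")
	fmt.Println(cacheKey(module, 0, runtime.GOARCH))
}
```

A key like this catches changed inputs but not changed compiler internals, so it would still need to be combined with some form of version check.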

Related art I was thinking about are tools that also deal with varied inputs and produce an externalized, but validated, cache. We can think about Go's build cache, and things in other environments with a lot of mileage, such as Java's validation or Gradle's layered validation approaches. Not trying to over-analyze here, but at the very least we should deep dive into how other wasm tools and Go work. I do think it is worth looking at at least one outside tool (ex Java or GraalVM), as that often gives perspective. This sort of research can run concurrently with code organization and other sorts of things.

I think something like this is tricky enough to take several weeks of work to pull off in a sustainable way. Since not all consumers update wazero commit-by-commit, we may be able to cheat and make some iterative progress in an experimental area that punts to a commit hash to buy time. However, I think that by the end of research, organization and design, the commit hash will either be tentative, or a cheaper validation option available alongside a heuristic one.

Meanwhile to set expectations, takeshi's out for 10 days. I don't plan to re-organize the compiler while he's away, so the thing that could help if anyone wants to contribute is more background on alternative compilation caches, and/or profiling of wasm they use to see where things are hurting most. Profiling might also show a way to progress meanwhile (ex parallel function compilation), which might be good enough for a bit.

Hope this helps!

mathetake added a commit that referenced this issue Aug 18, 2022
This adds the experimental support of the file system compilation cache.
Notably, experimental.WithCompilationCacheDirName allows users to configure
where the compiler writes the cache into.

Versioning/validation of binary compatibility has been done via the release tag
(which will be created from the end of this month). More specifically, the cache
file starts with a header with the hardcoded wazero version.


Fixes #618

Signed-off-by: Takeshi Yoneda <takeshi@tetrate.io>
Co-authored-by: Crypt Keeper <64215+codefromthecrypt@users.noreply.github.com>
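The header check described in the commit message can be sketched roughly as follows: a cache file begins with a magic marker and the version string of the wazero that wrote it, and a reader rejects the file unless that version equals its own. The magic value, the one-byte length prefix, the version strings, and the function names are all assumptions made for this illustration; the actual on-disk layout is defined by the pull request, not by this sketch.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// magic marks a file as a compilation cache entry (hypothetical value).
var magic = []byte("WAZCACHE")

// encodeHeader returns magic, a one-byte version length, then the version.
func encodeHeader(version string) ([]byte, error) {
	if len(version) > 255 {
		return nil, errors.New("version string too long")
	}
	out := append([]byte{}, magic...)
	out = append(out, byte(len(version)))
	return append(out, version...), nil
}

// checkHeader validates the header of data against the running version and
// returns the payload that follows it. Any mismatch is an error, so the
// caller can fall back to compiling from the original Wasm binary.
func checkHeader(data []byte, running string) ([]byte, error) {
	if !bytes.HasPrefix(data, magic) {
		return nil, errors.New("not a cache file")
	}
	rest := data[len(magic):]
	if len(rest) < 1 || len(rest)-1 < int(rest[0]) {
		return nil, errors.New("truncated header")
	}
	n := int(rest[0])
	if got := string(rest[1 : 1+n]); got != running {
		return nil, fmt.Errorf("cache written by %q, running %q", got, running)
	}
	return rest[1+n:], nil
}

func main() {
	hdr, _ := encodeHeader("1.0.0-beta.1")
	file := append(hdr, 0xde, 0xad) // header followed by stand-in payload bytes

	_, err := checkHeader(file, "1.0.0-beta.2")
	fmt.Println(err != nil) // true: versions differ, so the entry is rejected
}
```

Note that a version check only guards against stale entries. It does not address the earlier concern about executing unvalidated native code, which would need signing or restricted file permissions on the cache directory.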

4 participants