.toUTF8 vs .fromUTF8 behavior is inconsistent, confusing and inefficient #387

Closed
kyegupov opened this issue Dec 23, 2018 · 3 comments

Comments

@kyegupov

WASM is a low-level virtual machine, so it should be able to handle strings represented as binary arrays.
There are two handy methods for this, .fromUTF8 and .toUTF8: https://github.com/AssemblyScript/assemblyscript/blob/master/std/assembly/string.ts#L499

However, they are asymmetric in three ways:

  • fromUTF8 requires the string length, while toUTF8 does not return it. You have to call lengthUTF8 separately, which is wasteful (toUTF8 already calls it internally).
  • fromUTF8 handles \0 characters inside strings correctly, while toUTF8, by zero-terminating its output, gives the impression that they are not supported.
  • fromUTF8 requires the exact size of the encoded string in bytes, while lengthUTF8 returns the size including the terminating zero byte. Even the existing tests have to adjust for that explicitly:
    assert(String.fromUTF8(ptr, len - 1) == str);
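
To make the three points concrete, here is a minimal model of the API shape in plain TypeScript. It is not AssemblyScript's implementation: the names lengthUTF8Model, toUTF8Model, fromUTF8Model and cStringLength are hypothetical, and a Uint8Array stands in for a raw pointer into linear memory.

```typescript
// Hypothetical model of the zero-terminated API shape described above.
// These helpers are assumptions for illustration, not the real std library.
const encoder = new TextEncoder();
const decoder = new TextDecoder();

// Models lengthUTF8: encoded byte count plus one byte for the terminator.
function lengthUTF8Model(str: string): number {
  return encoder.encode(str).length + 1;
}

// Models toUTF8: encoded bytes followed by a single 0 byte; no length is returned.
function toUTF8Model(str: string): Uint8Array {
  const bytes = encoder.encode(str);
  const out = new Uint8Array(bytes.length + 1); // zero-filled, so the last byte is 0
  out.set(bytes);
  return out;
}

// Models fromUTF8: decodes exactly `len` bytes and never looks for a terminator.
function fromUTF8Model(buf: Uint8Array, len: number): string {
  return decoder.decode(buf.subarray(0, len));
}

// What a C-style consumer of the pointer would do: stop at the first 0 byte.
function cStringLength(buf: Uint8Array): number {
  const end = buf.indexOf(0);
  return end < 0 ? buf.length : end;
}

const str = "a\0b";
const ptr = toUTF8Model(str);
const len = lengthUTF8Model(str); // second pass over the string just to get the size
const roundTrip = fromUTF8Model(ptr, len - 1); // caller has to subtract the terminator
const truncated = fromUTF8Model(ptr, cStringLength(ptr)); // embedded \0 ends the scan early
```

In this model the round trip only works with `len - 1`, and anything that relies on the terminator instead of an explicit length loses everything after the embedded \0.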

This is an inefficient and confusing approach. If the goal of AssemblyScript is to be a high-level, WASM-friendly language, then having C-isms in the standard library, like naked pointers to null-terminated strings, goes against that goal.

My suggestion would be to rename .lengthUTF8 and .toUTF8 to .lengthUTF8ZeroTerminated and .toUTF8ZeroTerminated, and to introduce .toUTF8Buffer, which returns an ArrayBuffer populated with the correct content and size. This API would be far clearer and more convenient for users.
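
A rough sketch of what I mean by .toUTF8Buffer, modeled in plain TypeScript rather than AssemblyScript. The free functions toUTF8Buffer and fromUTF8Buffer are hypothetical stand-ins for the proposed methods, and TextEncoder/TextDecoder stand in for the library's own encoder.

```typescript
// Hypothetical sketch of the proposed API; names and signatures are assumptions.
function toUTF8Buffer(str: string): ArrayBuffer {
  const bytes = new TextEncoder().encode(str);
  const buf = new ArrayBuffer(bytes.length); // exactly the encoded size, no terminator
  new Uint8Array(buf).set(bytes);
  return buf;
}

// Hypothetical inverse: the buffer's own byteLength is the only length needed.
function fromUTF8Buffer(buf: ArrayBuffer): string {
  return new TextDecoder().decode(buf);
}

const original = "h\u00e9\0llo"; // contains a two-byte character and an embedded \0
const encoded = toUTF8Buffer(original);
const decoded = fromUTF8Buffer(encoded);
```

Content and size travel together in one value, so there is no separate length call, no `- 1` adjustment, and \0 is just another character.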

@kyegupov
Author

An alternative solution would be to introduce a base type like

class MemSlice {
    constructor(readonly offset: usize, readonly length: usize) {}
    ...
}

which would be quite useful in general.
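
For example, a slice type along these lines could be used as follows. This is only a sketch in plain TypeScript: `number` stands in for `usize`, a Uint8Array stands in for linear memory, and the `end` getter, the `view` method and the writeUTF8 helper are hypothetical additions, not part of any existing API.

```typescript
// Sketch only: `number` replaces AssemblyScript's `usize`, and everything
// beyond the constructor is a hypothetical addition for illustration.
class MemSlice {
  constructor(readonly offset: number, readonly length: number) {}

  // Offset one past the last byte of the slice.
  get end(): number {
    return this.offset + this.length;
  }

  // View of the slice's bytes inside `memory`, without copying.
  view(memory: Uint8Array): Uint8Array {
    return memory.subarray(this.offset, this.end);
  }
}

// Hypothetical helper: write a string's UTF-8 bytes at `offset` and
// return a slice describing exactly where they are and how long they are.
function writeUTF8(memory: Uint8Array, offset: number, str: string): MemSlice {
  const bytes = new TextEncoder().encode(str);
  memory.set(bytes, offset);
  return new MemSlice(offset, bytes.length);
}

const memory = new Uint8Array(64);
const slice = writeUTF8(memory, 8, "a\0b");
const text = new TextDecoder().decode(slice.view(memory));
```

Since offset and length are carried together, the encoder can return a single value and the decoder needs no terminator.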

@dcodeIO
Member

dcodeIO commented Dec 23, 2018

Agreed, these APIs aren't ideal. Might even make sense to move them out of the string class to something specifically targeting interop (with C).

@MaxGraey
Member

The UTF8/UTF16 API was improved in this PR
