Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make String.utf8Size() public #258

Open
joffrey-bion opened this issue Feb 12, 2024 · 5 comments
Open

Make String.utf8Size() public #258

joffrey-bion opened this issue Feb 12, 2024 · 5 comments

Comments

@joffrey-bion
Copy link

joffrey-bion commented Feb 12, 2024

I am trying to migrate Krossbow, my STOMP-over-websocket multiplatform library, and I stumbled upon this function becoming private/internal in kotlinx-io.

I relied on it because I need to truncate a string based on its length in UTF-8 bytes (not characters). The use case is the "close reason" (the text payload) of the web socket Close frame. As per the specification:

All control frames MUST have a payload length of 125 bytes or less and MUST NOT be fragmented.

With the utf8Size function, I could optimize the happy path by not copying bytes around if the string was short enough. Without this function, I need to encode the string into a byte array uncondtionally, then check if it's too long, in which case I need to truncate.

I think there are probably more use cases where this is useful information to have without actually copying the bytes.

@joffrey-bion
Copy link
Author

joffrey-bion commented Feb 12, 2024

Alternatively, this method could be directly provided by kotlinx-io instead:

/**
 * Returns the biggest prefix of this string that can be represented with at most [maxLength] UTF-8 bytes.
 *
 * If the truncation occurs in the middle of a multibyte-character, the whole character is removed.
 */
fun String.truncateUtf8BytesLengthTo(maxLength: Int): String {
    if (utf8Size() <= maxLength) {
        return this
    }
    return encodeToByteArray().copyOf(maxLength).decodeToString().trimEnd { it == '\uFFFD' }
}

But that might be a bit too specific to be in the library (I will leave that decision to you of couse).

@fzhinkin
Copy link
Collaborator

fzhinkin commented May 3, 2024

On non-JVM targets a value returned by utf8Size may not be the same as encodeToByteArray().size:

println("🌞".substring(1).utf8Size()) // 1 on all targets
println("🌞".substring(1).encodeToByteArray().size) // 1 on JVM, 3 on all other targets

For invalid UTF-16 characters (like in the example above, where a string consist of only the low surrogate), kotlinx-io always uses ? character as a replacement (the same way JVM does).

However, the proper replacement should be U+FFFD (it requires 3 code units to encode) and Kotlin Stdlib uses it on all platforms other than JVM, and on JVM for compatibility reasons it also uses ?.

There are a few options here:

  • (somehow) confine utf8Size to the Sinks scope, to make everyone know that the returned value makes sense only for Sink.writeString; (probably) mark it as DelicateIoApi, write a comprehensive doc describing what's wrong with it;
  • align writeString and utf8Size behavior with Kotlin Stdlib.

@joffrey-bion
Copy link
Author

joffrey-bion commented May 3, 2024

Ugh! I didn't realize encodeToByteArray() worked this way for incorrect standalone surrogates on JVM 🤯
Little experiments for those interested: https://pl.kotl.in/Yg3GwVJEc.

Kotlin Stdlib uses it

Am I right that we're talking specifically about String.encodeToByteArray() here?

and on JVM for compatibility reasons it also uses ?.

For compatibility with what exactly? This is a Kotlin API, there is no such function in Java's strings.
I find it quite unfortunate that we can't have proper multiplatform tests for this kind of things due to a deliberate inconsistency.

Or do you mean because encoreToByteArray was JVM-only for a time, and it was wrong at that time, and now we can't go back?

For invalid UTF-16 characters (like in the example above, where a string consist of only the low surrogate), kotlinx-io always uses ? character as a replacement (the same way JVM does).

What do you mean by "always uses ? character as a replacement"? Do you mean in writeString, regardless of the platform? It seems strange to use the ? character when writing to sinks to be frank, because there is no way to distinguish this replacement from an actual '?' character in the source string.

I don't think there is any compatibility guarantees kotlinx-io needs to respect for now, given that it's a 0.x, right? Even more so because the string itself is invalid in this state in the first place (heck we could even crash here! 😄).

So IMO the proper way to go would be to change writeString itself to use replacement characters for invalid surrogates like this on all platforms, and that utf8Size() takes this into account and returns 3 on all platforms. Or just make both functions fail on invalid UTF-16 char sequences 🤷

align writeString and utf8Size behavior with Kotlin Stdlib

If we change writeString and utf8Size, I believe we may as well make them behave more correctly than encodeToByteArray. I don't think kotlinx-io needs to be consistent with the stdlib's inconsistency.

@fzhinkin
Copy link
Collaborator

fzhinkin commented May 6, 2024

For compatibility with what exactly?

With how Java encodes strings are being encoded to UTF-8.

What do you mean by "always uses ? character as a replacement"? Do you mean in writeString, regardless of the platform?

Exactly, writeString unconditionally emits ? as a replacement character:

writeByte('?'.code.toByte())

I don't think there is any compatibility guarantees kotlinx-io needs to respect for now, given that it's a 0.x, right?

0.x gives us the freedom to a certain extent, but it would be nice to behave consistently with Kotlin Stdlib as some parts of kotlinx-io may end up there eventually.

So IMO the proper way to go would be to change writeString itself to use replacement characters for invalid surrogates like this on all platforms, and that utf8Size() takes this into account and returns 3 on all platforms.

Your example showcased that even though utf8Size() is kotlinx-io-specific, it's quite tempting to use it in conjunction with String::encodeToByteArray. So, it's better to return a value that'll be correct for both Stdlib and kotlinx-io string conversions.

Or just make both functions fail on invalid UTF-16 char sequences 🤷

Then writeString should fail too (which will be quite unfortunate), as being able to write an ill-formed string but not being able to query its size beforehand seems a bit off.

@joffrey-bion
Copy link
Author

With how Java encodes strings are being encoded to UTF-8.

Ah I see, so rather consistency with Java APIs, but not compatibility in the sense that it would break things if the stdlib had done it differently. To be frank I don't think similar behavior with Java APIs is necessarily something to aim for in cases where the behavior is debatable, but that's just a personal opinion.

Your example showcased that even though utf8Size() is kotlinx-io-specific, it's quite tempting to use it in conjunction with String::encodeToByteArray.

Fair enough. Though this function (like most) assumes a well-formed input string. In that case utf8Size() and encodeToByteArray() behave consistently. The fun part happens when truncating, and decodeToString actually does decode invalid bytes as a replacement char. This is why this implementation is valid for valid strings.

it's better to return a value that'll be correct for both Stdlib and kotlinx-io string conversions

I think the best would be to fix the stdlib's function then :) But even without this, I still believe it's worth being inconsistent with the stdlib if it's more correct to behave this way. Especially in the context of KMP, having different behavior on different platforms is really undesirable.

Or just make both functions fail on invalid UTF-16 char sequences

Then writeString should fail too

Well.. I did write "both functions" 😄 I would be quite ok with that option actually. Not being able to use writeString() with an invalid string is probably acceptable. You can still write any byte you want using other functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants