Make `String.utf8Size()` public #258

joffrey-bion · 2024-02-12T15:42:45Z

I am trying to migrate Krossbow, my STOMP-over-websocket multiplatform library, and I stumbled upon this function becoming private/internal in kotlinx-io.

I relied on it because I need to truncate a string based on its length in UTF-8 bytes (not characters). The use case is the "close reason" (the text payload) of the web socket Close frame. As per the specification:

All control frames MUST have a payload length of 125 bytes or less and MUST NOT be fragmented.

With the utf8Size function, I could optimize the happy path by not copying bytes around if the string was short enough. Without this function, I need to encode the string into a byte array uncondtionally, then check if it's too long, in which case I need to truncate.

I think there are probably more use cases where this is useful information to have without actually copying the bytes.

The text was updated successfully, but these errors were encountered:

joffrey-bion · 2024-02-12T15:45:35Z

Alternatively, this method could be directly provided by kotlinx-io instead:

/**
 * Returns the biggest prefix of this string that can be represented with at most [maxLength] UTF-8 bytes.
 *
 * If the truncation occurs in the middle of a multibyte-character, the whole character is removed.
 */
fun String.truncateUtf8BytesLengthTo(maxLength: Int): String {
    if (utf8Size() <= maxLength) {
        return this
    }
    return encodeToByteArray().copyOf(maxLength).decodeToString().trimEnd { it == '\uFFFD' }
}

But that might be a bit too specific to be in the library (I will leave that decision to you of couse).

fzhinkin · 2024-05-03T09:57:50Z

On non-JVM targets a value returned by utf8Size may not be the same as encodeToByteArray().size:

println("🌞".substring(1).utf8Size()) // 1 on all targets
println("🌞".substring(1).encodeToByteArray().size) // 1 on JVM, 3 on all other targets

For invalid UTF-16 characters (like in the example above, where a string consist of only the low surrogate), kotlinx-io always uses ? character as a replacement (the same way JVM does).

However, the proper replacement should be U+FFFD (it requires 3 code units to encode) and Kotlin Stdlib uses it on all platforms other than JVM, and on JVM for compatibility reasons it also uses ?.

There are a few options here:

(somehow) confine utf8Size to the Sinks scope, to make everyone know that the returned value makes sense only for Sink.writeString; (probably) mark it as DelicateIoApi, write a comprehensive doc describing what's wrong with it;
align writeString and utf8Size behavior with Kotlin Stdlib.

joffrey-bion · 2024-05-03T16:39:33Z

Ugh! I didn't realize encodeToByteArray() worked this way for incorrect standalone surrogates on JVM 🤯
Little experiments for those interested: https://pl.kotl.in/Yg3GwVJEc.

Kotlin Stdlib uses it

Am I right that we're talking specifically about String.encodeToByteArray() here?

and on JVM for compatibility reasons it also uses ?.

For compatibility with what exactly? This is a Kotlin API, there is no such function in Java's strings.
I find it quite unfortunate that we can't have proper multiplatform tests for this kind of things due to a deliberate inconsistency.

Or do you mean because encoreToByteArray was JVM-only for a time, and it was wrong at that time, and now we can't go back?

For invalid UTF-16 characters (like in the example above, where a string consist of only the low surrogate), kotlinx-io always uses ? character as a replacement (the same way JVM does).

What do you mean by "always uses ? character as a replacement"? Do you mean in writeString, regardless of the platform? It seems strange to use the ? character when writing to sinks to be frank, because there is no way to distinguish this replacement from an actual '?' character in the source string.

I don't think there is any compatibility guarantees kotlinx-io needs to respect for now, given that it's a 0.x, right? Even more so because the string itself is invalid in this state in the first place (heck we could even crash here! 😄).

So IMO the proper way to go would be to change writeString itself to use replacement characters for invalid surrogates like this on all platforms, and that utf8Size() takes this into account and returns 3 on all platforms. Or just make both functions fail on invalid UTF-16 char sequences 🤷

align writeString and utf8Size behavior with Kotlin Stdlib

If we change writeString and utf8Size, I believe we may as well make them behave more correctly than encodeToByteArray. I don't think kotlinx-io needs to be consistent with the stdlib's inconsistency.

fzhinkin · 2024-05-06T10:12:11Z

For compatibility with what exactly?

With how Java encodes strings are being encoded to UTF-8.

What do you mean by "always uses ? character as a replacement"? Do you mean in writeString, regardless of the platform?

Exactly, writeString unconditionally emits ? as a replacement character:

kotlinx-io/core/common/src/Utf8.kt

Line 476 in c2d5220

writeByte('?'.code.toByte())

I don't think there is any compatibility guarantees kotlinx-io needs to respect for now, given that it's a 0.x, right?

0.x gives us the freedom to a certain extent, but it would be nice to behave consistently with Kotlin Stdlib as some parts of kotlinx-io may end up there eventually.

So IMO the proper way to go would be to change writeString itself to use replacement characters for invalid surrogates like this on all platforms, and that utf8Size() takes this into account and returns 3 on all platforms.

Your example showcased that even though utf8Size() is kotlinx-io-specific, it's quite tempting to use it in conjunction with String::encodeToByteArray. So, it's better to return a value that'll be correct for both Stdlib and kotlinx-io string conversions.

Or just make both functions fail on invalid UTF-16 char sequences 🤷

Then writeString should fail too (which will be quite unfortunate), as being able to write an ill-formed string but not being able to query its size beforehand seems a bit off.

joffrey-bion · 2024-05-08T01:56:39Z

With how Java encodes strings are being encoded to UTF-8.

Ah I see, so rather consistency with Java APIs, but not compatibility in the sense that it would break things if the stdlib had done it differently. To be frank I don't think similar behavior with Java APIs is necessarily something to aim for in cases where the behavior is debatable, but that's just a personal opinion.

Your example showcased that even though utf8Size() is kotlinx-io-specific, it's quite tempting to use it in conjunction with String::encodeToByteArray.

Fair enough. Though this function (like most) assumes a well-formed input string. In that case utf8Size() and encodeToByteArray() behave consistently. The fun part happens when truncating, and decodeToString actually does decode invalid bytes as a replacement char. This is why this implementation is valid for valid strings.

it's better to return a value that'll be correct for both Stdlib and kotlinx-io string conversions

I think the best would be to fix the stdlib's function then :) But even without this, I still believe it's worth being inconsistent with the stdlib if it's more correct to behave this way. Especially in the context of KMP, having different behavior on different platforms is really undesirable.

Or just make both functions fail on invalid UTF-16 char sequences

Then writeString should fail too

Well.. I did write "both functions" 😄 I would be quite ok with that option actually. Not being able to use writeString() with an invalid string is probably acceptable. You can still write any byte you want using other functions.

fzhinkin added the enhancement label Feb 12, 2024

joffrey-bion mentioned this issue Feb 12, 2024

A trimmed-down API #136

Merged

3 tasks

fzhinkin added the Design label Feb 23, 2024

fzhinkin added the API improvement label Mar 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make `String.utf8Size()` public #258

Make `String.utf8Size()` public #258

joffrey-bion commented Feb 12, 2024 •

edited

joffrey-bion commented Feb 12, 2024 •

edited

fzhinkin commented May 3, 2024

joffrey-bion commented May 3, 2024 •

edited

fzhinkin commented May 6, 2024

joffrey-bion commented May 8, 2024

Make String.utf8Size() public #258

Make String.utf8Size() public #258

Comments

joffrey-bion commented Feb 12, 2024 • edited

joffrey-bion commented Feb 12, 2024 • edited

fzhinkin commented May 3, 2024

joffrey-bion commented May 3, 2024 • edited

fzhinkin commented May 6, 2024

joffrey-bion commented May 8, 2024

Make `String.utf8Size()` public #258

Make `String.utf8Size()` public #258

joffrey-bion commented Feb 12, 2024 •

edited

joffrey-bion commented Feb 12, 2024 •

edited

joffrey-bion commented May 3, 2024 •

edited