Make String.utf8Size() public #258
Comments
Alternatively, this method could be directly provided by

```kotlin
/**
 * Returns the biggest prefix of this string that can be represented with at most [maxLength] UTF-8 bytes.
 *
 * If the truncation occurs in the middle of a multibyte-character, the whole character is removed.
 */
fun String.truncateUtf8BytesLengthTo(maxLength: Int): String {
    if (utf8Size() <= maxLength) {
        return this
    }
    return encodeToByteArray().copyOf(maxLength).decodeToString().trimEnd { it == '\uFFFD' }
}
```

But that might be a bit too specific to be in the library (I will leave that decision to you of course).
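One caveat with trimming `'\uFFFD'` after decoding is that it would also strip a replacement character that genuinely ended the kept prefix. As a rough sketch of an alternative, the prefix can be chosen by walking the UTF-16 chars and budgeting bytes per code point, so no byte array is needed at all. The name `truncateToUtf8Bytes` is hypothetical and not part of kotlinx-io or the stdlib, and budgeting a lone surrogate as 3 bytes (the size of U+FFFD) is an assumption; if a platform emits a single `?` instead, the result is still within the limit.

```kotlin
// Hypothetical helper for illustration; not an existing kotlinx-io or stdlib API.
fun String.truncateToUtf8Bytes(maxBytes: Int): String {
    require(maxBytes >= 0) { "maxBytes must be non-negative" }
    var bytes = 0
    var i = 0
    while (i < length) {
        val c = this[i]
        // A well-formed surrogate pair encodes one supplementary code point (4 bytes).
        val isPair = c.isHighSurrogate() && i + 1 < length && this[i + 1].isLowSurrogate()
        val size = when {
            isPair -> 4
            c.code < 0x80 -> 1
            c.code < 0x800 -> 2
            // Rest of the BMP. Assumption: a lone surrogate is budgeted as U+FFFD (3 bytes).
            else -> 3
        }
        // Stop before the first code point that would exceed the budget.
        if (bytes + size > maxBytes) return substring(0, i)
        bytes += size
        i += if (isPair) 2 else 1
    }
    return this
}
```

Because the cut always falls between code points, there is no partial multibyte sequence to clean up afterwards.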
On non-JVM targets a value returned by
For invalid UTF-16 characters (like in the example above, where a string consists of only the low surrogate),
However, the proper replacement should be
There are a few options here:
Ugh! I didn't realize
Am I right that we're talking specifically about
For compatibility with what exactly? This is a Kotlin API, there is no such function in Java's strings. Or do you mean because
What do you mean by "always uses ? character as a replacement"? Do you mean in
I don't think there are any compatibility guarantees
So IMO the proper way to go would be to change
If we change
With how Java strings are encoded to UTF-8.
Exactly, kotlinx-io/core/common/src/Utf8.kt Line 476 in c2d5220
Your example showcased that even though
Then
Ah I see, so rather consistency with Java APIs, but not compatibility in the sense that it would break things if the stdlib had done it differently. To be frank I don't think similar behavior with Java APIs is necessarily something to aim for in cases where the behavior is debatable, but that's just a personal opinion.
Fair enough. Though this function (like most) assumes a well-formed input string. In that case
I think the best would be to fix the stdlib's function then :) But even without this, I still believe it's worth being inconsistent with the stdlib if it's more correct to behave this way. Especially in the context of KMP, having different behavior on different platforms is really undesirable.
Well.. I did write "both functions" 😄 I would be quite ok with that option actually. Not being able to use
I am trying to migrate Krossbow, my STOMP-over-websocket multiplatform library, and I stumbled upon this function becoming private/internal in `kotlinx-io`.

I relied on it because I need to truncate a string based on its length in UTF-8 bytes (not characters). The use case is the "close reason" (the text payload) of the web socket `Close` frame. As per the specification:

With the `utf8Size` function, I could optimize the happy path by not copying bytes around if the string was short enough. Without this function, I need to encode the string into a byte array unconditionally, then check if it's too long, in which case I need to truncate.

I think there are probably more use cases where this is useful information to have without actually copying the bytes.
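To illustrate the happy path described above, here is a minimal sketch of counting UTF-8 bytes straight from the UTF-16 chars, with no byte array allocated. `utf8ByteCount`, `maxCloseReasonBytes` and `closeReasonFits` are hypothetical names, not the kotlinx-io implementation. The 123-byte limit assumes the constraint in question is RFC 6455's 125-byte cap on control frame payloads, minus the 2-byte status code of a `Close` frame. Counting a lone surrogate as 3 bytes (U+FFFD) is also an assumption, since platforms may encode it differently.

```kotlin
// Hypothetical stand-in for a public utf8Size(); for illustration only.
fun String.utf8ByteCount(): Long {
    var count = 0L
    var i = 0
    while (i < length) {
        val c = this[i]
        when {
            c.code < 0x80 -> count += 1
            c.code < 0x800 -> count += 2
            c.isHighSurrogate() && i + 1 < length && this[i + 1].isLowSurrogate() -> {
                count += 4 // supplementary code point encoded as a surrogate pair
                i++        // skip the low surrogate that was just consumed
            }
            // Rest of the BMP. Assumption: a lone surrogate counts as U+FFFD (3 bytes).
            else -> count += 3
        }
        i++
    }
    return count
}

// Assumed limit: 125-byte control frame payload minus the 2-byte close status code.
val maxCloseReasonBytes = 123

// Happy-path check: true means the reason can be sent as-is, without encoding first.
fun closeReasonFits(reason: String): Boolean =
    reason.utf8ByteCount() <= maxCloseReasonBytes
```

Only when `closeReasonFits` returns false does the caller need to encode and truncate, which is the allocation the issue wants to avoid in the common case.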