Map each code point of an ill-formed UTF-8 subsequence to a replacement character individually #301

Open
fzhinkin opened this issue Apr 25, 2024 · 1 comment

Comments

@fzhinkin
Collaborator

As was pointed out in #290 (comment), kotlinx-io converts different ill-formed UTF-8 subsequences differently: either the whole multi-code-point subsequence is replaced with a single replacement character, or each code point is converted separately:

  • 0xf0 0x89 0x89 <EOF> -> �
  • 0xf0 0x89 0x89 0x89 <EOF> -> �
  • 0xf0 0xf0 0xf0 <EOF> -> ���

The UTF-8 spec allows handling these ill-formed sequences in whatever way we want, as long as errors are somehow reported. However, the current behavior looks inconsistent, and it's hard to reason about how an arbitrary byte sequence will be converted.

We should improve the way ill-formed sequences are handled and stick to an approach adopted by other languages/libraries: convert only ill-formed subsequences consisting of a single byte.

That's how it's done in:

  • Java:
jshell> new String(new byte[]{(byte)0xf0,(byte)0x89,(byte)0x89,(byte)0x89})
$5 ==> "����"
  • Python 3:
>>> b'\xf0\x89\x89\x89'.decode("utf-8", errors='replace')
'����'
  • Go:
fmt.Println(string([]byte{0xf0, 0x89, 0x89, 0x89}))
...

����
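
A quick way to see what a reference decoder does with the three sequences listed at the top of this issue is to count the replacement characters it emits. This is only a sketch using CPython's built-in decoder; `SAMPLES` and `replacement_count` are names made up for this illustration, and `<EOF>` is modeled simply by the byte string ending.

```python
# Byte sequences from the list at the top of the issue; the input just ends after them.
SAMPLES = [
    bytes([0xF0, 0x89, 0x89]),
    bytes([0xF0, 0x89, 0x89, 0x89]),
    bytes([0xF0, 0xF0, 0xF0]),
]


def replacement_count(data: bytes) -> int:
    """Count the U+FFFD characters produced by CPython's lenient UTF-8 decoder."""
    return data.decode("utf-8", errors="replace").count("\ufffd")


for sample in SAMPLES:
    # Prints the bytes in hex followed by the number of replacement characters.
    print(sample.hex(" "), "->", replacement_count(sample))
```

With CPython each byte in these three inputs ends up as its own replacement character, so the counts equal the input lengths.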
@ilya-g
Member

ilya-g commented May 1, 2024

See also the recommendation "U+FFFD Substitution of Maximal Subparts" in https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf
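
For anyone who wants to experiment, here is a rough Python sketch of that recommendation as I read it from Table 3-7 of the same chapter. It is not kotlinx-io code and not a proposed implementation; `decode_maximal_subparts` and `_lead_info` are hypothetical names. The idea: consume a lead byte plus as many continuation bytes as still form a valid prefix, and if the sequence cannot be completed, emit one U+FFFD for that prefix and resume at the first byte that did not fit.

```python
REPLACEMENT = "\ufffd"


def _lead_info(lead: int):
    """Return (sequence length, allowed range of the second byte) for a lead
    byte, or None if the byte can never start a well-formed sequence."""
    if 0xC2 <= lead <= 0xDF:
        return 2, 0x80, 0xBF
    if lead == 0xE0:
        return 3, 0xA0, 0xBF  # excludes overlong encodings
    if lead == 0xED:
        return 3, 0x80, 0x9F  # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return 3, 0x80, 0xBF
    if lead == 0xF0:
        return 4, 0x90, 0xBF  # excludes overlong encodings
    if lead == 0xF4:
        return 4, 0x80, 0x8F  # excludes values above U+10FFFF
    if 0xF1 <= lead <= 0xF3:
        return 4, 0x80, 0xBF
    return None  # 0x80..0xC1 and 0xF5..0xFF


def decode_maximal_subparts(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:  # ASCII
            out.append(chr(lead))
            i += 1
            continue
        info = _lead_info(lead)
        if info is None:  # stray continuation byte or invalid lead byte
            out.append(REPLACEMENT)
            i += 1
            continue
        length, lo, hi = info
        j = i + 1
        # Extend the prefix while the next byte is still allowed.
        while j < i + length and j < n and lo <= data[j] <= hi:
            j += 1
            lo, hi = 0x80, 0xBF  # only the second byte has a restricted range
        if j == i + length:
            out.append(data[i:j].decode("utf-8"))  # complete, well-formed
        else:
            out.append(REPLACEMENT)  # one U+FFFD per maximal subpart
        i = j  # resume at the first byte that did not fit
    return "".join(out)
```

Under this sketch `0xf0 0x89 0x89 0x89` becomes four replacement characters, because `0x89` is not a valid second byte after `0xf0`, while a truncated but otherwise valid prefix such as `0xe2 0x82` becomes a single one.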

@fzhinkin fzhinkin added this to the kotlinx-io stabilization milestone May 6, 2024