Treatment of surrogate code points #119

Nice question.

I think this is more of a doc bug/imprecision than anything else. The issue is that I use the word "codepoint" in a lot of places when the more precise term would be "Unicode scalar value." With that said, the specific part of the docs that describes the "substitution by maximal subparts" behavior is actually precisely correct:

In this strategy, a replacement codepoint is inserted whenever a byte is found that cannot possibly lead to a valid UTF-8 code unit sequence. If there were previous bytes that represented a prefix of a well-formed UTF-8 code unit sequence, then all of those bytes (up to 3) are substituted with a single replacement codepoint.
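To make the quoted rule concrete, here is a minimal sketch that uses only the Rust standard library's `String::from_utf8_lossy` rather than the library under discussion. It assumes the standard library's lossy decoding treats these inputs the same way as the strategy described above; the helper names `lossy` and `replacement_count` are hypothetical and exist only for illustration.

```rust
/// Lossily decode `bytes` as UTF-8 using the standard library.
/// (Hypothetical helper; not part of the library being discussed.)
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Count how many U+FFFD replacement characters the lossy decoding contains.
fn replacement_count(bytes: &[u8]) -> usize {
    lossy(bytes).chars().filter(|&c| c == '\u{FFFD}').count()
}

/// The three bytes that would encode the surrogate U+D800 if surrogates
/// were permitted in UTF-8. After a leading 0xED, the second byte must be
/// in 0x80..=0x9F, so 0xED 0xA0 is not a prefix of any well-formed sequence.
const ENCODED_SURROGATE: &[u8] = b"\xED\xA0\x80";

/// The first three bytes of the four-byte encoding of U+10000
/// (0xF0 0x90 0x80 0x80): a well-formed prefix that is cut short.
const TRUNCATED_PREFIX: &[u8] = b"\xF0\x90\x80";
```

Under that assumption, `replacement_count(ENCODED_SURROGATE)` should be 3, because each byte is rejected on its own, while `replacement_count(TRUNCATED_PREFIX)` should be 1, because the whole well-formed prefix collapses into a single replacement.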

Since surrogate cod…

Answer selected by theodore-s-beers