Support transcoding with replacement #147

Open
lemire opened this issue Jul 12, 2022 · 5 comments

Comments

@lemire
Member

lemire commented Jul 12, 2022

Add functions that transcode even when the input is invalid, by replacing invalid character sequences with a replacement character.
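
To make the request concrete, here is a minimal Python sketch of what "transcoding with replacement" from UTF-8 to UTF-16LE could look like. It is not the simdutf API: the function name `utf8_to_utf16le_with_replacement` is hypothetical, and the sketch simply relies on Python's built-in codecs with `errors="replace"`.

```python
def utf8_to_utf16le_with_replacement(data: bytes) -> bytes:
    """Hypothetical helper: transcode UTF-8 to UTF-16LE, never failing.

    Invalid UTF-8 is replaced by U+FFFD by Python's built-in decoder
    instead of raising an error.
    """
    text = data.decode("utf-8", errors="replace")
    return text.encode("utf-16-le")


# 0xFF can never appear in well-formed UTF-8, so it is replaced.
print(utf8_to_utf16le_with_replacement(b"a\xffb").hex())
```

The point of the sketch is only the contract: the output is always valid UTF-16, whatever the input bytes are.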

@lemire
Member Author

lemire commented Jul 12, 2022

cc @clausecker

@clausecker
Collaborator

For the suggested Unicode rules, see §5.22 and the §3.9 subsection "U+FFFD Substitution of Maximal Subparts" of the Unicode spec.

@lemire
Member Author

lemire commented Jul 19, 2022

Copy paste of some relevant parts...

An increasing number of implementations are adopting the handling of ill-formed subsequences as specified in the W3C standard for encoding to achieve consistent U+FFFD replacements.
Although the Unicode Standard does not require this practice for conformance, the following text describes this practice and gives detailed examples.
D93a Unconvertible offset: An offset in a code unit sequence for which no code unit subsequence starting at that offset is well-formed.
D93b Maximal subpart of an ill-formed subsequence: The longest code unit subsequence starting at an unconvertible offset that is either:
a. the initial subsequence of a well-formed code unit sequence, or
b. a subsequence of length one.
This definition of the maximal subpart is used in describing how far to advance processing when making substitutions: always process at least one code unit, or as many code units as match the beginning of a well-formed character, up to the point where the next code unit would make it ill-formed, that is, an offset is reached that does not continue this partial character.
Or stated more formally:
Whenever an unconvertible offset is reached during conversion of a code unit sequence:
1. The maximal subpart at that offset is replaced by a single U+FFFD.
2. The conversion proceeds at the offset immediately after the maximal subpart.
This practice of substituting maximal subparts can be trivially applied to the UTF-32 or UTF-16 encoding forms, but is primarily of interest when converting UTF-8 strings.
Unless the beginning of an ill-formed subsequence matches the beginning of some well-formed sequence, this practice replaces almost every byte of an ill-formed UTF-8 sequence with one U+FFFD. For example, every byte of a “non-shortest form” sequence (see Definition D92), or of a truncated version thereof, is replaced, as shown in Table 3-8. (The interpretation of “non-shortest form” sequences has been forbidden since the publication of Corrigendum #1.)
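
The maximal-subpart rule quoted above can be sketched in a few lines of Python. This is an illustrative scalar decoder, not simdutf code: the names `decode_utf8_with_replacement` and `_lead_info` are hypothetical, and the byte ranges are the well-formed UTF-8 byte sequences from the Unicode Standard as I understand them.

```python
REPLACEMENT = "\ufffd"


def _lead_info(lead):
    """Return (sequence length, allowed range of the 2nd byte) for a
    multi-byte lead byte, or None if the byte cannot start a sequence."""
    if 0xC2 <= lead <= 0xDF:
        return 2, 0x80, 0xBF
    if lead == 0xE0:
        return 3, 0xA0, 0xBF   # excludes non-shortest forms
    if lead == 0xED:
        return 3, 0x80, 0x9F   # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return 3, 0x80, 0xBF
    if lead == 0xF0:
        return 4, 0x90, 0xBF   # excludes non-shortest forms
    if lead == 0xF4:
        return 4, 0x80, 0x8F   # excludes values above U+10FFFF
    if 0xF1 <= lead <= 0xF3:
        return 4, 0x80, 0xBF
    return None                # continuation byte, C0, C1, F5..FF


def decode_utf8_with_replacement(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # ASCII
            out.append(chr(lead))
            i += 1
            continue
        info = _lead_info(lead)
        if info is None:                     # maximal subpart of length one
            out.append(REPLACEMENT)
            i += 1
            continue
        length, lo, hi = info
        j = i + 1
        # Extend while the bytes still form the start of a well-formed sequence.
        while j < i + length and j < n:
            low, high = (lo, hi) if j == i + 1 else (0x80, 0xBF)
            if not (low <= data[j] <= high):
                break
            j += 1
        if j == i + length:                  # complete, well-formed sequence
            out.append(data[i:j].decode("utf-8"))
        else:                                # one U+FFFD for the maximal subpart
            out.append(REPLACEMENT)
        i = j                                # resume right after the subpart
    return "".join(out)


sample = bytes.fromhex("c0 af e0 80 bf f0 81 82 41")
print([f"U+{ord(c):04X}" for c in decode_utf8_with_replacement(sample)])
```

On the non-shortest-form sample above, each of the eight ill-formed bytes becomes its own U+FFFD, followed by U+0041. A truncated sequence such as E1 80 followed by another lead byte yields a single U+FFFD, because those two bytes match the beginning of a well-formed sequence.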

When a conversion algorithm encounters such unconvertible data, the usual practice is either to throw an exception or to use a defined substitution character to represent the unconvertible data. In the case of conversion to one of the encoding forms of the Unicode Standard, the substitution character is defined as U+FFFD replacement character.
For conversion between different encoding forms of the Unicode Standard, “U+FFFD Substitution of Maximal Subparts” in Section 3.9, Unicode Encoding Forms defines a practice for the use of U+FFFD which is consistent with the W3C standard for encoding. It is useful to apply the same practice to the conversion from non-Unicode encodings to an encoding form of the Unicode Standard.
This practice is more secure because it does not result in the conversion consuming parts of valid sequences as though they were invalid. It also guarantees at least one replacement character will occur for each instance of an invalid sequence in the original text. Furthermore, this practice can be defined consistently for better interoperability between different implementations of conversion.
For full consistency, it is important for conversion implementations to agree on 1) the exact set of well-formed sequences for the source encoding, 2) all of the mappings for valid sequences, and 3) the details of the practice for handling ill-formed sequences.
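
The two options in the excerpt above (throw an exception, or substitute U+FFFD) can be illustrated with Python's built-in UTF-8 codec. The helper names `try_strict` and `with_replacement` are hypothetical, and the sample input is made up for illustration: a truncated three-byte sequence (E2 82) followed by a valid "!".

```python
from typing import Optional


def try_strict(data: bytes) -> Optional[str]:
    """Strict conversion: return None when the input is not valid UTF-8."""
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


def with_replacement(data: bytes) -> str:
    """Lenient conversion: ill-formed input is replaced by U+FFFD."""
    return data.decode("utf-8", errors="replace")


data = b"caf\xe2\x82!"
print(try_strict(data))
print(with_replacement(data))
```

With replacement, the truncated sequence becomes a single U+FFFD and the valid "!" that follows is kept rather than swallowed as part of the error, which is the security property the excerpt describes.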

@amosnier

I guess the "UTF-8 decoder capability and stress test" is relevant here, although I wish it provided a set of pairs { UTF-8 input, expected UTF-32 output } instead of a specification based on visual inspection.

My understanding is that UTF-8 decoding should never stop on error. Instead, it should always signal the error by U+FFFD substitution, and it should always re-synchronize on every valid first byte.
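
A set of { UTF-8 input, expected UTF-32 output } pairs like the one wished for above could be generated with a short script. The sketch below uses CPython's built-in decoder with `errors="replace"` as the reference, on the assumption that it follows the maximal-subpart practice. The function names and the sample inputs are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def expected_code_points(utf8_input: bytes) -> List[int]:
    """Expected UTF-32 output (as code points), with U+FFFD substitution."""
    text = utf8_input.decode("utf-8", errors="replace")
    return [ord(c) for c in text]


def make_vectors(inputs: List[bytes]) -> List[Tuple[bytes, List[int]]]:
    """Pair each raw input with its expected code point sequence."""
    return [(raw, expected_code_points(raw)) for raw in inputs]


# Made-up samples: a stray continuation byte, a valid euro sign,
# a truncated sequence followed by ASCII, and two impossible bytes.
SAMPLE_INPUTS = [
    b"\x41\x80\x42",
    b"\xe2\x82\xac",
    b"\xe2\x82\x41",
    b"\xff\xfe",
]

for raw, cps in make_vectors(SAMPLE_INPUTS):
    print(raw.hex(), "->", " ".join(f"U+{cp:04X}" for cp in cps))
```

The third sample shows the re-synchronization: the truncated E2 82 prefix is replaced by one U+FFFD and decoding resumes at the following "A".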

@lemire
Member Author

lemire commented May 12, 2024

We’ll definitely support replacement in the future!
