Support transcoding with replacement #147

lemire · 2022-07-12T18:20:45Z

Add functions that transcode even when the input is invalid, by replacing invalid character sequences with a replacement character.

lemire · 2022-07-12T18:20:52Z

cc @clausecker

clausecker · 2022-07-19T21:34:34Z

For the rules suggested by Unicode, see §5.22 and the §3.9 subsection "U+FFFD Substitution of Maximal Subparts" of the Unicode spec.

lemire · 2022-07-19T21:44:42Z

Copy paste of some relevant parts...

An increasing number of implementations are adopting the handling of ill-formed subsequences as specified in the W3C standard for encoding to achieve consistent U+FFFD replacements.
Although the Unicode Standard does not require this practice for conformance, the following text describes this practice and gives detailed examples.
D93a Unconvertible offset: An offset in a code unit sequence for which no code unit subsequence starting at that offset is well-formed.
D93b Maximal subpart of an ill-formed subsequence: The longest code unit subsequence starting at an unconvertible offset that is either:
a. the initial subsequence of a well-formed code unit sequence, or
b. a subsequence of length one.
This definition of the maximal subpart is used in describing how far to advance processing when making substitutions: always process at least one code unit, or as many code units as match the beginning of a well-formed character, up to the point where the next code unit would make it ill-formed, that is, an offset is reached that does not continue this partial character.
Or stated more formally:
Whenever an unconvertible offset is reached during conversion of a code unit sequence:
1. The maximal subpart at that offset is replaced by a single U+FFFD.
2. The conversion proceeds at the offset immediately after the maximal subpart.
This practice of substituting maximal subparts can be trivially applied to the UTF-32 or UTF-16 encoding forms, but is primarily of interest when converting UTF-8 strings.
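
To illustrate the UTF-16 case mentioned above, here is a minimal scalar sketch. The function name `utf16_replace_unpaired_surrogates` is hypothetical and not part of simdutf; it only assumes that, in UTF-16, the sole ill-formed code units are unpaired surrogates, so each maximal subpart has length one.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not a simdutf API): copies UTF-16 input and replaces
// every unpaired surrogate code unit with U+FFFD.
std::u16string utf16_replace_unpaired_surrogates(std::u16string_view in) {
  constexpr char16_t kReplacement = 0xFFFD;
  std::u16string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    const char16_t u = in[i];
    const bool is_high = u >= 0xD800 && u <= 0xDBFF;
    const bool is_low = u >= 0xDC00 && u <= 0xDFFF;
    if (is_high && i + 1 < in.size() &&
        in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
      // Well-formed surrogate pair: copy both code units unchanged.
      out.push_back(u);
      out.push_back(in[++i]);
    } else if (is_high || is_low) {
      // Unpaired surrogate: the maximal subpart is this single code unit.
      out.push_back(kReplacement);
    } else {
      // Ordinary BMP code unit.
      out.push_back(u);
    }
  }
  return out;
}
```

The output always has the same number of code units as the input, which is one reason the practice is simple to apply to UTF-16.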
Unless the beginning of an ill-formed subsequence matches the beginning of some well-formed sequence, this practice replaces almost every byte of an ill-formed UTF-8 sequence with one U+FFFD. For example, every byte of a “non-shortest form” sequence (see Definition D92), or of a truncated version thereof, is replaced, as shown in Table 3-8. (The interpretation of “non-shortest form” sequences has been forbidden since the publication of Corrigendum #1.)
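
As a rough illustration of the practice quoted above for UTF-8, here is a scalar sketch that decodes to UTF-32. The function name `utf8_to_utf32_with_replacement` is hypothetical and not an existing simdutf function; the byte ranges follow the well-formed UTF-8 byte sequence table of the Unicode Standard. It is meant as a reference for the semantics, not as a fast implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical helper (not a simdutf API): decodes UTF-8 to UTF-32 and
// replaces each maximal subpart of an ill-formed subsequence with U+FFFD.
std::vector<char32_t> utf8_to_utf32_with_replacement(std::string_view in) {
  constexpr char32_t kReplacement = 0xFFFD;
  std::vector<char32_t> out;
  out.reserve(in.size());
  std::size_t i = 0;
  while (i < in.size()) {
    const std::uint8_t lead = static_cast<std::uint8_t>(in[i]);
    ++i;
    if (lead < 0x80) {  // ASCII
      out.push_back(lead);
      continue;
    }
    int need = 0;                     // continuation bytes expected
    std::uint8_t lo = 0x80, hi = 0xBF;  // allowed range of the next byte
    char32_t cp = 0;
    if (lead >= 0xC2 && lead <= 0xDF) {
      need = 1; cp = lead & 0x1F;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
      need = 2; cp = lead & 0x0F;
      if (lead == 0xE0) lo = 0xA0;  // excludes non-shortest forms
      if (lead == 0xED) hi = 0x9F;  // excludes surrogates
    } else if (lead >= 0xF0 && lead <= 0xF4) {
      need = 3; cp = lead & 0x07;
      if (lead == 0xF0) lo = 0x90;  // excludes non-shortest forms
      if (lead == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
    } else {
      // 0x80..0xC1 and 0xF5..0xFF never start a well-formed sequence:
      // the maximal subpart is this single byte.
      out.push_back(kReplacement);
      continue;
    }
    bool ok = true;
    for (int k = 0; k < need; ++k) {
      if (i >= in.size()) { ok = false; break; }  // truncated at end of input
      const std::uint8_t b = static_cast<std::uint8_t>(in[i]);
      if (b < lo || b > hi) { ok = false; break; }  // leave b for the next round
      cp = (cp << 6) | (b & 0x3F);
      ++i;
      lo = 0x80; hi = 0xBF;  // only the first continuation byte is restricted
    }
    out.push_back(ok ? cp : kReplacement);
  }
  return out;
}
```

With the byte sequence `C0 AF E0 80 BF F0 81 82 41` from Table 3-8, this sketch yields eight U+FFFD followed by U+0041. A byte that ends a partial character is never consumed, so decoding resumes at that byte.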

When a conversion algorithm encounters such unconvertible data, the usual practice is either to throw an exception or to use a defined substitution character to represent the unconvertible data. In the case of conversion to one of the encoding forms of the Unicode Standard, the substitution character is defined as U+FFFD replacement character.
For conversion between different encoding forms of the Unicode Standard, “U+FFFD Substitution of Maximal Subparts” in Section 3.9, Unicode Encoding Forms defines a practice for the use of U+FFFD which is consistent with the W3C standard for encoding. It is useful to apply the same practice to the conversion from non-Unicode encodings to an encoding form of the Unicode Standard.
This practice is more secure because it does not result in the conversion consuming parts of valid sequences as though they were invalid. It also guarantees at least one replacement character will occur for each instance of an invalid sequence in the original text. Furthermore, this practice can be defined consistently for better interoperability between different implementations of conversion.
For full consistency, it is important for conversion implementations to agree on 1) the exact set of well-formed sequences for the source encoding, 2) all of the mappings for valid sequences, and 3) the details of the practice for handling ill-formed sequences.

amosnier · 2024-05-12T20:30:01Z

I guess the "UTF-8 decoder capability and stress test" is relevant here, although I wish it provided a set of pairs { UTF-8 input, expected UTF-32 output } instead of a specification based on visual inspection.

My understanding is that UTF-8 decoding should never stop on an error. Instead, it should always signal the error by U+FFFD substitution, and it should always resynchronize on every valid first byte.

lemire · 2024-05-12T22:15:20Z

We’ll definitely support replacement in the future!

lemire mentioned this issue Feb 10, 2023

For short strings convert_utf8_to_utf16le_with_errors may sometimes cause buffer overflows by reading before the buffer when the input is not UTF-8 #213

Closed

lemire mentioned this issue Sep 5, 2023

Support Latin 1 <= UTF 8 (AVX) #285

Closed

amluto mentioned this issue Sep 14, 2023

validate_utf8_with_errors is insufficienty documented #304

Closed

lemire mentioned this issue Sep 14, 2023

Fix Issue 304 (better documentation for with_errors functions) #306

Merged

p-linnane mentioned this issue Nov 16, 2023

simdutf 4.0.4 Homebrew/homebrew-core#154563

Merged
