`convert_latin1_to_utf8` doesn't accept length field for `utf8_output` pointer #354

Jarred-Sumner · 2023-11-25T08:51:25Z

The current header for `convert_latin1_to_utf8` accepts only one length value, for the Latin1 input, and assumes that `char* utf8_output` is large enough to hold all of the output.

  /**
   * Convert Latin1 string into UTF8 string.
   *
   * This function is suitable to work with inputs from untrusted sources.
   *
   * @param input         the Latin1 string to convert
   * @param length        the length of the string in bytes
   * @param utf8_output   the pointer to buffer that can hold conversion result
   * @return the number of written char; 0 if conversion is not possible
   */
  simdutf_warn_unused virtual size_t convert_latin1_to_utf8(const char * input, size_t length, char* utf8_output) const noexcept = 0;

Since one Latin1 character can become two UTF-8 bytes, the two lengths can only be equal when the input is all ASCII.
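To make the size growth concrete, here is a minimal scalar sketch of how a single Latin1 byte maps to UTF-8. The helper name `encode_latin1_byte` is hypothetical and is not part of simdutf; it only illustrates the encoding rule, not the library's vectorized implementation.

```cpp
#include <cstddef>

// Hypothetical helper (not a simdutf function): encodes one Latin1 byte as UTF-8.
// ASCII bytes (0x00-0x7F) are copied as-is; bytes 0x80-0xFF become two bytes.
// The caller must provide room for at least 2 bytes in `out`.
// Returns the number of bytes written (1 or 2).
inline std::size_t encode_latin1_byte(unsigned char c, char* out) {
  if (c < 0x80) {
    out[0] = static_cast<char>(c);
    return 1;
  }
  out[0] = static_cast<char>(0xC0 | (c >> 6));    // lead byte: 0xC2 or 0xC3
  out[1] = static_cast<char>(0x80 | (c & 0x3F));  // continuation byte: 10xxxxxx
  return 2;
}
```

For example, `'A'` (0x41) stays a single byte, while `é` (0xE9 in Latin1) becomes the two bytes 0xC3 0xA9, so the output is never more than twice the input length.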

It'd be straightforward for the user of simdutf to add an extra pass to compute the length, but wouldn't that usually be slower than checking whether there's enough length after each non-ASCII character is written?

lemire · 2023-11-26T01:50:57Z

The simdutf library has always worked in a two pass model: first compute how much memory is needed, and then we transcode.

      size_t expected_utf8words =
          simdutf::utf8_length_from_latin1(latin1_output.get(), latin1words);
      std::unique_ptr<char[]> utf8_output{ new char[expected_utf8words] };
      // convert to UTF-8
      size_t utf8words = simdutf::convert_latin1_to_utf8(
          latin1_output.get(), latin1words, utf8_output.get());

      size_t expected_latin1words =
          simdutf::latin1_length_from_utf8(source.c_str(), source.size());
      std::unique_ptr<char[]> latin1_output{ new char[expected_latin1words] };
      // convert to latin1
      size_t latin1words = simdutf::convert_utf8_to_latin1(
          source.c_str(), source.size(), latin1_output.get());

> wouldn't that usually be slower than checking if there's enough length after each non-ascii character is written?

Computing the output size is much faster than transcoding.
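As a rough illustration of the two-pass model, the scalar sketch below sizes the output first and then transcodes into an exactly sized buffer. The names `utf8_length_scalar`, `convert_scalar`, and `latin1_to_utf8` are hypothetical stand-ins written for this example; they are not simdutf functions and say nothing about how the library vectorizes either pass. The sizing pass only reads the input and counts bytes with the high bit set, which is one plausible reason it can be cheaper than the pass that also writes output.

```cpp
#include <cstddef>
#include <string>

// Pass 1 (hypothetical scalar stand-in for a length function):
// each input byte with the high bit set needs one extra output byte.
inline std::size_t utf8_length_scalar(const char* input, std::size_t length) {
  std::size_t extra = 0;
  for (std::size_t i = 0; i < length; i++) {
    extra += static_cast<unsigned char>(input[i]) >> 7;
  }
  return length + extra;
}

// Pass 2 (hypothetical scalar stand-in for the conversion):
// assumes `out` has room for utf8_length_scalar(input, length) bytes.
// Returns the number of bytes written.
inline std::size_t convert_scalar(const char* input, std::size_t length,
                                  char* out) {
  char* const start = out;
  for (std::size_t i = 0; i < length; i++) {
    const unsigned char c = static_cast<unsigned char>(input[i]);
    if (c < 0x80) {
      *out++ = static_cast<char>(c);                  // ASCII: copy as-is
    } else {
      *out++ = static_cast<char>(0xC0 | (c >> 6));    // lead byte
      *out++ = static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
    }
  }
  return static_cast<std::size_t>(out - start);
}

// Two-pass wrapper: size the output, then transcode into it.
inline std::string latin1_to_utf8(const std::string& latin1) {
  std::string utf8(utf8_length_scalar(latin1.data(), latin1.size()), '\0');
  const std::size_t written =
      convert_scalar(latin1.data(), latin1.size(), &utf8[0]);
  utf8.resize(written);
  return utf8;
}
```

With this layout the conversion loop never needs a bounds check, because the buffer was sized from the same input beforehand.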

If my answer is not satisfactory, can you elaborate?
