convert_latin1_to_utf8 doesn't accept length field for utf8_output pointer #354

Open
Jarred-Sumner opened this issue Nov 25, 2023 · 1 comment


@Jarred-Sumner

The current header for convert_latin1_to_utf8 accepts only one length value, for the latin1 input, and assumes that char* utf8_output is large enough to hold all of the output.

  /**
   * Convert Latin1 string into UTF8 string.
   *
   * This function is suitable to work with inputs from untrusted sources.
   *
   * @param input         the Latin1 string to convert
   * @param length        the length of the string in bytes
   * @param latin1_output  the pointer to buffer that can hold conversion result
   * @return the number of written char; 0 if conversion is not possible
   */
  simdutf_warn_unused virtual size_t convert_latin1_to_utf8(const char * input, size_t length, char* utf8_output) const noexcept = 0;

Since one latin1 character can become two utf8 bytes, the two lengths are equal only when the input is all ascii.
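To make the size relationship concrete, here is a minimal scalar sketch of the length computation. The helper name `latin1_utf8_length_scalar` is hypothetical and is not part of simdutf; it only illustrates that the UTF-8 size equals the input length plus the number of non-ASCII bytes.

```cpp
#include <cstddef>

// Hypothetical scalar helper (NOT a simdutf function): returns the number of
// UTF-8 bytes needed to encode `length` Latin-1 bytes starting at `input`.
inline std::size_t latin1_utf8_length_scalar(const char* input, std::size_t length) {
    std::size_t extra = 0;
    for (std::size_t i = 0; i < length; ++i) {
        // Bytes 0x80..0xFF take two UTF-8 bytes, ASCII bytes take one,
        // so each byte with its high bit set adds exactly one extra byte.
        extra += static_cast<unsigned char>(input[i]) >> 7;
    }
    return length + extra;
}
```

For an all-ASCII input the result equals `length`; for an input made entirely of bytes in 0x80..0xFF it is `2 * length`, which is also the worst case.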

It'd be straightforward for the user of simdutf to add an extra pass to compute the length, but wouldn't that usually be slower than checking if there's enough length after each non-ascii character is written?
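For illustration, the single-pass alternative described above could look roughly like the scalar sketch below. The function name, the `capacity` parameter, and the `consumed` out-parameter are all assumptions made for this example; simdutf does not expose this signature, and a SIMD implementation would more plausibly check the remaining room per block rather than per character.

```cpp
#include <cstddef>

// Hypothetical bounded variant (NOT a simdutf API): writes at most `capacity`
// bytes to `utf8_output` and returns the number of bytes written.
// `*consumed` receives the number of Latin-1 bytes converted, so the caller
// can detect truncation (consumed < length) and resume with a larger buffer.
inline std::size_t convert_latin1_to_utf8_bounded(const char* input, std::size_t length,
                                                  char* utf8_output, std::size_t capacity,
                                                  std::size_t* consumed) {
    std::size_t in = 0, out = 0;
    while (in < length) {
        const unsigned char c = static_cast<unsigned char>(input[in]);
        const std::size_t need = (c < 0x80) ? 1 : 2;
        if (capacity - out < need) break;  // not enough room for this character
        if (c < 0x80) {
            utf8_output[out++] = static_cast<char>(c);  // ASCII: copied as-is
        } else {
            // U+0080..U+00FF: two-byte sequence 110000xx 10xxxxxx
            utf8_output[out++] = static_cast<char>(0xC0 | (c >> 6));
            utf8_output[out++] = static_cast<char>(0x80 | (c & 0x3F));
        }
        ++in;
    }
    *consumed = in;
    return out;
}
```

With an 8-byte buffer, the 4-byte input "caf\xE9" is fully converted (5 bytes written); with a 4-byte buffer the loop stops before the final character, reporting 3 bytes written and 3 bytes consumed.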

@lemire
Member

lemire commented Nov 26, 2023

The simdutf library has always worked in a two-pass model: first compute how much memory is needed, then transcode.

      // `source` is assumed to be a std::string holding UTF-8 text.
      // Pass 1: compute the Latin-1 output size; pass 2: transcode UTF-8 -> Latin-1.
      size_t expected_latin1words =
          simdutf::latin1_length_from_utf8(source.c_str(), source.size());
      std::unique_ptr<char[]> latin1_output{new char[expected_latin1words]};
      // convert to latin1
      size_t latin1words = simdutf::convert_utf8_to_latin1(
          source.c_str(), source.size(), latin1_output.get());

      // Same two-pass pattern for Latin-1 -> UTF-8.
      size_t expected_utf8words =
          simdutf::utf8_length_from_latin1(latin1_output.get(), latin1words);
      std::unique_ptr<char[]> utf8_output{new char[expected_utf8words]};
      // convert to UTF-8
      size_t utf8words = simdutf::convert_latin1_to_utf8(
          latin1_output.get(), latin1words, utf8_output.get());

> wouldn't that usually be slower than checking if there's enough length after each non-ascii character is written?

Computing the output size is much faster than transcoding.

If my answer is not satisfactory, can you elaborate?
