Support UTF8 sequence length #44

Open
thomcc opened this issue Mar 10, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@thomcc
Contributor

thomcc commented Mar 10, 2020

I think it would be good for bstr to expose a free function that provides decoding-specific information about what a given byte means in the context of UTF-8.

Accessing this info is low level, but it has various use cases. Some examples: finding a place to start parsing from given an index, or finding a legal cutoff position if you need to truncate a buffer. (Let me know if you want more cases; I feel like I run into this a fair bit when working with partially invalid UTF-8.)

Specifically, something like this:

// If `b` is the leading byte of a UTF-8 sequence,
// returns `Some(sequence_len)`. Returns `None` in all other cases.
pub fn utf8_sequence_len(b: u8) -> Option<usize>;

Or... maybe. I'd kinda like to distinguish between valid-but-not-leading bytes and always-invalid bytes. Returning an enum, maybe? Thoughts and bikeshedding welcome. I think in practice this would be useful, but I also want to keep things small and simple.
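To make the two shapes concrete, here is a minimal sketch of the enum variant with the `Option` version layered on top. The names `Utf8ByteKind` and `classify_utf8_byte` are hypothetical and are not part of bstr; the byte ranges follow the well-formed UTF-8 table, which is an assumption about what the proposed function would accept.

```rust
/// Hypothetical classification of a single byte with respect to UTF-8.
/// None of these names exist in bstr; they only illustrate the idea.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Utf8ByteKind {
    /// A byte that can start a sequence of this total length (1 to 4).
    Leading(usize),
    /// A continuation byte (0x80..=0xBF): valid, but never first.
    Continuation,
    /// A byte that never appears in well-formed UTF-8.
    Invalid,
}

/// Classify `b` by looking at this one byte only.
pub fn classify_utf8_byte(b: u8) -> Utf8ByteKind {
    match b {
        0x00..=0x7F => Utf8ByteKind::Leading(1),
        0x80..=0xBF => Utf8ByteKind::Continuation,
        0xC2..=0xDF => Utf8ByteKind::Leading(2),
        0xE0..=0xEF => Utf8ByteKind::Leading(3),
        0xF0..=0xF4 => Utf8ByteKind::Leading(4),
        // 0xC0 and 0xC1 (always overlong) and 0xF5..=0xFF.
        _ => Utf8ByteKind::Invalid,
    }
}

/// The `Option` form from the signature above, derived from the enum.
pub fn utf8_sequence_len(b: u8) -> Option<usize> {
    match classify_utf8_byte(b) {
        Utf8ByteKind::Leading(len) => Some(len),
        _ => None,
    }
}
```

With this split, callers who only care about leading bytes use `utf8_sequence_len`, and callers who need to tell a continuation byte from a byte that can never be valid use the enum.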


That said, I do feel strongly that these should not be methods on ByteSlice like ByteSlice::is_char_boundary(&self, index: usize) -> bool and ByteSlice::utf8_sequence_len(&self, index: usize) -> Option<usize> (mentioning this mostly because I suggested them in #42) -- I think those two would be very confusing in practice:

  • ByteSlice::is_char_boundary would have to return different results from str::is_char_boundary even for a fully valid UTF-8 byte slice (example: index == len). Having the caller get the byte in question avoids this issue. (Renaming it doesn't really solve this problem -- it still seems like it could cause confusion if 0/len are not considered boundaries.)

  • ByteSlice::utf8_sequence_len(&self, idx) could behave in too many ways -- specifically, it's unclear whether it only reads self[idx] or whether it also considers nearby bytes (e.g. if self[idx] is not a leading byte). Making it a top-level function that only takes a u8 removes this ambiguity -- there is reasonably only one thing it could do.
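As an illustration of the "caller gets the byte" approach, here is a rough sketch of the "find a place to start parsing from a given index" use case built on a byte-only function. `utf8_sequence_len` is the hypothetical function proposed above, reimplemented locally; `is_continuation` and `parse_start` are made-up helper names, and how neighboring bytes are consulted is entirely the caller's choice here.

```rust
/// Hypothetical byte-only function from the proposal (not a bstr API).
fn utf8_sequence_len(b: u8) -> Option<usize> {
    match b {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

/// True for continuation bytes (bit pattern 10xxxxxx).
fn is_continuation(b: u8) -> bool {
    (b & 0xC0) == 0x80
}

/// Returns an index <= `idx` from which parsing can start without
/// beginning in the middle of a sequence. Falls back to `idx` itself
/// when no nearby leading byte claims the byte at `idx`.
fn parse_start(bytes: &[u8], idx: usize) -> usize {
    if idx >= bytes.len() {
        return bytes.len();
    }
    let mut i = idx;
    // A sequence is at most 4 bytes long, so step back at most 3 bytes.
    while idx - i < 3 && i > 0 && is_continuation(bytes[i]) {
        i -= 1;
    }
    match utf8_sequence_len(bytes[i]) {
        // The candidate sequence would cover `idx`: start at its lead byte.
        Some(len) if i + len > idx => i,
        // Stray continuation or invalid byte: just start at `idx`.
        _ => idx,
    }
}
```

Note that the caller explicitly decides what index == len means (here it is treated as a valid starting point), which is exactly the decision a ByteSlice method would have had to make on the caller's behalf.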

@BurntSushi
Owner

This seems reasonable to me, although I think it comes with a pretty big caveat: in the context of bstr, the value returned by this function is merely a hint. There is of course no guarantee that that number of bytes actually follows b in the original slice (unlike in the case of &str).

I think it would help a lot if the docs for this contained a condensed example derived from a real use case, in order to help folks understand when they might want to use this.
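One possible shape for such a condensed example, based on the truncation use case mentioned in the issue, is sketched below. This is only an illustration under assumptions: `utf8_sequence_len` is the hypothetical function from the proposal, reimplemented locally, and `truncation_point` is a made-up helper name. It treats the returned length as a hint and never reads past the cutoff.

```rust
/// Hypothetical byte-only function from the proposal (not a bstr API).
fn utf8_sequence_len(b: u8) -> Option<usize> {
    match b {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

/// Returns a cutoff <= `max_len` that does not fall inside a sequence
/// announced by a nearby leading byte. The announced length is only a
/// hint: the bytes may be invalid, so this is conservative, not exact.
fn truncation_point(bytes: &[u8], max_len: usize) -> usize {
    if max_len >= bytes.len() {
        return bytes.len();
    }
    // A sequence is at most 4 bytes, so a lead byte that straddles the
    // cutoff can be at most 3 bytes before it.
    let lowest = max_len.saturating_sub(3);
    for start in (lowest..max_len).rev() {
        if let Some(len) = utf8_sequence_len(bytes[start]) {
            if start + len > max_len {
                // The announced sequence would be split: cut before it.
                return start;
            }
            // The nearest leading byte ends at or before the cutoff.
            break;
        }
    }
    max_len
}
```

For a buffer holding `"héllo"`, a limit of 2 bytes would land between the two bytes of `é`, so the sketch backs up to 1; a limit of 3 is already on a boundary and is returned unchanged.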
