
ByteSlice should offer more functions taking u8 #12

Open
thomcc opened this issue Jul 15, 2019 · 10 comments
Labels
enhancement New feature or request

Comments

@thomcc
Contributor

thomcc commented Jul 15, 2019

I'm using bstr to help parse a few formats; some of them are text (but not necessarily UTF-8) and some of them are binary but contain strings.

I have a few things that I wished were present:

  • The various split functions should have variants for passing a byte instead of just str/char. I ended up using split_str(b"\0") and split_str(b"\xff") a few times which is going to be less efficient than directly invoking memchr.

  • Versions of fields_with/trim_start_with/trim_end_with which pass their function the byte instead, and don't bother with UTF-8 decoding.

It seems possible that you're more interested in this being useful for the probably-text case than for the binary case (e.g. the emphasis is on the str, and not the b). If that's the case, sorry for this and the next bug I'm going to file!

@BurntSushi
Owner

The various split functions should have variants for passing a byte instead of just str/char. I ended up using split_str(b"\0") and split_str(b"\xff") a few times which is going to be less efficient than directly invoking memchr.

Are you certain? Have you benchmarked it? If you search (or split, in this case) with a string containing exactly one byte, then memchr is exactly what's invoked.

Versions of fields_with/trim_start_with/trim_end_with which pass their function the byte instead, and don't bother with UTF-8 decoding.

I think I'd be okay with these. Probably call them fields_with_byte? If you want to put up a PR, that'd be great; otherwise I'll probably eventually get to it.

It seems possible that you're more interested in this being useful for probably-text case than for (e.g. the emphasis is on the str, and not the b)

Possibly. Its focus is certainly on treating bytes as if they were UTF-8, but there are plenty of functions that don't care about UTF-8 at all (e.g., find).

I'm generally not opposed to loading up the API with more methods, but each should definitely be considered on their own merit. I think this crate is still finding its footing with respect to what the right API should be, so more issues are welcome.

@BurntSushi
Owner

@thomcc Oh, I also meant to ask, I'm always looking to understand use cases better. Could you share more about what you're using bstr for, if possible? (As in, what high level problem are you trying to solve.) If not, no worries!

@thomcc
Contributor Author

thomcc commented Jul 15, 2019

Are you certain? Have you benchmarked it? If you search (or split, in this case) with a string containing exactly one byte, then memchr is exactly what's invoked.

Ah, my bad 😅. I read the code instead of running benchmarks, but missed the special case that just calls find_byte. Embarrassing.

In that case, disregard that one. It feels a little weird, but it's not worth adding single-byte variants of everything.

Possibly. Its focus is certainly on treating bytes as if they were UTF-8, but there are plenty of functions that don't care about UTF-8 at all (e.g., find).

I see. Honestly, I'm mostly using it as a feature-rich &[u8], plus the ability to more easily debug byte strings that are output. However, what text there is is mostly at least ASCII, and there's enough of it that the debug output is generally more useful than if it were all hex.

(That said, a lot of the API I'm not currently using interests me a lot; in particular, the explicit control over UTF-8 decoding is vastly more convenient than creating a one-off .chars() iterator.)

I think I'd be okay with these. Probably call them, fields_with_byte? If you want to put up a PR that'd be great, otherwise I'll probably eventually get to it.

I'll see. I'm more likely to take #13 because it addresses similar use cases, but can actually be optimized, unlike a generic user callback.

I think this crate is still finding its footing with respect to what the right API should be, so more issues are welcome.

Okay, I had some others too, which I'll file after I give them a little more thought.

@lopopolo
Contributor

@BurntSushi One thing I ran into recently is that the trim family of functions cannot be used to trim invalid UTF-8 bytes, since it is not possible to distinguish the replacement character used as a sentinel from an actual replacement character in the input.

I believe Option<char> is the same size as char, and based on how I'm using APIs like ByteSlice::char_indices, I don't think ergonomics would suffer. Is altering char-oriented APIs to return Option<char> feasible?

@BurntSushi
Owner

@lopopolo Sorry, I'm having trouble parsing your comment. The first part seems to be discussing a deficiency with the trim routines while the second part seems to be referring to the return type of other routines. (Which routines?) I'm unsure how these are connected and don't otherwise quite understand the second part.

The trim routines seem fairly string-oriented to me. It seems a little weird to want to trim by something other than codepoints. Can you say what your use case is? If it's something reasonable, then I'm not opposed to adding trim_byte_with/trim_byte_start_with/trim_byte_end_with, I guess.

@lopopolo
Contributor

lopopolo commented Dec 14, 2020

All functions that either yield char or pass char to a closure use the Unicode replacement character as a sentinel that indicates there are invalid UTF-8 bytes at the current position in the bytestring, but this codepoint is ambiguous with an actual Unicode replacement character appearing in the bytestring:

  • ByteSlice::fields_with
  • ByteSlice::chars
  • ByteSlice::char_indices
  • ByteSlice::trim_with
  • ByteSlice::trim_start_with
  • ByteSlice::trim_end_with

For APIs like ByteSlice::chars and the trim family of APIs, it is not possible to distinguish between the Unicode replacement character and an invalid UTF-8 byte sequence. See eafb495 for a spot where this ambiguity has caused a bug.

One example API I'd like to implement is String#chop. For conventionally UTF-8 strings, String#chop should remove the last UTF-8 character or, if the string ends in an invalid UTF-8 byte sequence, the last byte. Being able to distinguish between a literal Unicode replacement character and invalid bytes would allow for code like:

use bstr::ByteSlice;
let buf = &b"xyz\xFF"[..];
enum State {
    None,
    TrimUtf8Char,
    InvalidUtf8,
}
let mut state = State::None;
let chopped = buf.trim_end_with(|ch: Option<char>| match (&mut state, ch) {
    (state @ State::None, None) => {
        *state = State::InvalidUtf8;
        false
    }
    (state @ State::None, Some(_)) => {
        *state = State::TrimUtf8Char;
        true
    }
    _ => false,
});
let buf = match state {
    State::None => buf,
    State::TrimUtf8Char => chopped,
    State::InvalidUtf8 => &buf[..buf.len() - 1],
};

Instead, using ByteSlice::char_indices is the only way to distinguish the Unicode replacement character from an invalid UTF-8 byte sequence, so the code to implement this function looks like:

use bstr::ByteSlice;

const REPLACEMENT_CHARACTER_BYTES: [u8; 3] = [239, 191, 189];

let buf = &b"xyz\xFF"[..];

let truncate_to = if let Some((start, end, ch)) = buf.char_indices().rev().next() {
    match ch {
        char::REPLACEMENT_CHARACTER if buf[start..end] == REPLACEMENT_CHARACTER_BYTES[..] => {
            // A literal Unicode replacement character is present in the
            // string, so remove the entire character.
            start
        }
        // Invalid UTF-8, pop the last byte only.
        char::REPLACEMENT_CHARACTER => buf.len() - 1,
        // A valid UTF-8 character was found, so remove the entire
        // character.
        _ => start,
    }
} else {
    0
};
let buf = &buf[..truncate_to];

I'm not sure which code I like better, but I end up not being able to use any of the char-oriented APIs because each invalid UTF-8 byte is considered a "character" for conventionally UTF-8 Ruby Strings.

The Ruby API behaves like this:

[2.6.6] > s = "abc\xFF"
=> "abc\xFF"
[2.6.6] > s.chop
=> "abc"
[2.6.6] > s = "abc"
=> "abc"
[2.6.6] > s.chop
=> "ab"
[2.6.6] > s = "abc�"
=> "abc�"
[2.6.6] > s.chop
=> "abc"
[2.6.6] > s = "💎"
=> "💎"
[2.6.6] > s.chop
=> ""

@BurntSushi
Owner

@lopopolo Thanks for responding! There's a lot to unpack here I think...

So the first thing that sticks out to me is that the trim family of functions doesn't really seem appropriate for implementing chop in the first place: chop removes only a single character, while the trim functions can remove an arbitrary number of characters.

As for Option<char>, while you're correct that char and Option<char> have the same size, that's not the only concern. The main concern there for me is that Option<char> is a lot less ergonomic to deal with. It's not particularly common to care about the precise invalid UTF-8 bytes and it's not particularly common to care whether a replacement codepoint is a literal replacement codepoint or if it was substituted automatically by the library. Now, in my view, it's important to be able to address those cases, but if they aren't ergonomic or are a little more complex to do so, then I'm generally okay with that personally. It would be a significant mistake, IMO, to make chars return Option<char> just so that a small minority of callers could avoid doing a small dance with char_indices.

I'm not sure which code I like better, but I end up not being able to use any of the char-oriented APIs because each invalid UTF-8 byte is considered a "character" for conventionally UTF-8 Ruby Strings.

Yeah that makes things a little tricky, particularly since bstr uses the "substitution of maximal subparts" strategy.

More generally, when I think about use cases for library routines, especially when it's about mild ergonomic improvements, I don't generally put too much weight into "implement another language's standard library API." For the most part, I think that's partially kicking the can down the road.

With all that said, I actually think you're missing the simplest implementation path here. It's not to use char_indices or trim (with the latter being impossible to use correctly for chop I think), but rather, bstr::decode_last_utf8:

fn chop(slice: &[u8]) -> &[u8] {
    let (cp, size) = bstr::decode_last_utf8(slice);
    let cp = match cp {
        Some(cp) => cp,
        None => return &slice[..slice.len().saturating_sub(1)],
    };
    let chopped = &slice[..slice.len()-size];
    if cp == '\n' && chopped.last() == Some(&b'\r') {
        return &chopped[..chopped.len()-1];
    }
    chopped
}

Playground link with assorted test cases: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=6da060e4e9fadcfab8418e1f29a30eb2

Cases like this are why decode_utf8 and decode_last_utf8 exist. They specifically don't do any kind of substitution automatically, and instead expose a flexible API that lets you do virtually anything you want. In the invalid UTF-8 case, for example, you're free to ignore size completely and just chop 1 byte instead of the maximal valid prefix. (That's what the b"abc\xE2\x98" test case covers.)

The chars and char_indices APIs are a bit higher level than this, and because you have a bit of a specialized case that really wants to care about the invalid UTF-8 specifically, you wind up having to unwind a bit of the ergonomics it gives you. And that gets the code all twisted up. But those ergonomics are there for good reasons. Forcing the case analysis of Option<char> on to everyone would stink. (We could add more iterators, but I think that's over-complicating things.)

@lopopolo
Contributor

Thanks for taking the time to address my concerns @BurntSushi. It appears I missed bstr::decode_utf8 and bstr::decode_last_utf8. I've been exclusively looking at the docs for bstr::ByteSlice and I missed these 😊.

More generally, when I think about use cases for library routines, especially when it's about mild ergonomic improvements, I don't generally put too much weight into "implement another language's standard library API."

Understood that this is what I'm trying to do with artichoke and not what you're trying to do with bstr. The additional context that the ByteSlice APIs bias toward assuming bytestrings are mostly valid UTF-8 helps clarify their shape.

@lopopolo
Contributor

Cases like this are why decode_utf8 and decode_last_utf8 exist. They specifically don't do any kind of substitution automatically, and instead expose a flexible API that lets you do virtually anything you want.

Oh wow you are right, these APIs do let me do exactly what I want. I'm a bit ashamed for missing these. Thanks for pointing me in the right direction.

@BurntSushi
Owner

No problem! I guess they could also be routines on ByteSlice. I'm not quite sure why I made them free functions. Although, since they are free functions and accept AsRef<[u8]> as input, they can actually decode codepoints from &strs too. But maybe I should add aliases for them on ByteSlice as well.
