Write escaped string into a buffer #34

Open
lopopolo opened this issue Jan 30, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@lopopolo
Contributor

lopopolo commented Jan 30, 2020

Hi @BurntSushi,

I'm using bstr to turn a Vec<u8>-like structure into debug strings and error messages. Specifically, I'm working on a Ruby implementation. In Ruby, a String is a Vec<u8> with a default UTF-8 encoding and no guarantee that the bytes are actually valid UTF-8.

bstr is the means by which I interpret these byte vectors as UTF-8 the best I can.

The fmt::Debug implementation on &BStr is very close to what I'd like, but I cannot use it because it wraps the escaped string in quotes. I need control of the output since these strings are being put into error messages.

I've put together this function for writing the escaped representation to an arbitrary fmt::Write (cribbing heavily from the fmt::Debug impl on &BStr).

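// Writes `string` to `f` without surrounding quotes: decoded chars are
// escaped with `char::escape_debug`, invalid bytes are written as `\xNN`.
pub fn escape_unicode<T>(mut f: T, string: &[u8]) -> Result<(), WriteError>
where
    T: fmt::Write,
{
    let buf = bstr::B(string);
    for (start, end, ch) in buf.char_indices() {
        // Invalid sequences decode to U+FFFD. A validly encoded U+FFFD also
        // takes this branch and is hex-escaped byte by byte.
        if ch == '\u{FFFD}' {
            for byte in buf[start..end].as_bytes() {
                // Always emit two hex digits per byte.
                write!(f, r"\x{:02X}", byte)?;
            }
        } else {
            write!(f, "{}", ch.escape_debug())?;
        }
    }
    Ok(())
}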

Here's an example usage:

let mut message = String::from("undefined group name reference: \"");
string::escape_unicode(&mut message, name)?;
message.push('"');
Err(Exception::from(IndexError::new(interp, message)))

I'm trying to generate a message like this:

$ ruby -e 'm = /(.)/.match("a"); m["abc-\xFF"]'
Traceback (most recent call last):
	1: from -e:1:in `<main>'
-e:1:in `[]': undefined group name reference: "abc-\xFF" (IndexError)

Is this patch something you would consider upstreaming?
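
For readers without bstr at hand, here is a minimal std-only sketch of the same idea that reproduces the target message above. The names `escape_into` and `group_name_error` are hypothetical and are not part of bstr or of the function proposed in this issue. The sketch finds invalid bytes with `std::str::from_utf8` instead of `char_indices`, so a validly encoded U+FFFD is not hex-escaped here.

```rust
use std::fmt::{self, Write};
use std::str;

/// Hypothetical std-only helper (not bstr API): writes `bytes` to `out`,
/// escaping valid UTF-8 with `char::escape_debug` and every byte of an
/// invalid sequence as `\xNN`. No surrounding quotes are written.
fn escape_into<W: Write>(out: &mut W, mut bytes: &[u8]) -> fmt::Result {
    while !bytes.is_empty() {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                // The remainder is entirely valid UTF-8.
                for ch in valid.chars() {
                    write!(out, "{}", ch.escape_debug())?;
                }
                break;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // `valid_up_to` guarantees this prefix is valid UTF-8.
                let valid = str::from_utf8(valid).expect("prefix is valid UTF-8");
                for ch in valid.chars() {
                    write!(out, "{}", ch.escape_debug())?;
                }
                // `error_len` is `None` when the input ends mid-sequence.
                let bad_len = err.error_len().unwrap_or(rest.len());
                for byte in &rest[..bad_len] {
                    write!(out, "\\x{:02X}", byte)?;
                }
                bytes = &rest[bad_len..];
            }
        }
    }
    Ok(())
}

/// Hypothetical caller that builds the error message with its own quotes.
fn group_name_error(name: &[u8]) -> Result<String, fmt::Error> {
    let mut message = String::from("undefined group name reference: \"");
    escape_into(&mut message, name)?;
    message.push('"');
    Ok(message)
}
```

Calling `group_name_error(b"abc-\xFF")` yields `undefined group name reference: "abc-\xFF"`, with the quotes supplied by the caller rather than by the escaping routine.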

@BurntSushi
Owner

This looks reasonableish, yes. I'd like to see its API cleaned up a bit. Namely:

  1. It looks like it should be named escape_debug instead of escape_unicode? Namely, escape_unicode in std converts everything to Unicode escapes.
  2. I think it should be named escape_debug_to since it writes to a fmt::Write. This leaves the door open to adding escape_debug implementations that mirror std, but this doesn't need to be in the initial PR.
  3. Add docs along with an example, consistent with the rest of the API. :-)
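
To illustrate the naming point in item 1, here is a small sketch of how the two std methods differ on the same input. The wrapper function names are made up for this example; only `str::escape_debug` and `str::escape_unicode` are real std APIs.

```rust
/// Illustrative wrapper: escapes only what `Debug` output would escape
/// (for example newlines and quotes), leaving printable chars as they are.
fn debug_escaped(s: &str) -> String {
    s.escape_debug().to_string()
}

/// Illustrative wrapper: turns every char into a `\u{...}` escape.
fn unicode_escaped(s: &str) -> String {
    s.escape_unicode().to_string()
}
```

For the input `"a-\n"`, `debug_escaped` gives `a-\n` (a backslash followed by `n`), while `unicode_escaped` gives `\u{61}\u{2d}\u{a}`. The function in this issue behaves like the former, which is why `escape_debug` is the closer name.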

Thanks for the good idea!

@BurntSushi BurntSushi added the enhancement New feature or request label Jan 30, 2020
@lopopolo
Contributor Author

Thanks. I’ll work on a PR tonight.

lopopolo added a commit to lopopolo/bstr that referenced this issue Feb 2, 2020
Add a method to the `ExtSlice` trait that supports writing the `BStr`
escaped representation into a `fmt::Write`. This enables extracting the
escaped `String` into a buffer without going through `fmt::Debug`.

The written `String` does not contain the surrounding quotes present in
the `fmt::Debug` implementation.

Fixes BurntSushiGH-34.
lopopolo added a commit to lopopolo/bstr that referenced this issue Feb 17, 2020
Add a method to the `ExtSlice` trait that supports writing the `BStr`
escaped representation into a `fmt::Write`. This enables extracting the
escaped `String` into a buffer without going through `fmt::Debug`.

The written `String` does not contain the surrounding quotes present in
the `fmt::Debug` implementation.

Fixes BurntSushiGH-34.

This change reimplements `fmt::Debug` for `BStr` with
`ExtSlice::escape_debug_into`.
@BurntSushi
Owner

Apologies for leading you down the wrong path here, but as noted in #37, I think we should add APIs that mirror std for this as closely as possible. In particular, we should be able to have an escape_debug method that returns an iterator of char values corresponding to the escaped output. The iterator itself can implement fmt::Write for ergonomics.

This is harder to implement, but I think looking at std should give some inspiration. Note that there is an important difference between bstr and std here. std has an escape_debug impl for char, and since a str is just a sequence of encoded chars, its str::escape_debug method can simply defer to the char implementation. We can't really do that in bstr, so the implementation will need to be a bit different.
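
As a rough illustration of the shape described here, the sketch below is a std-only iterator over the escaped chars of a byte slice. Everything in it is an assumption made for the example: the `EscapeDebug` struct, the `escape_debug` and `decode_one` functions, and the choice to implement `fmt::Display` on the iterator (which is what std's own escape iterators do). It is not bstr's API or implementation.

```rust
use std::collections::VecDeque;
use std::fmt::{self, Write};
use std::str;

/// Hypothetical iterator over the escaped form of a byte string.
#[derive(Clone, Debug)]
struct EscapeDebug<'a> {
    rest: &'a [u8],
    pending: VecDeque<char>, // escaped chars not yet yielded
}

fn escape_debug(bytes: &[u8]) -> EscapeDebug<'_> {
    EscapeDebug { rest: bytes, pending: VecDeque::new() }
}

/// Decodes one scalar value from the front of non-empty `bytes`. Returns the
/// char (`None` if the front is invalid) and the number of bytes consumed.
fn decode_one(bytes: &[u8]) -> (Option<char>, usize) {
    // A UTF-8 sequence is at most 4 bytes long.
    let window = &bytes[..bytes.len().min(4)];
    let (valid_len, bad_len) = match str::from_utf8(window) {
        Ok(s) => (s.len(), 0),
        Err(e) => (e.valid_up_to(), e.error_len().unwrap_or(window.len())),
    };
    let first = str::from_utf8(&window[..valid_len])
        .ok()
        .and_then(|s| s.chars().next());
    match first {
        Some(ch) => (Some(ch), ch.len_utf8()),
        None => (None, bad_len),
    }
}

impl Iterator for EscapeDebug<'_> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        if self.pending.is_empty() && !self.rest.is_empty() {
            let (decoded, len) = decode_one(self.rest);
            let (head, tail) = self.rest.split_at(len);
            self.rest = tail;
            match decoded {
                // Valid chars can defer to the std `char` implementation.
                Some(ch) => self.pending.extend(ch.escape_debug()),
                // Invalid bytes have no `char`, so they are hex-escaped here.
                None => {
                    for byte in head {
                        self.pending.extend(format!("\\x{:02X}", byte).chars());
                    }
                }
            }
        }
        self.pending.pop_front()
    }
}

impl fmt::Display for EscapeDebug<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for ch in self.clone() {
            f.write_char(ch)?;
        }
        Ok(())
    }
}
```

With this sketch, `escape_debug(b"abc-\xFF").to_string()` and `escape_debug(b"abc-\xFF").collect::<String>()` both give `abc-\xFF`. The `VecDeque` is only there for brevity; a real implementation would likely use a small fixed-size state machine to avoid allocating.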

@Michael-J-Ward

I'm sharing this because I believe it's a step towards @BurntSushi's proposed solution (it just needs a mapping from DebugItem to an Iterator<Item = char>), and it is also useful for those who want a non-escaped debug string.

  enum DebugItem<'a> {
      NullByte,
      Escaped(core::char::EscapeDebug),
      HexedChar(char),
      HexedBytes(&'a [u8]),
  }

  impl<'a> std::fmt::Display for DebugItem<'a> {
      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
          match self {
              DebugItem::NullByte => write!(f, "\\0"),
              DebugItem::Escaped(escaped) => write!(f, "{}", escaped),
              DebugItem::HexedBytes(bytes) => {
                  for &b in bytes.as_bytes() {
                      write!(f, r"\x{:02X}", b)?;
                  }
                  Ok(())
              },
              DebugItem::HexedChar(ch) => write!(f, "\\x{:02x}", *ch as u32),
              
          }
      }
  }

  fn iter_debug_items<'a>(debug_str: &'a BStr) -> impl Iterator<Item = DebugItem<'a>> {
      debug_str.char_indices()
          .map(|(s, e, ch)| {
              match ch {
                  '\0' => DebugItem::NullByte,
                  '\u{FFFD}' => {
                      let bytes = debug_str[s..e].as_bytes();
                      if bytes == b"\xEF\xBF\xBD" {
                          DebugItem::Escaped(ch.escape_debug())
                      } else {
                          DebugItem::HexedBytes(bytes)
                      }
                  }
                  // ASCII control characters except \0, \n, \r, \t
                  '\x01'..='\x08'
                  | '\x0b'
                  | '\x0c'
                  | '\x0e'..='\x1f'
                  | '\x7f' => {
                      DebugItem::HexedChar(ch)
                  }
                  '\n' | '\r' | '\t' | _ => {
                      DebugItem::Escaped(ch.escape_debug())
                  }
              }
          })
  }

  impl fmt::Debug for BStr {
      #[inline]
      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
          write!(f, "\"")?;
          for item in iter_debug_items(self) {
              write!(f, "{}", item)?;
          }
          write!(f, "\"")?;
          Ok(())
      }
  }

3 participants