Ergonomics of String handling #26

Open
kbknapp opened this issue Mar 24, 2018 · 15 comments
Labels
A-ergonomics Area: Ease of solving CLI related problems (e.g. filesystem interactions) C-tracking-issue Category: A tracking issue for an unstable feature

Comments

@kbknapp
Contributor

kbknapp commented Mar 24, 2018

From the survey, string ergonomics are huge. Many, many CLI applications deal heavily with strings. In Rust, strings can be...difficult.

Granted, (IMO) Rust handles them correctly, but sometimes the correctness doesn't actually matter for a given problem domain and just adds unnecessary gyration.

For example, we have:

  • (&)('static)str
  • Cow
  • (&)String
  • (&)OsStr
  • (&)OsString

Understanding it all can be overwhelming. My personal opinion is we should first tackle/discuss the ergonomics of using OsStr(ing), as it's heavily used on Linux (where paths may not contain valid UTF-8).

IMO OsStr should have the same user experience as &str/String.

We could also probably start by either listing known issues/inconsistencies or any current issue links/RFCs on the matter.

@kbknapp
Contributor Author

kbknapp commented Mar 24, 2018

One of the quotes from the survey:

OsStr(ing) lacks starts_with/ends_with/contains APIs. This is a hard problem, I think, but also seems like something we're perfectly capable of solving if we wanted to, and seeing to_string_lossy().starts_with(...) or, worse in some ways, to_str().unwrap().starts_with() looks really bad to me from a compatibility standpoint. Like, why am I using OsStr if I have to pay this cost anyway?
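To make the complaint concrete, here is a minimal sketch of the two workarounds the quote mentions. The helper names are hypothetical; only `OsStr::to_string_lossy`, `OsStr::to_str`, and `str::starts_with` come from the standard library.

```rust
use std::ffi::OsStr;

/// Hypothetical helper: prefix check via lossy conversion.
/// Invalid sequences become U+FFFD before the comparison, and the
/// conversion may allocate.
fn starts_with_lossy(s: &OsStr, prefix: &str) -> bool {
    s.to_string_lossy().starts_with(prefix)
}

/// Hypothetical helper: prefix check via strict conversion.
/// Returns `None` instead of panicking when `s` is not valid Unicode.
fn starts_with_strict(s: &OsStr, prefix: &str) -> Option<bool> {
    s.to_str().map(|valid| valid.starts_with(prefix))
}
```

Both variants pay for a validity check (and possibly an allocation) just to compare a prefix, which is the cost the quote objects to.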

@BurntSushi

This RFC extends the Pattern API to OsStr: rust-lang/rfcs#2295

@epage
Contributor

epage commented Mar 24, 2018

Is there anything like format! for OsString? If I want to pass a path as a command line argument via process::Command, I either need to build it up manually (joining "--flag" and my path) or convert the path to a String and use format!. So I'm going from OsString to String back to OsString, losing the ability to track files that need the OsString.

@epage
Contributor

epage commented Mar 24, 2018

There was easy_strings. For rapid prototyping, I wonder if we should evaluate making "as easy as python" (e.g. internally clone or Arc everything, reducing borrows) versions of common vocabulary terms. failure::Error and easy_strings are some examples. Paths are another opportunity for this.

Are there any others?

@kbknapp
Contributor Author

kbknapp commented Mar 25, 2018

If I want to pass a path as a command line argument via process::Command, I either need to build it up manually (joining "--flag" and my path) or convert the path to a String and use format!. So I'm going from OsString to String back to OsString, losing the ability to track files that need the OsString.

I agree - but in this particular case your only option is to build it manually, as using format! and going from OsStr->String->OsStr doesn't do anything. If you went to String at all you're either losing invalid UTF-8 (using _lossy()), or panic!ing, meaning you didn't need OsStr in the first place.
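As a rough sketch of "building it manually", the argument can be assembled with `OsString::push`, so the path never passes through `String`. The helper name `flag_with_path` and the `--flag=<path>` layout are assumptions made for illustration.

```rust
use std::ffi::OsString;
use std::path::Path;

/// Hypothetical helper: build `<flag>=<path>` without converting the
/// path to `String`, so non-UTF-8 paths survive unchanged.
fn flag_with_path(flag: &str, path: &Path) -> OsString {
    let mut arg = OsString::from(flag);
    arg.push("=");
    arg.push(path.as_os_str());
    arg
}
```

The result can be handed straight to `std::process::Command::arg`, which accepts anything that implements `AsRef<OsStr>`.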

easy_strings

This looks awesome and I wasn't aware of it! There's also

  • string-wrapper for stack allocated strings.
  • string for the storage backing being generic (and/or stack allocated)

IMO, if there was a single "easy by default" string which I could later tune/drop-in replace with std Strings I'd be all for it. Drop in replace is the hard part though, as all functionality must be replicated (traits, etc.) for the "easy don't care" string.

@jeekobu

jeekobu commented May 18, 2018

There is some discussion of the hurdles of mixing Path/PathBuf with String in rust-lang/rust#49868

@epage
Contributor

epage commented May 28, 2018

There was easy_strings. For rapid prototyping, I wonder if we should evaluate making "as easy as python" (e.g. internally clone or Arc everything, reducing borrows) versions of common vocabulary terms. failure::Error and easy_strings are some examples. Paths are another opportunity for this.

Along these lines, an "ergonomic" string type that supports multiplying against integers, like in Python, would be useful.

CC @Aaronepower
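For illustration, a Python-style `"ab" * 3` could be sketched with a newtype that implements `Mul<usize>`. `EasyString` here is a hypothetical wrapper, not the API of the easy_strings crate; the standard library already offers the underlying operation as `str::repeat`.

```rust
use std::ops::Mul;

/// Hypothetical "easy" string wrapper; not an existing crate's type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct EasyString(String);

impl Mul<usize> for EasyString {
    type Output = EasyString;

    /// Repeat the string `n` times, like `"ab" * 3` in Python.
    fn mul(self, n: usize) -> EasyString {
        EasyString(self.0.repeat(n))
    }
}
```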

@epage
Contributor

epage commented Jul 13, 2018

... I always assumed OsStr let you convert to/from bytes.

  • It turns out it only supports converting to/from str. This means there is no underlying type to use to manipulate them.
  • It doesn't support any kind of way to walk the parts and manipulate them.

@BurntSushi

I always assumed OsStr let you convert to/from bytes.

You can, on Unix only though: https://doc.rust-lang.org/std/os/unix/ffi/trait.OsStrExt.html

On Windows, all you can get are the UTF-16 bytes (which is the raw representation): https://doc.rust-lang.org/std/os/windows/ffi/trait.OsStrExt.html

On Unix, OsStrs are, AIUI, zero cost in that they represent the bytes from the platform as-is. On Windows, OsStrs are always transcoded to and from UTF-16 at the boundaries (with WTF-8 as the internal representation).
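A small sketch of what that platform split looks like in practice. The function name `describe` is made up for this example; `as_bytes` and `encode_wide` are the methods provided by the two `OsStrExt` traits linked above.

```rust
use std::ffi::OsStr;

/// Unix: the raw bytes can be borrowed directly, with no conversion.
#[cfg(unix)]
fn describe(s: &OsStr) -> String {
    use std::os::unix::ffi::OsStrExt;
    format!("{} bytes", s.as_bytes().len())
}

/// Windows: only an iterator of UTF-16 code units is available, and it
/// is produced by re-encoding the internal representation.
#[cfg(windows)]
fn describe(s: &OsStr) -> String {
    use std::os::windows::ffi::OsStrExt;
    format!("{} UTF-16 code units", s.encode_wide().count())
}
```

Any code that wants the underlying data therefore needs one `cfg` branch per platform, as above.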

@BurntSushi

I've written a bit about this topic as it relates to byte strings: https://docs.rs/bstr/0.1.2/bstr/#file-paths-and-os-strings

While it seems like we will eventually get string-like APIs on OsStr, that still won't be enough. Consider the perhaps somewhat common case of trying to match a file path against a regex. The regex machinery cannot know the internal representation of an OsStr, so your only real choice is to lossily convert it to UTF-8 on Windows and use the raw bytes on Unix. But the standard library doesn't make this particular use case easy and currently requires writing platform-specific code.
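The platform-specific code in question might look roughly like the sketch below. `path_bytes` and `contains_bytes` are hypothetical names, and the naive substring search only stands in for a real byte-oriented regex engine; this is not how bstr or the regex crate implement it.

```rust
use std::borrow::Cow;
use std::path::Path;

/// Hypothetical helper: on Unix, borrow the path's raw bytes as-is.
#[cfg(unix)]
fn path_bytes(path: &Path) -> Cow<'_, [u8]> {
    use std::os::unix::ffi::OsStrExt;
    Cow::Borrowed(path.as_os_str().as_bytes())
}

/// Elsewhere (e.g. Windows), fall back to a lossy UTF-8 conversion,
/// which only allocates when the path is not valid Unicode.
#[cfg(not(unix))]
fn path_bytes(path: &Path) -> Cow<'_, [u8]> {
    match path.to_string_lossy() {
        Cow::Borrowed(s) => Cow::Borrowed(s.as_bytes()),
        Cow::Owned(s) => Cow::Owned(s.into_bytes()),
    }
}

/// Naive byte-substring search standing in for a regex match.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}
```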

@BurntSushi

BurntSushi commented Apr 10, 2019

I also spent a little bit of time looking at how other ecosystems handle this. For example, in the Go world, it will lossily convert Windows file paths to UTF-8 at the very lowest levels, so it's impossible to roundtrip file paths on Windows that contain invalid UTF-16 in Go. Nevertheless, despite searching for it, I could find no practical reports of this being a problem. I'm not sure what, if any, conclusions we can draw from that.

@epage
Contributor

epage commented Apr 11, 2019

  1. One could re-implement WTF-8 and re-encode file paths on Windows to WTF-8 by accessing their underlying 16-bit integer representation. Unfortunately, this isn't zero cost (it introduces a second WTF-8 decoding step) and it's not clear this is a good thing to do, since WTF-8 should ideally remain an internal implementation detail.
  2. One could instead declare that they will not handle paths on Windows that are not valid UTF-16, and return an error when one is encountered.
  3. Like (2), but instead of returning an error, lossily decode the file path on Windows that isn't valid UTF-16 into UTF-16 by replacing invalid bytes with the Unicode replacement codepoint.

So I fully admit, this is not an area I've looked into. When I see "lossy" wrt unicode, I expect tofu to be inserted. This makes me concerned that a string search might not match when it should and if I want to add a suffix to a non-UTF-8 file, it'll now look like garbage to the user. Is this accurate? If so, then that is why I'd be interested in (1) ... somehow. If it's not and only non-visible bytes are dropped, then 2/3 sound reasonable but I feel clarifying the behavior could be helpful for people concerned like me.

@BurntSushi

When I see "lossy" wrt unicode, I expect tofu to be inserted.

It's the Unicode replacement codepoint, \uFFFD, which looks like this: . Have you read this section in the docs on handling invalid UTF-8? It goes into a fair bit of detail on how exactly this process is carried out.
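As a quick illustration using only the standard library (not bstr's own routines, whose exact substitution policy is described in its docs), `String::from_utf8_lossy` shows the replacement codepoint being inserted in place of an invalid byte. The wrapper name `decode_lossy` is made up for this example.

```rust
/// Lossily decode bytes, replacing invalid UTF-8 with U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For instance, `decode_lossy(b"foo\xFFbar")` yields `"foo\u{FFFD}bar"`, while valid input passes through unchanged.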

This makes me concerned that a string search might not match when it should

In cases like these, it's generally helpful to construct an example. The case where bstr's approach would fail when searching would have to meet all of these criteria:

  • You come across a file path that is not valid UTF-16 on Windows.
  • The substring you're searching for is itself invalid UTF-16.
  • The substring you're looking for actually occurs within the invalid UTF-16 file path.

Notably, this is Windows only. Unix works fine.

I'll also note that ripgrep has this bug, and I've never gotten a bug report. (Presumably, ripgrep has a substantial user base on Windows, since it ships with VS Code.) Of course, absence of evidence is not evidence of absence, but I'm an engineer, not a theoretician. :-)

if I want to add a suffix to a non-UTF-8 file, it'll now look like garbage to the user. Is this accurate?

Did you see this part of the docs?

On Windows, these conversion routines perform a UTF-8 check and either return an error or lossily decode the file path into valid UTF-8, depending on which function you use. This means that you cannot roundtrip all file paths on Windows correctly using these conversion routines. However, this may be an acceptable downside since such file paths are exceptionally rare. Moreover, roundtripping isn't always necessary, for example, if all you're doing is filtering based on file paths.

Could you say more about what is confusing here?
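To see why lossy decoding cannot roundtrip, consider this platform-independent sketch built on `String::from_utf16_lossy`. It only models the general mechanism; it is not the code path that bstr or the standard library uses for Windows paths, and `lossy_roundtrip` is a hypothetical name.

```rust
/// Decode UTF-16 code units lossily, then re-encode them.
/// Any unpaired surrogate comes back as U+FFFD, so the original
/// code units cannot be recovered.
fn lossy_roundtrip(units: &[u16]) -> Vec<u16> {
    String::from_utf16_lossy(units).encode_utf16().collect()
}
```

Feeding it `[0x0066, 0xD800, 0x006F]` ('f', a lone high surrogate, 'o') returns `[0x0066, 0xFFFD, 0x006F]`, which names a different file than the input did.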

If so, then that is why I'd be interested in (1) ... somehow.

(1) would solve the failure case I described above with respect to substring/regex/glob search, presuming your regex/glob is constructed in such a way to match arbitrary bytes. It's a bit tenuous, but I don't think you can do any better for this specific case, other than perhaps dealing with the u16 UCS-2 code units directly.

However, at least for byte strings, I don't think this solves the roundtrip problem elegantly. The problem is that byte strings are arbitrary bytes, so the conversion from &BStr back to an &OsStr would only be defined for the subset of byte strings that are valid WTF-8, which is of course not a property maintained by byte strings themselves. It's likely that the path manipulation you're after would be better supported by enriching the API of OsStr/OsString itself, because then you can guarantee that all transformations maintain the WTF-8 invariant. (And you avoid the extra roundtrips between byte strings and OS strings, which require checks in both directions.)

With all that said, roundtripping invalid UTF-16 file paths on Windows is a precarious proposition. Consider, for example, a program that has the complex job of merely printing file paths as part of its output. If you are in the unenviable position of needing to deal with invalid UTF-16 file paths, then it's quite possible that you can't even print them correctly as output, because Windows consoles generally barf on anything that isn't valid UTF-16. That's why Rust's standard library will return an error if you attempt to write invalid UTF-8 to stdout. So in Windows, "roundtripping" a file path is really limited to "have a file path, change it in some way, and then use it in file system APIs."

If it's not and only non-visible bytes are dropped, then 2/3 sound reasonable but I feel clarifying the behavior could be helpful for people concerned like me.

I don't think "non-visible" is the correct characterization here. The bytes that are dropped are only meaningful with respect to a specific encoding, whereas "visibility" is really a property of a character itself.

TL;DR - The byte string approach is basically arguing to not handle the case of invalid UTF-16 Windows paths by either lossily transcoding them (in which case roundtripping can subtly fail, but searching generally works, modulo the corner case mentioned above) or by returning an error (that is surfaced to the user). Lossy transcoding is basically the acceptance that these file paths are rare, and that an errant substring search is likely even rarer. Returning an error is basically telling the end user to fix their file paths.

@epage
Contributor

epage commented Apr 20, 2019

So first, before you pointed it out, I did not notice the large section you wrote in bstr's docs on the topic. I was mostly going off what little I've noticed in the stdlib.

And yes, it is important to consider the application: (1) the likelihood of running into problems and (2) the right solution for it.

The concerns in my post were written from remembering the concerns but not fully remembering the application. I've had more time to think on it and my biggest concern is writing my own path library. I have two goals I'm oscillating between: (1) cleanly abstracting the best path-related crates, like bstr does, or (2) providing a rapid prototyping crate. The latter includes the abstraction but is willing to sacrifice performance for making things easy to manipulate.

So from this perspective

  • I need to pass the decision on lossy or error up to the user
  • This means that my only choice is to use OsString and rely on PathBuf to concatenate OsStrings, since there is no way to manipulate WTF-8 without sacrificing either (1) cross-platform support or (2) the user's choice in how to handle it
  • I'd love to experiment with alternative API designs to find a more ergonomic way of handling all of this
    • I was really hoping to make the choice between lossy or error unnecessary by allowing a byte-string version of format! and friends that can be used with WTF-8.
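A minimal sketch of the "stay in OsString and let PathBuf do the joining" approach from the list above, applied to the earlier "add a suffix" case. `with_suffix` is a hypothetical helper; it relies only on `Path::file_name`, `OsString::push`, and `Path::with_file_name` from the standard library.

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Hypothetical helper: append `suffix` to the final path component
/// without converting it to `String`, so no lossy/error decision is needed.
/// Returns `None` when the path has no file name (e.g. `/` or `..`).
fn with_suffix(path: &Path, suffix: &str) -> Option<PathBuf> {
    let mut name: OsString = path.file_name()?.to_os_string();
    name.push(suffix);
    Some(path.with_file_name(name))
}
```

This works on every platform, but only because appending is one of the few manipulations `OsString` supports; anything closer to `format!` still has no equivalent.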

@epage epage added C-tracking-issue Category: A tracking issue for an unstable feature A-ergonomics Area: Ease of solving CLI related problems (e.g. filesystem interactions) labels Aug 23, 2022
4 participants