
Improve &str <-> &CStr16 lifecycle #73

Closed
HadrienG2 opened this issue Dec 14, 2018 · 9 comments
HadrienG2 commented Dec 14, 2018

When interfacing Rust with UEFI, one of the API mismatches that we need to take care of is that Rust uses UTF-8 Pascal-style strings while UEFI uses UCS-2 null-terminated strings.

Currently, we have...

  • A little bit of code duplication on UTF-8 to UCS-2 conversions.
  • The std-inspired CStr16 type for handling borrowed UEFI strings more conveniently and safely.
  • Some API entry points that expect &CStr16.
  • Others that expect &str and silently convert to UCS-2.
  • In one occurrence (File::open), this restricts possible inputs due to finite buffer size.
    • In all cases, this makes the API more confusing by mixing char conversions with other concerns.
    • In a few cases like logging, we actually don't have any other choice.

I expect interoperability with UCS-2 to be a common problem in realistic UEFI applications, so we may want to improve this situation by (1) standardizing conversions between UTF-8 and UCS-2 and (2) eliminating "hidden" conversions between the two whenever possible.


Coming from this perspective, here are some ideas at the core uefi-rs layer...

  • One utility fn that takes an &str and an &mut [u16] buffer as input, and returns either a &CStr16, a char conversion error, or a "buffer full" error.
  • One reciprocal utility fn that goes from a &CStr16 and an &mut [u8] buffer to a &str, with a similar set of errors.
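
As a sketch of what the first utility could look like (function name and error variants are invented for illustration, and a real version would return a &CStr16 rather than a raw &[u16] slice):

```rust
// Hypothetical sketch of the proposed &str -> UCS-2 conversion utility.
#[derive(Debug, PartialEq)]
enum ConvertError {
    /// A char is outside the Basic Multilingual Plane, so not UCS-2.
    NonUcs2Char,
    /// The output buffer cannot hold the converted string plus NUL.
    BufferFull,
}

/// Convert `input` to a NUL-terminated UCS-2 string in `buf`, returning
/// the used prefix of the buffer (including the trailing NUL).
fn str_to_ucs2<'a>(input: &str, buf: &'a mut [u16]) -> Result<&'a [u16], ConvertError> {
    let mut len = 0;
    for c in input.chars() {
        let v = c as u32;
        if v > 0xFFFF {
            return Err(ConvertError::NonUcs2Char);
        }
        if len >= buf.len() {
            return Err(ConvertError::BufferFull);
        }
        buf[len] = v as u16;
        len += 1;
    }
    if len >= buf.len() {
        return Err(ConvertError::BufferFull);
    }
    buf[len] = 0; // trailing NUL
    Ok(&buf[..=len])
}
```

The reciprocal direction would walk the u16 code units and feed them through char conversion into a caller-provided &mut [u8] buffer.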

...and more ideas for the convenience uefi-exts layer, where allocations are allowed:

  • A TryFrom<&CStr16> implementation for String.
  • A CString16 type with a TryFrom<&str> implementation.
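
A minimal sketch of the proposed CString16 (names and error variants are hypothetical, not the eventual uefi-rs API):

```rust
use std::convert::TryFrom;

// Hypothetical sketch of an owned, NUL-terminated UCS-2 string.
#[derive(Debug, PartialEq)]
struct CString16(Vec<u16>);

#[derive(Debug, PartialEq)]
enum FromStrError {
    /// A char is outside the Basic Multilingual Plane.
    NonUcs2Char,
    /// The input contains an interior NUL, which a C-style string cannot hold.
    InteriorNul,
}

impl TryFrom<&str> for CString16 {
    type Error = FromStrError;

    fn try_from(s: &str) -> Result<Self, Self::Error> {
        let mut units = Vec::with_capacity(s.chars().count() + 1);
        for c in s.chars() {
            match c as u32 {
                0 => return Err(FromStrError::InteriorNul),
                v if v > 0xFFFF => return Err(FromStrError::NonUcs2Char),
                v => units.push(v as u16),
            }
        }
        units.push(0); // trailing NUL
        Ok(Self(units))
    }
}
```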

Once we have that, automagic conversion from &str to UCS-2 should be eliminated whenever possible, in favor of taking a &CStr16.

If the user has enabled uefi-services conveniences, then converting from &str to &CStr16 is just a .try_from() away. And if the user wants to stick with the core crate, calling the underlying conversion utilities is more involved, but not terribly difficult.


HadrienG2 commented Jan 8, 2019

I started working on this, eliminating some code duplication between Latin-1 and UCS-2 as a first step. Thinking about it further, this work might also be a good occasion to fix a tricky issue in the way we split text in our fmt::Write interface.

Warning: Unicode discussion ahead, please momentarily disregard your brain's sense of sanity.

As currently designed, our fmt::Write implementation writes as many UCS-2 code units as possible into a fixed-size stack-allocated u16 buffer, then adds a trailing NUL, sends the result to UEFI as a string to be printed out, and starts over. Since UCS-2 is a fixed-length encoding, this means that we are splitting the output string on code point boundaries.
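
A simplified sketch of that chunking loop (buffer size and names are invented for illustration; the real implementation lives in a fmt::Write impl and reports errors instead of panicking):

```rust
// Write `s` as a sequence of NUL-terminated UCS-2 chunks, calling
// `flush` for each chunk, mirroring the current fmt::Write strategy.
fn write_chunked(s: &str, mut flush: impl FnMut(&[u16])) {
    const BUF_LEN: usize = 8; // tiny, to make the chunking visible
    let mut buf = [0u16; BUF_LEN];
    let mut len = 0;
    for c in s.chars() {
        let v = c as u32;
        assert!(v <= 0xFFFF, "char not representable in UCS-2");
        buf[len] = v as u16;
        len += 1;
        if len == BUF_LEN - 1 {
            // Buffer full (minus room for the NUL): terminate and flush.
            buf[len] = 0;
            flush(&buf[..=len]);
            len = 0;
        }
    }
    if len > 0 {
        buf[len] = 0;
        flush(&buf[..=len]);
    }
}
```

Because the buffer fills one code unit at a time, a flush can land between any two code points, which is exactly the splitting hazard discussed in the following paragraphs.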

Code points may not be the right boundary, however, because even UCS-2 has combining character sequences. For example, one way to encode the French letter "â" is the ['a', '^'] code point sequence. Splitting between these code points means that our next string will start with a lone '^' combining character, which the UEFI firmware is almost guaranteed to ignore or mis-render, so we don't want to split those sequences across printed string boundaries.

Another example of a char sequence that we probably don't want to split in order to keep UEFI's task easy is "\r\n". To keep this one together too, we need to split our strings on grapheme cluster boundaries, a slightly coarser granularity than combining character sequences which the no_std-compatible unicode_segmentation crate can delineate for us.
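
The hazard is easy to reproduce in plain Rust, no firmware required (helper name invented for illustration):

```rust
/// Naively split `s` right after its first code point, the kind of cut a
/// fixed-size chunking loop can make between any two code points.
fn split_after_first_code_point(s: &str) -> (&str, &str) {
    let first_len = s.chars().next().map_or(0, |c| c.len_utf8());
    s.split_at(first_len)
}
```

For "â" written as 'a' followed by U+0302 COMBINING CIRCUMFLEX ACCENT, the split leaves a stranded combining mark at the start of the second half, which is precisely what a renderer is likely to drop or mis-render.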

If we wanted to be really defensive about UEFI messing up complex scripts such as Arabic and Devanagari, we could even go one step further and split on word boundaries. However, that would cause problems with long words (and non-word entities such as ASCII art) overflowing our buffer and clogging the output, so I'm willing to trust UEFI to handle split words correctly, and wait for buggy firmware reports from users of such scripts before taking any further defensive action.

All things considered, I think the best equilibrium between simplicity, handling of diverse inputs, and defensive programming against diverging Unicode interpretations in UEFI firmware is to split our output on grapheme cluster boundaries. I'll try to implement that as I extract the current fmt::Write implementation into a more general utility for streaming conversion from &str to &CStr16.


HadrienG2 commented Jan 10, 2019

I now have a satisfactory encoder (UTF-8 -> {Latin-1, UCS-2}), so I started experimenting with a decoder... and found an interesting issue. unicode-segmentation only works with UTF-8 strings, whereas the best I can make out of a UCS-2 string without memory allocation is an Iterator<Item=char>. Hmmm...

Here are the options that I can think about, in order of decreasing preference:

  1. Implement support for Iterator<Item=char> in unicode_segmentation. The code already uses this approach internally, so it is only a matter of exposing it in the API, and there has been some past demand for it.
  2. Convert as many chars as I can to UTF-8, then run the segmentation algorithm and reject the last grapheme cluster when not all input has been consumed (as that cluster may have been truncated). This means that we waste a bit of buffer space, but shouldn't be too bad in practice.
  3. Do not provide the same guarantees on encoding and decoding, and split on code point boundaries for decoding. I would rather not go there.
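
A sketch of the first half of option (2): drain an Iterator<Item=char> into a UTF-8 buffer without ever splitting a code point (names hypothetical). The grapheme segmentation pass would then run on the returned &str and reject the final, possibly-truncated cluster:

```rust
use std::iter::Peekable;

/// Encode as many whole chars as fit into `buf`, leaving the rest of
/// the iterator untouched for the next chunk.
fn encode_chunk<'a, I: Iterator<Item = char>>(
    chars: &mut Peekable<I>,
    buf: &'a mut [u8],
) -> &'a str {
    let mut len = 0;
    while let Some(&c) = chars.peek() {
        if len + c.len_utf8() > buf.len() {
            break; // no room for this char; keep it for the next chunk
        }
        c.encode_utf8(&mut buf[len..]);
        len += c.len_utf8();
        chars.next();
    }
    // Only complete UTF-8 sequences were written, so this cannot fail.
    std::str::from_utf8(&buf[..len]).unwrap()
}
```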

These approaches can be combined, for example by implementing fix (1) in unicode_segmentation and workaround (2) in uefi_rs, with the aim of eventually moving to (1) when the next release of unicode_segmentation comes out. This is probably what I'll try doing unless you have a better idea.


GabrielMajeri commented Jan 10, 2019

I think exposing this in the unicode_segmentation API is the best solution, although I do not know how extensive these modifications would be for their API.

It definitely feels like we should maintain the UTF-8 <-> UCS-2 conversions in a separate crate to simplify testing and development. I don't know if there are other legacy APIs which still use UCS-2, but perhaps other Rust users will find a use for the code. I already have a ucs-2 crate, so tell me if you want to reuse the name and / or repo.


HadrienG2 commented Jan 10, 2019

I like the idea of extracting the legacy text handling to a crate, but not the scope of the ucs-2 crate:

  • Though mostly a UCS-2 API, UEFI uses Latin-1 in a couple of places (look up "Char8"). It is quite easy to write generic code that handles UCS-2 well and Latin-1 reasonably, and I think that's the best trade-off in this case.
  • I'm not sure if it makes that much sense to separate character encoding and basic string handling (aka "CStr"). There are things like line endings that you cannot handle without having a string-wide view.

With this in mind, I would propose to build a more general crate for fixed-width-char string handling, with an initial focus on Latin1/UCS-2 strings with Windows line endings. Would you like that?
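
The generic code mentioned above could hang off a small trait abstracting over the code unit (a sketch with invented names, not a concrete design for the crate):

```rust
/// A fixed-width character type: every code unit is one character,
/// with no multi-unit sequences (setting aside combining characters).
trait FixedWidthChar: Copy {
    /// Convert from a Unicode scalar value, if representable.
    fn from_char(c: char) -> Option<Self>;
}

/// Latin-1: code points 0x00..=0xFF, as used by UEFI's Char8.
impl FixedWidthChar for u8 {
    fn from_char(c: char) -> Option<Self> {
        let v = c as u32;
        if v <= 0xFF { Some(v as u8) } else { None }
    }
}

/// UCS-2: the Basic Multilingual Plane, as used by UEFI's Char16.
impl FixedWidthChar for u16 {
    fn from_char(c: char) -> Option<Self> {
        let v = c as u32;
        if v <= 0xFFFF { Some(v as u16) } else { None }
    }
}
```

Encoding and CStr-style borrowed-string code could then be written once, generic over FixedWidthChar.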

@GabrielMajeri

It sounds like an extensive project, but if you already have an idea for a more general crate then I see no issue with using it for uefi-rs.


HadrienG2 commented Jan 12, 2019

On the grapheme clustering front, I reached the conclusion that the unicode_segmentation logic probably cannot be adapted to Iterator<Item=char>, because its API and implementation use UTF-8 byte indices extensively in order to patch together incomplete views of an input string, and AFAIK it is not possible to replicate all of these uses in an Iterator-based API.

So I'm probably going back to plan 2: encode a chunk of the UCS-2 input to UTF-8 and run the grapheme segmentation algorithm on this incomplete output.

@HadrienG2

Okay, 6+ weeks have elapsed and I have not managed to make any significant progress on this, or any of my other uefi-rs activities for that matter. I think it's fair to say that I cannot dedicate enough spare time to work on uefi-rs at the moment, and the situation seems unlikely to change in the next months.

In case anyone else feels like taking over this effort, I pushed a branch called str-cleanup with the current state of my unfinished work...

@nicholasbishop

Summarizing the current status a bit.

Of the utilities mentioned in the issue description, some now exist:

CStr16::from_str_with_buf
CStr16::as_string
TryFrom<&str> for CString16

It would be good to go through and fill in some additional conversions (perhaps CStr16::to_str_with_buf), as well as clean up naming (as_string should probably be to_string) and improve documentation.

It looks like there are still some implicit str -> CStr16 conversions in src/proto/media/file that could be cleaned up.

For the rest of the discussion about incorrectly splitting combining characters, that seems like its own big can of worms :)

nicholasbishop added a commit to nicholasbishop/uefi-rs that referenced this issue Feb 9, 2022
Since `CStr16` implements `Display`, the
[`ToString`](https://doc.rust-lang.org/alloc/string/trait.ToString.html)
trait is automatically implemented. That provides `to_string` as long as
`alloc` is enabled, so `as_string` isn't needed.

rust-osdev#73
GabrielMajeri pushed a commit that referenced this issue Feb 9, 2022
nicholasbishop added a commit that referenced this issue Feb 18, 2022
Changed it to take a `&CStr16` rather than implicitly converting from a
`&str`.

#73
GabrielMajeri pushed a commit that referenced this issue Feb 19, 2022
nicholasbishop added a commit to nicholasbishop/uefi-rs that referenced this issue Feb 21, 2022
The file info types are now constructed directly from a `&CStr16`
instead of internally converting from a `&str`.

rust-osdev#73
nicholasbishop added a commit to nicholasbishop/uefi-rs that referenced this issue Feb 22, 2022
nicholasbishop added a commit to nicholasbishop/uefi-rs that referenced this issue Feb 24, 2022
The file info types are now constructed directly from a `&CStr16`
instead of internally converting from a `&str`. This also removes
`FileInfoCreationError::InvalidChar`, since that error is no longer
possible.

rust-osdev#73
GabrielMajeri pushed a commit that referenced this issue Feb 25, 2022
nicholasbishop added a commit to nicholasbishop/uefi-rs that referenced this issue Feb 27, 2022
The spec describes `LoadOptions` as "A pointer to the image’s binary
load options. See the OptionalData parameter in [section 3.1.3 Load
Options] of the Boot Manager chapter for information on the source of
the LoadOptions data." The `OptionalData` field is described as "The
remaining bytes in the load option descriptor are a binary data buffer
that is passed to the loaded image."

So there's no guarantee made about the contents of `LoadOptions`; it's
arbitrary binary data as far as the UEFI spec is concerned. So it's not
necessarily safe to treat the load_options as a `Char16` pointer, since
it might not be aligned and might not be null-terminated. That said, the
data is *usually* a null-terminated UCS-2 string. The UEFI Shell spec
specifies that command-line parameters are passed that way.

To safely support both cases, change the `load_options` field to a `u8`
pointer, drop the `load_options` method that converts the data to a
`&str`, and add two new methods: `load_options_as_bytes` and
`load_options_as_cstr16`. The `set_load_options` has also been changed
to take a `u8` pointer instead of a `Char16` pointer.

rust-osdev#73
GabrielMajeri pushed a commit that referenced this issue Feb 27, 2022
nicholasbishop commented Feb 27, 2022

I did some research on the combining-characters issue and, tl;dr I think we can safely ignore it.

> Code points may not be the right boundary, however, because even UCS-2 has combining character sequences. For example, one way to encode the "â" French letter is a ['a', '^'] code point sequences. Splitting between these code points means that our next string will start with a lone '^' combining character, which the UEFI firmware is almost guaranteed to ignore or mis-render, so we don't want to split those sequences across printed string boundaries.

I did some tests with combining characters, and the two UEFI implementations I tried them on always ignored the combining character. For example, 0x61 is a lowercase "a", and 0x300 is a combining accent, so the UCS-2 string [0x0061, 0x0300] should render as à. You can see that working fine with UTF-8 in the Rust playground. But when I tried it in QEMU with OVMF, it just renders as "a" without the accent. (Note that I'm referring to rendering in the QEMU GUI window, not the output from the serial console.) If I instead use the UCS-2 string [0x00e0], it renders properly as à.

I tried running the test on real hardware as well. Test machine was a Lenovo Thinkpad X1 Carbon 9th Gen, and it also did not render the accent when written as a combining character.

Test code:
fn test_char_split(image: Handle, st: &SystemTable<Boot>) {
    let bt = st.boot_services();
    let output_handle = bt.find_handles::<Output>().unwrap_success()[0];
    let output = bt
        .open_protocol::<Output>(
            OpenProtocolParams {
                handle: output_handle,
                agent: image,
                controller: None,
            },
            OpenProtocolAttributes::Exclusive,
        )
        .unwrap_success();
    let output = unsafe { &mut *output.interface.get() };

    #[rustfmt::skip]
    let s = CStr16::from_u16_with_nul(&[
        // à
        0x00e0,
        // a
        0x0061,
        // combining accent
        0x0300,
        // CR
        0x0d,
        // LF
        0x0a,
        // Null
        0x0000
    ]).unwrap();

    output.output_string(s).unwrap_success();

    bt.stall(10_000_000);
}

Of course that's just two UEFI implementations -- it is entirely possible that some other implementations do handle combining characters, but in the absence of evidence of such implementations I don't think we need to bother with a more complex implementation of fmt::Write.

I also found evidence that conforming UCS-2 implementations are explicitly allowed to ignore combining characters. There are three implementation levels defined, and at level 1 all combining characters can be ignored. I couldn't find anything in the UEFI spec itself indicating what level is supposed to be supported, so it seems conforming UEFI firmware is fine using level 1.

It might be useful in some applications to normalize a string such that combining characters are converted to composite characters, but I think that doesn't need to be directly handled in uefi-rs itself.


Assuming I haven't gotten my reasoning above wrong, I think everything here is done. I've grepped the source for &.*str and I think we've now removed all the implicit conversions except for fmt::Write, which is fine to keep as is. We also have a reasonably convenient set of conversions between str/CStr16/CString16.
