Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why is Decoder::decode returning a Cow? #512

Closed
ok-nick opened this issue Nov 23, 2022 · 2 comments
Closed

Why is Decoder::decode returning a Cow? #512

ok-nick opened this issue Nov 23, 2022 · 2 comments
Labels
encoding Issues related to support of various encodings of the XML documents question

Comments

@ok-nick
Copy link

ok-nick commented Nov 23, 2022

According to:

pub fn decode<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> {
#[cfg(not(feature = "encoding"))]
let decoded = Ok(Cow::Borrowed(std::str::from_utf8(bytes)?));
#[cfg(feature = "encoding")]
let decoded = decode(bytes, self.encoding);
decoded
}
}

Decoder::decode returns an owned Cow when feature encoding is enabled, otherwise a borrowed one. How come there aren't two definitions for decode, one defined for feature encoding that returns an owned String and one defined without the feature that returns a borrowed str?

I ran across this issue trying to return a borrowed string (or bytes) from a function, but unfortunately the Cow is owned, so I couldn't. My only solution would be to return the owned Cow, then further pattern match on that to get the borrowed value out. If I wanted to convert the Cow to bytes, it would be even more of a hassle.

On a side note, would it be useful to have a function to return the raw bytes rather than checking its valid UTF-8? I took a quick peek around the function and it seems quick-xml has an additional abstraction that pipes encodings to encoding_rs. Maybe the user should handle it themselves for more flexibility? Or perhaps quick-xml should be split into a lower-level crate that handles the bare bones of XML parsing, then have an abstraction on top for this kind of flexibility?

@Mingun Mingun added question encoding Issues related to support of various encodings of the XML documents labels Nov 23, 2022
@Mingun
Copy link
Collaborator

Mingun commented Nov 23, 2022

How come there aren't two definitions for decode, one defined for feature encoding that returns an owned String and one defined without the feature that returns a borrowed str?

That is how it was implemented before and we specially change it in #180 because API that changes depending on the feature flags is hard to use -- features are additive and different dependencies can enable different set of features.

I ran across this issue trying to return a borrowed string (or bytes) from a function, but unfortunately the Cow is owned, so I couldn't. My only solution would be to return the owned Cow, then further pattern match on that to get the borrowed value out. If I wanted to convert the Cow to bytes, it would be even more of a hassle.

If you get an owned Cow that means that some decoding was performed which requires allocation. I cannot understand how you expect to get a borrowed string here. That is only possible if memory would be owned by decode function (which is impossible) or it will leak.

On a side note, would it be useful to have a function to return the raw bytes rather than checking its valid UTF-8?

That is how it worked internally, decoding is done only when you get content of events, which require you to manually call corresponding function. So if you don't need content you'll not pay for decoding (but this is dangerous because you can pass invalid documents). I cannot imagine a situation where you'll need raw bytes of XML stream.

Or perhaps quick-xml should be split into a lower-level crate that handles the bare bones of XML parsing, then have an abstraction on top for this kind of flexibility?

Parsing of XML should be done after decoding. XML is defined in terms of characters, not bytes. Right now we do the opposite and this is the case why UTF-16 and ISO-2022-JP are not supported. For the other supported encodings the are no difference in practice, because them are ASCII-compatible and all XML markup characters are ASCII.

@ok-nick
Copy link
Author

ok-nick commented Nov 23, 2022

I was responding when I realized read_to_text internally calls read_to_end and then decode. The bytes are technically exposed through this function so I can avoid the Cow.

If you get an owned Cow that means that some decoding was performed which requires allocation. I cannot understand how you expect to get a borrowed string here. That is only possible if memory would be owned by decode function (which is impossible) or it will leak.

Sorry, I should've been a little more clear on this. I had an owned Cow that contains a borrowed value, so Cow::Borrowed(...).

@ok-nick ok-nick closed this as completed Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
encoding Issues related to support of various encodings of the XML documents question
Projects
None yet
Development

No branches or pull requests

2 participants