Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip BOM from the event stream, add a method to Writer for writing BOM #459

Merged
merged 3 commits into from Aug 18, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
20 changes: 6 additions & 14 deletions Changelog.md
Expand Up @@ -20,8 +20,6 @@
- [#180]: Make `Decoder` struct public. You already had access to it via the
`Reader::decoder()` method, but could not name it in the code. Now the preferred
way to access decoding functionality is via this struct
- [#191]: New event variant `StartText` emitted for bytes before the XML declaration
or a start comment or a tag. For streams with BOM this event will contain a BOM
- [#395]: Add support for XML Schema `xs:list`
- [#324]: `Reader::from_str` / `Deserializer::from_str` / `from_str` now ignore
the XML declared encoding and always use UTF-8
Expand All @@ -42,7 +40,7 @@
- [#450]: Added support of asynchronous [tokio](https://tokio.rs/) readers
- [#455]: Change return type of all `read_to_end*` methods to return a span between tags
- [#455]: Added `Reader::read_text` method to return a raw content (including markup) between tags

- [#459]: Added a `Writer::write_bom()` method for inserting a Byte-Order-Mark into the document.

### Bug Fixes

Expand Down Expand Up @@ -100,15 +98,6 @@
`Decoder::decode()` and `Decoder::decode_with_bom_removal()`.
Use `reader.decoder().decode_*(...)` instead of `reader.decode_*(...)` for now.
`Reader::encoding()` is replaced by `Decoder::encoding()` as well
- [#191]: Remove poorly designed `BytesText::unescape_and_decode_without_bom()` and
`BytesText::unescape_and_decode_without_bom_with_custom_entities()`. Although these methods worked
as expected, this was only due to good luck. They was replaced by the
`BytesStartText::decode_with_bom_removal()`:
- conceptually, you should decode BOM only for the first `Text` event from the
reader (since now `StartText` event is emitted instead for this)
- text before the first tag is not an XML content at all, so it is meaningless
to try to unescape something in it

- [#180]: Eliminated the differences in the decoding API when feature `encoding` enabled and when it is
disabled. Signatures of functions are now the same regardless of whether or not the feature is
enabled, and an error will be returned instead of performing replacements for invalid characters
Expand Down Expand Up @@ -186,11 +175,14 @@
- [#440]: Removed `Deserializer::from_slice` and `quick_xml::de::from_slice` methods because deserializing from a byte
array cannot guarantee borrowing due to possible copying while decoding.

- [#455]: Removed `Reader::read_text_into` which is only not a better wrapper over match on `Event::Text`
- [#455]: Removed `Reader::read_text_into` which is just a thin wrapper over match on `Event::Text`

- [#456]: Reader and writer stuff grouped under `reader` and `writer` modules.
You still can use re-exported definitions from a crate root

- [#459]: Made the `Writer::write()` method non-public as writing random bytes to a document is not generally useful or desirable.
- [#459]: BOM bytes are no longer emitted as `Event::Text`. To write a BOM, use `Writer::write_bom()`.

### New Tests

- [#9]: Added tests for incorrect nested tags in input
Expand Down Expand Up @@ -234,7 +226,7 @@
[#450]: https://github.com/tafia/quick-xml/pull/450
[#455]: https://github.com/tafia/quick-xml/pull/455
[#456]: https://github.com/tafia/quick-xml/pull/456

[#459]: https://github.com/tafia/quick-xml/pull/459

## 0.23.0 -- 2022-05-08

Expand Down
14 changes: 0 additions & 14 deletions benches/microbenches.rs
Expand Up @@ -118,20 +118,6 @@ fn read_resolved_event_into(c: &mut Criterion) {
/// Benchmarks, how fast individual event parsed
fn one_event(c: &mut Criterion) {
let mut group = c.benchmark_group("One event");
group.bench_function("StartText", |b| {
let src = "Hello world!".repeat(512 / 12);
b.iter(|| {
let mut r = Reader::from_str(&src);
let mut nbtxt = criterion::black_box(0);
r.check_end_names(false).check_comments(false);
match r.read_event() {
Ok(Event::StartText(e)) => nbtxt += e.len(),
something_else => panic!("Did not expect {:?}", something_else),
};

assert_eq!(nbtxt, 504);
})
});

group.bench_function("Start", |b| {
let src = format!(r#"<hello target="{}">"#, "world".repeat(512 / 5));
Expand Down
8 changes: 0 additions & 8 deletions src/de/mod.rs
Expand Up @@ -929,10 +929,6 @@ impl<'i, R: BufRead> XmlRead<'i> for IoReader<R> {
let event = loop {
let e = self.reader.read_event_into(&mut self.buf)?;
match e {
//TODO: Probably not the best idea treat StartText as usual text
// Usually this event will represent a BOM
// Changing this requires review of the serde-de::top_level::one_element test
Event::StartText(e) => break Ok(DeEvent::Text(e.into_owned().into())),
Event::Start(e) => break Ok(DeEvent::Start(e.into_owned())),
Event::End(e) => break Ok(DeEvent::End(e.into_owned())),
Event::Text(e) => break Ok(DeEvent::Text(e.into_owned())),
Expand Down Expand Up @@ -974,10 +970,6 @@ impl<'de> XmlRead<'de> for SliceReader<'de> {
loop {
let e = self.reader.read_event()?;
match e {
//TODO: Probably not the best idea treat StartText as usual text
// Usually this event will represent a BOM
// Changing this requires review of the serde-de::top_level::one_element test
Event::StartText(e) => break Ok(DeEvent::Text(e.into())),
Event::Start(e) => break Ok(DeEvent::Start(e)),
Event::End(e) => break Ok(DeEvent::End(e)),
Event::Text(e) => break Ok(DeEvent::Text(e)),
Expand Down
28 changes: 20 additions & 8 deletions src/encoding.rs
Expand Up @@ -9,6 +9,18 @@ use encoding_rs::{Encoding, UTF_16BE, UTF_16LE, UTF_8};
use crate::Error;
use crate::Result;

/// Unicode "byte order mark" (\u{FEFF}) encoded as UTF-8.
/// See <https://unicode.org/faq/utf_bom.html#bom1>
pub(crate) const UTF8_BOM: &[u8] = &[0xEF, 0xBB, 0xBF];
/// Unicode "byte order mark" (\u{FEFF}) encoded as UTF-16 with little-endian byte order.
/// See <https://unicode.org/faq/utf_bom.html#bom1>
#[cfg(feature = "encoding")]
pub(crate) const UTF16_LE_BOM: &[u8] = &[0xFF, 0xFE];
/// Unicode "byte order mark" (\u{FEFF}) encoded as UTF-16 with big-endian byte order.
/// See <https://unicode.org/faq/utf_bom.html#bom1>
#[cfg(feature = "encoding")]
pub(crate) const UTF16_BE_BOM: &[u8] = &[0xFE, 0xFF];

/// Decoder of byte slices into strings.
///
/// If feature `encoding` is enabled, this encoding taken from the `"encoding"`
Expand Down Expand Up @@ -62,7 +74,7 @@ impl Decoder {
///
/// If you instead want to use XML declared encoding, use the `encoding` feature
pub fn decode_with_bom_removal<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> {
let bytes = if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
let bytes = if bytes.starts_with(UTF8_BOM) {
&bytes[3..]
} else {
bytes
Expand Down Expand Up @@ -131,19 +143,19 @@ pub fn decode_with_bom_removal<'b>(bytes: &'b [u8]) -> Result<Cow<'b, str>> {

#[cfg(feature = "encoding")]
fn split_at_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> (&'b [u8], &'b [u8]) {
if encoding == UTF_8 && bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
if encoding == UTF_8 && bytes.starts_with(UTF8_BOM) {
bytes.split_at(3)
} else if encoding == UTF_16LE && bytes.starts_with(&[0xFF, 0xFE]) {
} else if encoding == UTF_16LE && bytes.starts_with(UTF16_LE_BOM) {
bytes.split_at(2)
} else if encoding == UTF_16BE && bytes.starts_with(&[0xFE, 0xFF]) {
} else if encoding == UTF_16BE && bytes.starts_with(UTF16_BE_BOM) {
bytes.split_at(2)
} else {
(&[], bytes)
}
}

#[cfg(feature = "encoding")]
fn remove_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> &'b [u8] {
pub(crate) fn remove_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> &'b [u8] {
let (_, bytes) = split_at_bom(bytes, encoding);
bytes
}
Expand Down Expand Up @@ -172,9 +184,9 @@ fn remove_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> &'b [u8] {
pub fn detect_encoding(bytes: &[u8]) -> Option<&'static Encoding> {
match bytes {
// with BOM
_ if bytes.starts_with(&[0xFE, 0xFF]) => Some(UTF_16BE),
_ if bytes.starts_with(&[0xFF, 0xFE]) => Some(UTF_16LE),
_ if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) => Some(UTF_8),
_ if bytes.starts_with(UTF16_BE_BOM) => Some(UTF_16BE),
_ if bytes.starts_with(UTF16_LE_BOM) => Some(UTF_16LE),
_ if bytes.starts_with(UTF8_BOM) => Some(UTF_8),

// without BOM
_ if bytes.starts_with(&[0x00, b'<', 0x00, b'?']) => Some(UTF_16BE), // Some BE encoding, for example, UTF-16 or ISO-10646-UCS-2
Expand Down
122 changes: 0 additions & 122 deletions src/events/mod.rs
Expand Up @@ -50,69 +50,6 @@ use crate::name::{LocalName, QName};
use crate::utils::write_cow_string;
use attributes::{Attribute, Attributes};

/// Text that appeared before an XML declaration, a start element or a comment.
///
/// In well-formed XML it could contain a Byte-Order-Mark (BOM). If this event
/// contains something else except BOM, the XML should be considered ill-formed.
///
/// This is a reader-only event. If you need to write a text before the first tag,
/// use the [`BytesText`] event.
#[derive(Debug, Clone, Eq, PartialEq)]
pub struct BytesStartText<'a> {
content: BytesText<'a>,
}

impl<'a> BytesStartText<'a> {
/// Converts the event into an owned event.
pub fn into_owned(self) -> BytesStartText<'static> {
BytesStartText {
content: self.content.into_owned(),
}
}

/// Extracts the inner `Cow` from the `BytesStartText` event container.
#[inline]
pub fn into_inner(self) -> Cow<'a, [u8]> {
self.content.into_inner()
}

/// Converts the event into a borrowed event.
#[inline]
pub fn borrow(&self) -> BytesStartText {
BytesStartText {
content: self.content.borrow(),
}
}

/// Decodes bytes of event, stripping byte order mark (BOM) if it is presented
/// in the event.
///
/// This method does not unescapes content, because no escape sequences can
/// appeared in the BOM or in the text before the first tag.
pub fn decode_with_bom_removal(&self) -> Result<String> {
//TODO: Fix lifetime issue - it should be possible to borrow string
let decoded = self.content.decoder.decode_with_bom_removal(&*self)?;

Ok(decoded.to_string())
}
}

impl<'a> Deref for BytesStartText<'a> {
type Target = BytesText<'a>;

fn deref(&self) -> &Self::Target {
&self.content
}
}

impl<'a> From<BytesText<'a>> for BytesStartText<'a> {
fn from(content: BytesText<'a>) -> Self {
Self { content }
}
}

////////////////////////////////////////////////////////////////////////////////////////////////////

/// Opening tag data (`Event::Start`), with optional attributes.
///
/// `<name attr="value">`.
Expand Down Expand Up @@ -797,12 +734,6 @@ impl<'a> Deref for BytesText<'a> {
}
}

impl<'a> From<BytesStartText<'a>> for BytesText<'a> {
fn from(content: BytesStartText<'a>) -> Self {
content.content
}
}

////////////////////////////////////////////////////////////////////////////////////////////////////

/// CDATA content contains unescaped data from the reader. If you want to write them as a text,
Expand Down Expand Up @@ -941,56 +872,6 @@ impl<'a> Deref for BytesCData<'a> {
/// [`Reader::read_event_into`]: crate::reader::Reader::read_event_into
#[derive(Clone, Debug, Eq, PartialEq)]
pub enum Event<'a> {
/// Text that appeared before the first opening tag or an [XML declaration].
/// [According to the XML standard][std], no text allowed before the XML
/// declaration. However, if there is a BOM in the stream, some data may be
/// present.
///
/// When this event is generated, it is the very first event emitted by the
/// [`Reader`], and there can be the only one such event.
///
/// The [`Writer`] writes content of this event "as is" without encoding or
/// escaping. If you write it, it should be written first and only one time
/// (but writer does not enforce that).
///
/// # Examples
///
/// ```
/// # use pretty_assertions::assert_eq;
/// use std::borrow::Cow;
/// use quick_xml::events::Event;
/// use quick_xml::reader::Reader;
///
/// // XML in UTF-8 with BOM
/// let xml = b"\xEF\xBB\xBF<?xml version='1.0'?>".as_ref();
/// let mut reader = Reader::from_reader(xml);
/// let mut buf = Vec::new();
/// let mut events_processed = 0;
/// loop {
/// match reader.read_event_into(&mut buf) {
/// Ok(Event::StartText(e)) => {
/// assert_eq!(events_processed, 0);
/// // Content contains BOM
/// assert_eq!(e.into_inner(), Cow::Borrowed(b"\xEF\xBB\xBF"));
/// }
/// Ok(Event::Decl(_)) => {
/// assert_eq!(events_processed, 1);
/// }
/// Ok(Event::Eof) => {
/// assert_eq!(events_processed, 2);
/// break;
/// }
/// e => panic!("Unexpected event {:?}", e),
/// }
/// events_processed += 1;
/// }
/// ```
///
/// [XML declaration]: Event::Decl
/// [std]: https://www.w3.org/TR/xml11/#NT-document
/// [`Reader`]: crate::reader::Reader
/// [`Writer`]: crate::writer::Writer
StartText(BytesStartText<'a>),
/// Start tag (with attributes) `<tag attr="value">`.
Start(BytesStart<'a>),
/// End tag `</tag>`.
Expand Down Expand Up @@ -1018,7 +899,6 @@ impl<'a> Event<'a> {
/// buffer used when reading but incurring a new, separate allocation.
pub fn into_owned(self) -> Event<'static> {
match self {
Event::StartText(e) => Event::StartText(e.into_owned()),
Event::Start(e) => Event::Start(e.into_owned()),
Event::End(e) => Event::End(e.into_owned()),
Event::Empty(e) => Event::Empty(e.into_owned()),
Expand All @@ -1036,7 +916,6 @@ impl<'a> Event<'a> {
#[inline]
pub fn borrow(&self) -> Event {
match self {
Event::StartText(e) => Event::StartText(e.borrow()),
Event::Start(e) => Event::Start(e.borrow()),
Event::End(e) => Event::End(e.borrow()),
Event::Empty(e) => Event::Empty(e.borrow()),
Expand All @@ -1056,7 +935,6 @@ impl<'a> Deref for Event<'a> {

fn deref(&self) -> &[u8] {
match *self {
Event::StartText(ref e) => &*e,
Event::Start(ref e) | Event::Empty(ref e) => &*e,
Event::End(ref e) => &*e,
Event::Text(ref e) => &*e,
Expand Down
33 changes: 21 additions & 12 deletions src/reader/mod.rs
Expand Up @@ -286,7 +286,7 @@ pub type Span = Range<usize>;
/// subgraph _
/// direction LR
///
/// Init -- "(no event)"\nStartText --> OpenedTag
/// Init -- "(no event)"\n --> OpenedTag
/// OpenedTag -- Decl, DocType, PI\nComment, CData\nStart, Empty, End --> ClosedTag
/// ClosedTag -- "#lt;false#gt;\n(no event)"\nText --> OpenedTag
/// end
Expand All @@ -297,13 +297,13 @@ pub type Span = Range<usize>;
#[derive(Clone)]
enum ParseState {
/// Initial state in which reader stay after creation. Transition from that
/// state could produce a `StartText`, `Decl`, `Comment` or `Start` event.
/// The next state is always `OpenedTag`. The reader will never return to this
/// state. The event emitted during transition to `OpenedTag` is a `StartEvent`
/// if the first symbol not `<`, otherwise no event are emitted.
/// state could produce a `Text`, `Decl`, `Comment` or `Start` event. The next
/// state is always `OpenedTag`. The reader will never return to this state. The
/// event emitted during transition to `OpenedTag` is a `StartEvent` if the
/// first symbol not `<`, otherwise no event are emitted.
Init,
/// State after seeing the `<` symbol. Depending on the next symbol all other
/// events (except `StartText`) could be generated.
/// events could be generated.
///
/// After generating one event the reader moves to the `ClosedTag` state.
OpenedTag,
Expand Down Expand Up @@ -552,9 +552,7 @@ impl<R> Reader<R> {
read_event_impl!(self, buf, read_until_open, read_until_close)
}

/// Read until '<' is found and moves reader to an `OpenedTag` state.
///
/// Return a `StartText` event if `first` is `true` and a `Text` event otherwise
dralley marked this conversation as resolved.
Show resolved Hide resolved
/// Read until '<' is found, moves reader to an `OpenedTag` state and returns a `Text` event.
fn read_until_open<'i, B>(&mut self, buf: B, first: bool) -> Result<Event<'i>>
where
R: XmlSource<'i, B>,
Expand Down Expand Up @@ -1574,12 +1572,23 @@ mod test {
use pretty_assertions::assert_eq;

#[$test]
$($async)? fn start_text() {
let mut reader = Reader::from_str("bom");
$($async)? fn text_at_start() {
let mut reader = Reader::from_str("text");

assert_eq!(
reader.$read_event($buf) $(.$await)? .unwrap(),
Event::StartText(BytesText::from_escaped("bom").into())
Event::Text(BytesText::from_escaped("text").into())
);
}

#[$test]
#[should_panic] // Failure is expected until read_until_open() is smart enough to skip over irrelevant text events.
$($async)? fn bom_at_start() {
let mut reader = Reader::from_str("\u{feff}");

assert_eq!(
reader.$read_event($buf) $(.$await)? .unwrap(),
Event::Eof
);
}

Expand Down