Encoding module #439

merged 2 commits into from Jul 24, 2022
4 changes: 4 additions & 0 deletions Changelog.md
@@ -37,6 +37,8 @@
| |`resolve`
|`event_namespace` |`resolve_element`
|`attribute_namespace` |`resolve_attribute`
+- [#439]: Added utilities `detect_encoding()`, `decode()`, and `decode_with_bom_removal()`
+  under the `quick-xml::encoding` namespace.


### Bug Fixes
@@ -209,6 +211,8 @@
[#431]: https://github.com/tafia/quick-xml/pull/431
[#434]: https://github.com/tafia/quick-xml/pull/434
[#437]: https://github.com/tafia/quick-xml/pull/437
+[#439]: https://github.com/tafia/quick-xml/pull/439


## 0.23.0 -- 2022-05-08

2 changes: 1 addition & 1 deletion src/de/escape.rs
@@ -1,9 +1,9 @@
//! Serde `Deserializer` module

use crate::de::deserialize_bool;
+use crate::encoding::Decoder;
use crate::errors::serialize::DeError;
use crate::escape::unescape;
-use crate::reader::Decoder;
use serde::de::{DeserializeSeed, EnumAccess, VariantAccess, Visitor};
use serde::{self, forward_to_deserialize_any, serde_if_integer128};
use std::borrow::Cow;
2 changes: 1 addition & 1 deletion src/de/mod.rs
@@ -215,10 +215,10 @@ mod var;

pub use crate::errors::serialize::DeError;
use crate::{
+    encoding::Decoder,
    errors::Error,
    events::{BytesCData, BytesEnd, BytesStart, BytesText, Event},
    name::QName,
-    reader::Decoder,
    Reader,
};
use serde::de::{self, Deserialize, DeserializeOwned, Visitor};
2 changes: 1 addition & 1 deletion src/de/seq.rs
@@ -1,6 +1,6 @@
use crate::de::{DeError, DeEvent, Deserializer, XmlRead};
+use crate::encoding::Decoder;
use crate::events::BytesStart;
-use crate::reader::Decoder;
use serde::de::{DeserializeSeed, SeqAccess};

/// Check if tag `start` is included in the `fields` list. `decoder` is used to
2 changes: 1 addition & 1 deletion src/de/simple_type.rs
@@ -4,9 +4,9 @@
//! [as defined]: https://www.w3.org/TR/xmlschema11-1/#Simple_Type_Definition

use crate::de::{deserialize_bool, str2bool};
+use crate::encoding::Decoder;
use crate::errors::serialize::DeError;
use crate::escape::unescape;
-use crate::reader::Decoder;
use memchr::memchr;
use serde::de::{DeserializeSeed, Deserializer, EnumAccess, SeqAccess, VariantAccess, Visitor};
use serde::{self, serde_if_integer128};
187 changes: 187 additions & 0 deletions src/encoding.rs
@@ -0,0 +1,187 @@
//! A module for wrappers that encode / decode data.

use std::borrow::Cow;

#[cfg(feature = "encoding")]
use encoding_rs::{Encoding, UTF_16BE, UTF_16LE, UTF_8};

use crate::{Error, Result};

/// Decoder of byte slices into strings.
///
/// If feature `encoding` is enabled, the encoding is taken from the `"encoding"`
/// key of the XML declaration. UTF-8 is assumed if the XML has no `<?xml ?>`
/// declaration, the `"encoding"` key is not defined, or it names an unknown encoding.
///
/// The library supports any UTF-8 compatible encoding that the `encoding_rs`
/// crate supports. [*UTF-16 is not supported at present*][utf16].
///
/// If feature `encoding` is disabled, the decoder is always a UTF-8 decoder:
/// any XML declarations are ignored.
///
/// [utf16]: https://github.com/tafia/quick-xml/issues/158
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub struct Decoder {
    #[cfg(feature = "encoding")]
    pub(crate) encoding: &'static Encoding,
}

impl Decoder {
    pub(crate) fn utf8() -> Self {
        Decoder {
            #[cfg(feature = "encoding")]
            encoding: UTF_8,
        }
    }

    #[cfg(all(test, feature = "encoding", feature = "serialize"))]
    pub(crate) fn utf16() -> Self {
        Decoder { encoding: UTF_16LE }
    }
}

#[cfg(not(feature = "encoding"))]
impl Decoder {
    /// Decodes a UTF-8 slice regardless of the XML declaration and without any
    /// BOM handling: a BOM, if present in the `bytes`, is left in the decoded string.
    ///
    /// Returns an error in case of malformed sequences in the `bytes`.
    ///
    /// If you instead want to use the XML-declared encoding, use the `encoding` feature.
    #[inline]
    pub fn decode<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> {
        Ok(Cow::Borrowed(std::str::from_utf8(bytes)?))
    }

    /// Decodes a UTF-8 slice regardless of the XML declaration, removing the BOM
    /// if it is present in the `bytes`.
    ///
    /// Returns an error in case of malformed sequences in the `bytes`.
    ///
    /// If you instead want to use the XML-declared encoding, use the `encoding` feature.
    pub fn decode_with_bom_removal<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> {
        let bytes = if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
Collaborator Author

Aesthetic preference, I feel like this is easier to read and better communicates the non-text-ness.

            &bytes[3..]
        } else {
            bytes
        };
        self.decode(bytes)
    }
}

#[cfg(feature = "encoding")]
impl Decoder {
    /// Returns the `Reader`'s encoding.
    ///
    /// This encoding will be used by [`decode`].
    ///
    /// [`decode`]: Self::decode
    pub fn encoding(&self) -> &'static Encoding {
        self.encoding
    }

    /// Decodes the specified bytes using the encoding declared in the XML, if
    /// one was declared there, or UTF-8 otherwise. No BOM handling is performed:
    /// a BOM, if present in the `bytes`, is not removed.
    ///
    /// Returns an error in case of malformed sequences in the `bytes`.
    pub fn decode<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> {
        decode(bytes, self.encoding)
    }

    /// Decodes a slice using the reader encoding, removing the BOM if it is
    /// present in the `bytes`.
    ///
    /// If this method is called after reading an XML declaration with the
    /// `"encoding"` key, then that encoding is used, otherwise UTF-8 is used.
    ///
    /// If the XML declaration is absent, UTF-8 is used.
    ///
    /// Returns an error in case of malformed sequences in the `bytes`.
    pub fn decode_with_bom_removal<'b>(&self, bytes: &'b [u8]) -> Result<Cow<'b, str>> {
        self.decode(remove_bom(bytes, self.encoding))
    }
}

/// Decodes the provided bytes using the specified encoding without any BOM
/// handling: a BOM, if present in the `bytes`, is not removed.
///
/// Returns an error in case of malformed sequences in the `bytes`.
#[cfg(feature = "encoding")]
pub fn decode<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> Result<Cow<'b, str>> {
    encoding
        .decode_without_bom_handling_and_without_replacement(bytes)
        .ok_or(Error::NonDecodable(None))
}

/// Decodes a slice with an unknown encoding, removing the BOM if it is present
/// in the `bytes`. If no encoding can be detected, UTF-8 is assumed.
///
/// Returns an error in case of malformed sequences in the `bytes`.
#[cfg(feature = "encoding")]
pub fn decode_with_bom_removal<'b>(bytes: &'b [u8]) -> Result<Cow<'b, str>> {
    if let Some(encoding) = detect_encoding(bytes) {
        let bytes = remove_bom(bytes, encoding);
        decode(bytes, encoding)
    } else {
        decode(bytes, UTF_8)
    }
}

#[cfg(feature = "encoding")]
fn split_at_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> (&'b [u8], &'b [u8]) {
Collaborator

This seems redundant, because the first part is not used anywhere. Why isn't it enough to just cut off the beginning, as before?

Collaborator Author

@dralley dralley Jul 24, 2022

You're right, but I was thinking that A) we may want to add some variants that return the BOM the same way that we provide StartText, and B) at the very least it would be useful for testing.

On the other hand, I kinda feel like both StartText and returning the BOM have limited utility in practice. But it feels like an open question.

I'll leave it as-is for now, but I wouldn't be upset if we end up removing it later.

    if encoding == UTF_8 && bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        bytes.split_at(3)
    } else if encoding == UTF_16LE && bytes.starts_with(&[0xFF, 0xFE]) {
        bytes.split_at(2)
    } else if encoding == UTF_16BE && bytes.starts_with(&[0xFE, 0xFF]) {
        bytes.split_at(2)
    } else {
        (&[], bytes)
    }
}

#[cfg(feature = "encoding")]
fn remove_bom<'b>(bytes: &'b [u8], encoding: &'static Encoding) -> &'b [u8] {
    let (_, bytes) = split_at_bom(bytes, encoding);
    bytes
}
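
The `(bom, rest)` split discussed in the review thread above can be tried out without the `encoding` feature. The following is a minimal std-only sketch and not the code from this PR: the `Enc` enum is a hypothetical stand-in for `&'static encoding_rs::Encoding`, and the lookup-then-split structure is a simplification chosen for illustration.

```rust
/// Hypothetical stand-in for `&'static encoding_rs::Encoding`, so that the
/// sketch needs nothing beyond the standard library.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Enc {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Splits `bytes` into `(bom, rest)`. `bom` is empty when `bytes` does not
/// start with the byte order mark that belongs to `enc`.
fn split_at_bom(bytes: &[u8], enc: Enc) -> (&[u8], &[u8]) {
    // The BOM is U+FEFF serialized in the respective encoding.
    let bom: &[u8] = match enc {
        Enc::Utf8 => &[0xEF, 0xBB, 0xBF],
        Enc::Utf16Le => &[0xFF, 0xFE],
        Enc::Utf16Be => &[0xFE, 0xFF],
    };
    if bytes.starts_with(bom) {
        bytes.split_at(bom.len())
    } else {
        (&[], bytes)
    }
}

fn main() {
    // UTF-8 input with a BOM: the BOM is returned separately.
    let (bom, rest) = split_at_bom(b"\xEF\xBB\xBF<root/>", Enc::Utf8);
    assert_eq!(bom, b"\xEF\xBB\xBF");
    assert_eq!(rest, b"<root/>");

    // A BOM of a different encoding is not split off.
    let (bom, rest) = split_at_bom(b"\xFF\xFE<\x00", Enc::Utf8);
    assert!(bom.is_empty());
    assert_eq!(rest, b"\xFF\xFE<\x00");
}
```

Returning both halves keeps the BOM available to callers and tests, which is the trade-off raised in the thread; a helper that only needs the remainder can discard the first element, as `remove_bom` does.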

/// Automatic encoding detection of XML files based on the
/// [recommended algorithm](https://www.w3.org/TR/xml11/#sec-guessing).
///
/// If an encoding is detected, `Some` is returned, otherwise `None` is returned.
///
/// Because the [`encoding_rs`] crate supports only a subset of those encodings, only
/// the supported subset is detected: UTF-8, UTF-16 BE and UTF-16 LE.
///
/// The algorithm suggests examining up to the first 4 bytes to determine the encoding
/// according to the following table:
///
/// | Bytes |Detected encoding
/// |-------------|------------------------------------------
/// |`FE FF ## ##`|UTF-16, big-endian
/// |`FF FE ## ##`|UTF-16, little-endian
/// |`EF BB BF` |UTF-8
/// |-------------|------------------------------------------
/// |`00 3C 00 3F`|UTF-16 BE or ISO-10646-UCS-2 BE or similar 16-bit BE (use declared encoding to find the exact one)
/// |`3C 00 3F 00`|UTF-16 LE or ISO-10646-UCS-2 LE or similar 16-bit LE (use declared encoding to find the exact one)
/// |`3C 3F 78 6D`|UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably
#[cfg(feature = "encoding")]
pub fn detect_encoding(bytes: &[u8]) -> Option<&'static Encoding> {
    match bytes {
        // with BOM
        _ if bytes.starts_with(&[0xFE, 0xFF]) => Some(UTF_16BE),
        _ if bytes.starts_with(&[0xFF, 0xFE]) => Some(UTF_16LE),
        _ if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) => Some(UTF_8),

        // without BOM
        _ if bytes.starts_with(&[0x00, b'<', 0x00, b'?']) => Some(UTF_16BE), // Some BE encoding, for example, UTF-16 or ISO-10646-UCS-2
        _ if bytes.starts_with(&[b'<', 0x00, b'?', 0x00]) => Some(UTF_16LE), // Some LE encoding, for example, UTF-16 or ISO-10646-UCS-2
        _ if bytes.starts_with(&[b'<', b'?', b'x', b'm']) => Some(UTF_8), // Some ASCII compatible

        _ => None,
    }
}

// TODO: add some tests for functions
Collaborator

Would you mind adding tests before the merge?

Collaborator Author

I considered it, but I figured it will be easier to do a big testing push at the end. We've only got a few sample documents, and entering the data manually would be painful.

Collaborator Author

@dralley dralley Jul 24, 2022

Of course, it needs to happen eventually - it would just be helpful to have the full picture of how encoding works together in mind while doing that work.
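
As a possible starting point for the tests discussed in this thread, the detection table can be exercised with a few hand-written byte prefixes instead of full sample documents. This is only a sketch under stated assumptions: `Detected` and `detect` are hypothetical std-only stand-ins that mirror the table of `detect_encoding()`, so the example runs without `encoding_rs`. They are not part of the PR.

```rust
/// Hypothetical stand-in for the `encoding_rs` statics that the real
/// `detect_encoding()` returns.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Detected {
    Utf8,
    Utf16Be,
    Utf16Le,
}

/// Std-only mirror of the detection table: BOMs are checked first, then the
/// byte patterns that the start of an XML declaration has in each family.
fn detect(bytes: &[u8]) -> Option<Detected> {
    match bytes {
        // with BOM
        [0xFE, 0xFF, ..] => Some(Detected::Utf16Be),
        [0xFF, 0xFE, ..] => Some(Detected::Utf16Le),
        [0xEF, 0xBB, 0xBF, ..] => Some(Detected::Utf8),
        // without BOM
        [0x00, b'<', 0x00, b'?', ..] => Some(Detected::Utf16Be),
        [b'<', 0x00, b'?', 0x00, ..] => Some(Detected::Utf16Le),
        [b'<', b'?', b'x', b'm', ..] => Some(Detected::Utf8),
        _ => None,
    }
}

fn main() {
    // One case per row of the table.
    assert_eq!(detect(b"\xFE\xFF\x00<"), Some(Detected::Utf16Be));
    assert_eq!(detect(b"\xFF\xFE<\x00"), Some(Detected::Utf16Le));
    assert_eq!(detect(b"\xEF\xBB\xBF<a/>"), Some(Detected::Utf8));
    assert_eq!(detect(b"\x00<\x00?"), Some(Detected::Utf16Be));
    assert_eq!(detect(b"<\x00?\x00"), Some(Detected::Utf16Le));
    assert_eq!(detect(b"<?xml version='1.0'?>"), Some(Detected::Utf8));

    // Inputs that match no row: no declaration, a truncated BOM, empty input.
    assert_eq!(detect(b"<root/>"), None);
    assert_eq!(detect(b"\xEF\xBB"), None);
    assert_eq!(detect(b""), None);
}
```

Byte-string literals keep each case to one line, which sidesteps the manual data entry mentioned above. The same cases could be pointed at the real `detect_encoding()` by comparing against `Some(UTF_8)`, `Some(UTF_16BE)` and `Some(UTF_16LE)`.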

2 changes: 1 addition & 1 deletion src/events/mod.rs
@@ -43,10 +43,10 @@ use std::fmt::{self, Debug, Formatter};
use std::ops::Deref;
use std::str::from_utf8;

+use crate::encoding::Decoder;
use crate::errors::{Error, Result};
use crate::escape::{escape, partial_escape, unescape_with};
use crate::name::{LocalName, QName};
-use crate::reader::Decoder;
use crate::utils::write_cow_string;
use attributes::{Attribute, Attributes};

4 changes: 3 additions & 1 deletion src/lib.rs
@@ -44,6 +44,7 @@

#[cfg(feature = "serialize")]
pub mod de;
+pub mod encoding;
mod errors;
mod escapei;
pub mod escape {
@@ -62,8 +63,9 @@ pub mod utils;
mod writer;

// reexports
+pub use crate::encoding::Decoder;
#[cfg(feature = "serialize")]
pub use crate::errors::serialize::DeError;
pub use crate::errors::{Error, Result};
-pub use crate::reader::{Decoder, NsReader, Reader};
+pub use crate::reader::{NsReader, Reader};
pub use crate::writer::{ElementWriter, Writer};
4 changes: 2 additions & 2 deletions src/reader/buffered_reader.rs
@@ -5,13 +5,13 @@ use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

use memchr;

use crate::errors::{Error, Result};
use crate::events::Event;
use crate::name::QName;
use crate::reader::{is_whitespace, BangType, ReadElementState, Reader, XmlSource};

use memchr;

/// This is an implementation of [`Reader`] for reading from a [`BufRead`] as
/// underlying byte stream.
impl<R: BufRead> Reader<R> {