Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problems with reading XML from buffering reader #469

Closed
aegoroff opened this issue Sep 1, 2022 · 9 comments · Fixed by #471
Closed

Problems with reading XML from buffering reader #469

aegoroff opened this issue Sep 1, 2022 · 9 comments · Fixed by #471

Comments

@aegoroff
Copy link

aegoroff commented Sep 1, 2022

Here is an example

fn main() {
    use quick_xml::events::Event;
    use quick_xml::reader::Reader;
    use std::io::BufReader;

    let xml = r###"<torrents>
    <torrent id="2" registred_at="2009.08.19 00:13:00" size="3">
    <title>Заголовок</title>
    <torrent hash="7096DB05CC2612B30079B26C94823DD8CA2A8156" tracker_id="3"/>
    <forum id="1"><![CDATA[<fo>]]></forum>
    <del/>
    <content><![CDATA[
        some content
    ]]></content>
    </torrent>
    </torrents>"###;

    let br = BufReader::with_capacity(32, xml.as_bytes());
    let mut reader = Reader::from_reader(br);
    reader.trim_text(true);

    let mut buf = Vec::new();
    let mut read_title = false;
    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(e)) if e.name().as_ref() == b"title" => {
                read_title = true;
            }
            Ok(Event::Text(e)) => {
                if read_title {
                    read_title = false;
                    let escaped = &e.into_inner();
                    let text = reader.decoder().decode(escaped).unwrap_or_default();
                    let title = String::from(text);
                    print!("{title}");
                }
            }
            Ok(Event::Eof) => break, // exits the loop when reaching end of file
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            _ => (), // There are several other `Event`s we do not consider here
        }
        buf.clear();
    }
}

when you run it you will crash with something like this

    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/qxerr`
thread 'main' panicked at 'Error at position 299: EndEventMismatch { expected: "forum", found: "content" }', src/main.rs:39:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Заголовок%         

if buffer capacity increase to 512 (see let br = BufReader::with_capacity(32, xml.as_bytes());) the problem will gone. Replacing non Latin1 chars to latin1 ones will also solves the problem.

ADDITIONAL INFO

It was not a problem before v0.23, so v0.22 parsed such files correctly

@Mingun
Copy link
Collaborator

Mingun commented Sep 2, 2022

The problem in that first CDATA section was extended to the end of the last CDATA section:

running 1 test
test issue469 ... FAILED

failures:

---- issue469 stdout ----
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Start(
        BytesStart { buf: Borrowed("torrents"), name_len: 8 },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Start(
        BytesStart { buf: Borrowed("torrent id=\"2\" registred_at=\"2009.08.19 00:13:00\" size=\"3\""), name_len: 7 },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Start(
        BytesStart { buf: Borrowed("title"), name_len: 5 },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Text(
        BytesText { content: Borrowed("0xD00x970xD00xB00xD00xB30xD00xBE0xD00xBB0xD00xBE0xD00xB20xD00xBE0xD00xBA") },
    ),
)
Заголовок
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    End(
        BytesEnd { name: Borrowed("title") },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Empty(
        BytesStart { buf: Borrowed("torrent hash=\"7096DB05CC2612B30079B26C94823DD8CA2A8156\" tracker_id=\"3\""), name_len: 7 },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Start(
        BytesStart { buf: Borrowed("forum id=\"1\""), name_len: 5 },
    ),
)
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "[CDATA[<fo>]"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "[CDATA[<fo"
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "]></forum>0xA    <del/>0xA    <conte"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "]"
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "]></forum>0xA    <del/>0xA    <conte"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "]></forum"
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "]></forum>0xA    <del/>0xA    <conte"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "]></forum>0xA    <del/"
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "nt><![CDATA[0xA        some conten"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "nt"
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "t0xA    ]]></content>0xA    </torren"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "t0xA    ]]"
[src/reader/parser.rs:108] crate::utils::Bytes(&buf) = "![CDATA[<fo>]]></forum>0xA    <del/>0xA    <content><![CDATA[0xA        some content0xA    "
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    CData(
        BytesCData { content: Borrowed("<fo>]]></forum>0xA    <del/>0xA    <content><![CDATA[0xA        some content0xA    ") },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Err(
    EndEventMismatch {
        expected: "forum",
        found: "content",
    },
)
thread 'issue469' panicked at 'Error at position 299: EndEventMismatch { expected: "forum", found: "content" }', tests/xmlrs_reader_tests.rs:560:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The problem does not related to non-latin1 characters, but probabilistic. With Russian characters the first CDATA end sequence ]]> was splitted in the middle, and that leaded to that this check evaluated to false:

quick-xml/src/reader/mod.rs

Lines 761 to 765 in 6bedf6c

Self::CData => {
if chunk[..i].ends_with(b"]]") {
return Some((&chunk[..i - 2], i + 1)); // +1 for `>`
}
}

@Mingun Mingun added bug help wanted encoding Issues related to support of various encodings of the XML documents and removed encoding Issues related to support of various encodings of the XML documents labels Sep 2, 2022
@aegoroff
Copy link
Author

aegoroff commented Sep 2, 2022

let me guess the problem is in buffering reader. When XML with russian or other two or more bytes chars fit into buffer everything is fine but if such XML are spread between several chunks the problem occurs. Maybe there is an assume that char is always has one byte size but it's not true in case of UTF-8 encoding

@Mingun
Copy link
Collaborator

Mingun commented Sep 3, 2022

Yes, the problem in buffering reader only, but no, it exists also for one-byte characters. If you replace "Заголовок" (18 bytes in UTF-8) with the same number of ASCII chars (18 spaces for example), you will face with the same problem.

For all ASCII compatible encodings that supported by quick-xml it is not possible to split string inside multi-byte character, because we only split at boundaries of some ASCII characters, and their byte values never used as second byte in multi-byte encodings (actually, all supported encodings are 1-2-byte encodings). That is all encodings, except UTF-16BE, UTF-16LE, and ISO-2022-JP.

@harscoet
Copy link

harscoet commented Sep 5, 2022

I meet the same issue, is there any workaround or I need to downgrade the crate to 0.22?

@Mingun
Copy link
Collaborator

Mingun commented Sep 5, 2022

A possible workaround is to read into buffer and use borrowing version (Reader::read_event()).

Also you can help with working on a patch. If no one will provide one, I'll fix it myself on next weekend and release 0.25 with fix. In the fix I expect tests for all events (most of them affected) for both sync and async readers. The start point for a test:

#[test]
fn issue469() {
  let xml = "<![CDATA[1]]>";
  //         ^^^^^^^^^^^^ data that fit into buffer
  let size = xml.match_indices("]]").next().unwrap().0 + 2;
  let br = BufReader::with_capacity(size, xml.as_bytes());
  let mut reader = Reader::from_reader(br);
  let mut buf = Vec::new();

  assert_eq!(
    reader.read_event_into(&mut buf).unwrap(),
    Event::CData(BytesCData::new("1"))
  );
  assert_eq!(
    reader.read_event_into(&mut buf).unwrap(),
    Event::Eof
  );
}

The fix should be simple -- need to check the last one or two bytes in the buffer before consume it. If it is ]], --, ], -, ? or / do not consume that byte(s).

@Mingun Mingun changed the title Problems with reading XML that contains non Latin1 UTF-8 chars using BufReader Problems with reading XML from buffering reader Sep 5, 2022
@aegoroff
Copy link
Author

aegoroff commented Sep 5, 2022

A possible workaround is to read into buffer and use borrowing version (Reader::read_event()).

i would do so but i cant because my xml is about 24Gb :)

@Mingun
Copy link
Collaborator

Mingun commented Sep 5, 2022

Probably you could use memory-mapped files to pretend that XML is in memory

@aegoroff
Copy link
Author

aegoroff commented Sep 5, 2022

Probably you could use memory-mapped files to pretend that XML is in memory

this cannot be done too - xml content unzipped on the fly from xz file and passed as stream to xml parser (using stdin)

@Mingun
Copy link
Collaborator

Mingun commented Sep 5, 2022

Well, I'm sure, that as exercise it could be possible to unzip to memory, map memory and parse from it, but probably that bug will be fixed before that Goldberg machine will work :)

Mingun added a commit to Mingun/quick-xml that referenced this issue Sep 6, 2022
failures (8):
    reader::async_tokio::test::small_buffers::cdata1
    reader::async_tokio::test::small_buffers::cdata2
    reader::async_tokio::test::small_buffers::comment1
    reader::async_tokio::test::small_buffers::comment2
    reader::buffered_reader::test::small_buffers::cdata1
    reader::buffered_reader::test::small_buffers::cdata2
    reader::buffered_reader::test::small_buffers::comment1
    reader::buffered_reader::test::small_buffers::comment2
Mingun added a commit that referenced this issue Sep 10, 2022
Fix #469 - parsing from buffered reader
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants