Problems with reading XML from buffering reader #469

aegoroff · 2022-09-01T19:18:19Z

Here is an example

fn main() {
    use quick_xml::events::Event;
    use quick_xml::reader::Reader;
    use std::io::BufReader;

    let xml = r###"<torrents>
    <torrent id="2" registred_at="2009.08.19 00:13:00" size="3">
    <title>Заголовок</title>
    <torrent hash="7096DB05CC2612B30079B26C94823DD8CA2A8156" tracker_id="3"/>
    <forum id="1"><![CDATA[<fo>]]></forum>
    <del/>
    <content><![CDATA[
        some content
    ]]></content>
    </torrent>
    </torrents>"###;

    let br = BufReader::with_capacity(32, xml.as_bytes());
    let mut reader = Reader::from_reader(br);
    reader.trim_text(true);

    let mut buf = Vec::new();
    let mut read_title = false;
    loop {
        match reader.read_event_into(&mut buf) {
            Ok(Event::Start(e)) if e.name().as_ref() == b"title" => {
                read_title = true;
            }
            Ok(Event::Text(e)) => {
                if read_title {
                    read_title = false;
                    let escaped = &e.into_inner();
                    let text = reader.decoder().decode(escaped).unwrap_or_default();
                    let title = String::from(text);
                    print!("{title}");
                }
            }
            Ok(Event::Eof) => break, // exits the loop when reaching end of file
            Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
            _ => (), // There are several other `Event`s we do not consider here
        }
        buf.clear();
    }
}

when you run it you will crash with something like this

    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/qxerr`
thread 'main' panicked at 'Error at position 299: EndEventMismatch { expected: "forum", found: "content" }', src/main.rs:39:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Заголовок%

if buffer capacity increase to 512 (see let br = BufReader::with_capacity(32, xml.as_bytes());) the problem will gone. Replacing non Latin1 chars to latin1 ones will also solves the problem.

ADDITIONAL INFO

It was not a problem before v0.23, so v0.22 parsed such files correctly

The text was updated successfully, but these errors were encountered:

Mingun · 2022-09-02T09:09:59Z

The problem in that first CDATA section was extended to the end of the last CDATA section:

running 1 test
test issue469 ... FAILED

failures:

---- issue469 stdout ----
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Start(
        BytesStart { buf: Borrowed("torrents"), name_len: 8 },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Start(
        BytesStart { buf: Borrowed("torrent id=\"2\" registred_at=\"2009.08.19 00:13:00\" size=\"3\""), name_len: 7 },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Start(
        BytesStart { buf: Borrowed("title"), name_len: 5 },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Text(
        BytesText { content: Borrowed("0xD00x970xD00xB00xD00xB30xD00xBE0xD00xBB0xD00xBE0xD00xB20xD00xBE0xD00xBA") },
    ),
)
Заголовок
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    End(
        BytesEnd { name: Borrowed("title") },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Empty(
        BytesStart { buf: Borrowed("torrent hash=\"7096DB05CC2612B30079B26C94823DD8CA2A8156\" tracker_id=\"3\""), name_len: 7 },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    Start(
        BytesStart { buf: Borrowed("forum id=\"1\""), name_len: 5 },
    ),
)
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "[CDATA[<fo>]"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "[CDATA[<fo"
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "]></forum>0xA    <del/>0xA    <conte"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "]"
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "]></forum>0xA    <del/>0xA    <conte"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "]></forum"
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "]></forum>0xA    <del/>0xA    <conte"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "]></forum>0xA    <del/"
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "nt><![CDATA[0xA        some conten"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "nt"
[src/reader/mod.rs:762] crate::utils::Bytes(&chunk) = "t0xA    ]]></content>0xA    </torren"
[src/reader/mod.rs:763] crate::utils::Bytes(&chunk[..i]) = "t0xA    ]]"
[src/reader/parser.rs:108] crate::utils::Bytes(&buf) = "![CDATA[<fo>]]></forum>0xA    <del/>0xA    <content><![CDATA[0xA        some content0xA    "
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Ok(
    CData(
        BytesCData { content: Borrowed("<fo>]]></forum>0xA    <del/>0xA    <content><![CDATA[0xA        some content0xA    ") },
    ),
)
[tests/xmlrs_reader_tests.rs:546] reader.read_event_into(&mut buf) = Err(
    EndEventMismatch {
        expected: "forum",
        found: "content",
    },
)
thread 'issue469' panicked at 'Error at position 299: EndEventMismatch { expected: "forum", found: "content" }', tests/xmlrs_reader_tests.rs:560:23
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The problem does not related to non-latin1 characters, but probabilistic. With Russian characters the first CDATA end sequence ]]> was splitted in the middle, and that leaded to that this check evaluated to false:

quick-xml/src/reader/mod.rs

Lines 761 to 765 in 6bedf6c

    
           Self::CData => { 
        
               if chunk[..i].ends_with(b"]]") { 
        
                   return Some((&chunk[..i - 2], i + 1)); // +1 for `>` 
        
               } 
        
           }

aegoroff · 2022-09-02T21:32:08Z

let me guess the problem is in buffering reader. When XML with russian or other two or more bytes chars fit into buffer everything is fine but if such XML are spread between several chunks the problem occurs. Maybe there is an assume that char is always has one byte size but it's not true in case of UTF-8 encoding

Mingun · 2022-09-03T11:02:57Z

Yes, the problem in buffering reader only, but no, it exists also for one-byte characters. If you replace "Заголовок" (18 bytes in UTF-8) with the same number of ASCII chars (18 spaces for example), you will face with the same problem.

For all ASCII compatible encodings that supported by quick-xml it is not possible to split string inside multi-byte character, because we only split at boundaries of some ASCII characters, and their byte values never used as second byte in multi-byte encodings (actually, all supported encodings are 1-2-byte encodings). That is all encodings, except UTF-16BE, UTF-16LE, and ISO-2022-JP.

harscoet · 2022-09-05T10:27:14Z

I meet the same issue, is there any workaround or I need to downgrade the crate to 0.22?

Mingun · 2022-09-05T14:28:53Z

A possible workaround is to read into buffer and use borrowing version (Reader::read_event()).

Also you can help with working on a patch. If no one will provide one, I'll fix it myself on next weekend and release 0.25 with fix. In the fix I expect tests for all events (most of them affected) for both sync and async readers. The start point for a test:

#[test]
fn issue469() {
  let xml = "<![CDATA[1]]>";
  //         ^^^^^^^^^^^^ data that fit into buffer
  let size = xml.match_indices("]]").next().unwrap().0 + 2;
  let br = BufReader::with_capacity(size, xml.as_bytes());
  let mut reader = Reader::from_reader(br);
  let mut buf = Vec::new();

  assert_eq!(
    reader.read_event_into(&mut buf).unwrap(),
    Event::CData(BytesCData::new("1"))
  );
  assert_eq!(
    reader.read_event_into(&mut buf).unwrap(),
    Event::Eof
  );
}

The fix should be simple -- need to check the last one or two bytes in the buffer before consume it. If it is ]], --, ], -, ? or / do not consume that byte(s).

aegoroff · 2022-09-05T15:04:39Z

A possible workaround is to read into buffer and use borrowing version (Reader::read_event()).

i would do so but i cant because my xml is about 24Gb :)

Mingun · 2022-09-05T15:23:11Z

Probably you could use memory-mapped files to pretend that XML is in memory

aegoroff · 2022-09-05T15:32:48Z

Probably you could use memory-mapped files to pretend that XML is in memory

this cannot be done too - xml content unzipped on the fly from xz file and passed as stream to xml parser (using stdin)

Mingun · 2022-09-05T15:42:31Z

Well, I'm sure, that as exercise it could be possible to unzip to memory, map memory and parse from it, but probably that bug will be fixed before that Goldberg machine will work :)

failures (8): reader::async_tokio::test::small_buffers::cdata1 reader::async_tokio::test::small_buffers::cdata2 reader::async_tokio::test::small_buffers::comment1 reader::async_tokio::test::small_buffers::comment2 reader::buffered_reader::test::small_buffers::cdata1 reader::buffered_reader::test::small_buffers::cdata2 reader::buffered_reader::test::small_buffers::comment1 reader::buffered_reader::test::small_buffers::comment2

Fix #469 - parsing from buffered reader

Mingun added bug help wanted encoding Issues related to support of various encodings of the XML documents and removed encoding Issues related to support of various encodings of the XML documents labels Sep 2, 2022

Mingun changed the title ~~Problems with reading XML that contains non Latin1 UTF-8 chars using BufReader~~ Problems with reading XML from buffering reader Sep 5, 2022

Mingun mentioned this issue Sep 6, 2022

Fix #469 - parsing from buffered reader #471

Merged

Mingun closed this as completed in #471 Sep 10, 2022

Mingun added a commit that referenced this issue Sep 10, 2022

Merge pull request #471 from Mingun/fix-buffered-parsing

a10b1c3

Fix #469 - parsing from buffered reader

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Problems with reading XML from buffering reader #469

Problems with reading XML from buffering reader #469

aegoroff commented Sep 1, 2022 •

edited by Mingun

Mingun commented Sep 2, 2022 •

edited

aegoroff commented Sep 2, 2022

Mingun commented Sep 3, 2022 •

edited

harscoet commented Sep 5, 2022

Mingun commented Sep 5, 2022 •

edited

aegoroff commented Sep 5, 2022

Mingun commented Sep 5, 2022

aegoroff commented Sep 5, 2022

Mingun commented Sep 5, 2022 •

edited

Problems with reading XML from buffering reader #469

Problems with reading XML from buffering reader #469

Comments

aegoroff commented Sep 1, 2022 • edited by Mingun

ADDITIONAL INFO

Mingun commented Sep 2, 2022 • edited

aegoroff commented Sep 2, 2022

Mingun commented Sep 3, 2022 • edited

harscoet commented Sep 5, 2022

Mingun commented Sep 5, 2022 • edited

aegoroff commented Sep 5, 2022

Mingun commented Sep 5, 2022

aegoroff commented Sep 5, 2022

Mingun commented Sep 5, 2022 • edited

aegoroff commented Sep 1, 2022 •

edited by Mingun

Mingun commented Sep 2, 2022 •

edited

Mingun commented Sep 3, 2022 •

edited

Mingun commented Sep 5, 2022 •

edited

Mingun commented Sep 5, 2022 •

edited