-
Notifications
You must be signed in to change notification settings - Fork 247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CDATA deserialization #311
Comments
Waiting for tafia/quick-xml#311 Fixes: #230
From #244:
That's good API design but bad for performance, maybe we need something like |
This is quite unintuitive behaviour. Is there a quick workaround for this?
|
Btw this works with the current API: use quick_xml::{events::Event, Reader, Writer};
use std::{
fs,
io::{self, Cursor},
};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let file = fs::read_to_string("foo.xml")?;
let mut r = Reader::from_str(&path);
let mut w = Writer::new(Cursor::new(Vec::new()));
let mut buf = Vec::<u8>::new();
// Loop over the xml tags
loop {
match r.read_event(&mut buf) {
// Special case for CDATA
Ok(Event::CData(e)) => {
w.write(format!(
"<![CDATA[{}]]>\n",
str::from_utf8(&e.unescaped()?.into_owned())?
).as_bytes())?;
}
Ok(Event::Eof) => break,
Ok(e) => w.write_event(e)?,
Err(e) => panic!("Error at position {}: {:?}", r.buffer_position(), e),
}
buf.clear();
}
fs::write("foo.xml", w.into_inner().into_inner())?;
Ok(())
} |
Hello. I'm wonder how can i get actual text from cdata inside tag? |
You have a string representation of one XML inside other XML -- so you need to parse |
Thanks. It's worked, but Ok(Event::CData(t)) => {
s = t.unescape_and_decode(&r).unwrap();
if !s.is_empty() {
let mut ast = Vec::<usize>::new();
let tmp: &str = s.trim_end().trim_end_matches(|c| c == '*' || c == ',');
if let Some(i) = tmp.find('*') {
ast.push(i)
}
s = tmp.to_string();
if ast.len() > 0 {
for idx in &ast {
s.remove(*idx);
s.remove(*idx + 1);
}
}
return s;
}
} |
…tead of `BytesText` This commit revert changes from 85f9f68
…tead of `BytesText` This commit revert changes from 85f9f68
…tead of `BytesText` This commit revert changes from 85f9f68
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [quick-xml](https://github.com/tafia/quick-xml) | dependencies | patch | `0.23.0-alpha3` -> `0.23.0` | | [quick-xml](https://github.com/tafia/quick-xml) | dependencies | minor | `0.22.0` -> `0.23.0` | --- ### Release Notes <details> <summary>tafia/quick-xml</summary> ### [`v0.23.0`](https://github.com/tafia/quick-xml/blob/HEAD/Changelog.md#​0230----2022-05-08) - feat: add support for `i128` / `u128` in attributes or text/CDATA content - test: add tests for malformed inputs for serde deserializer - fix: allow to deserialize `unit`s from any data in attribute values and text nodes - refactor: unify errors when EOF encountered during serde deserialization - test: ensure that after deserializing all XML was consumed - feat: add `Deserializer::from_str`, `Deserializer::from_slice` and `Deserializer::from_reader` - refactor: deprecate `from_bytes` and `Deserializer::from_borrowing_reader` because they are fully equivalent to `from_slice` and `Deserializer::new` - refactor: reduce number of unnecessary copies when deserialize numbers/booleans/identifiers from the attribute and element names and attribute values - fix: allow to deserialize `unit`s from text and CDATA content. `DeError::InvalidUnit` variant is removed, because after fix it is no longer used - fix: `ElementWriter`, introduced in [#​274](tafia/quick-xml#274) (0.23.0-alpha2) now available to end users - fix: allow lowercase `<!doctype >` definition (used in HTML 5) when parse document from `&[u8]` - test: add tests for consistence behavior of buffered and borrowed readers - fix: produce consistent error positions in buffered and borrowed readers - feat: `Error::UnexpectedBang` now provide the byte found - refactor: unify code for buffered and borrowed readers - fix: fix internal panic message when parse malformed XML ([#​344](tafia/quick-xml#344)) - test: add tests for trivial documents (empty / only comment / `<root>...</root>` -- one tag with content) - fix: CDATA was not handled in many cases where it should - fix: do not unescape CDATA content because it never escaped by design. CDATA event data now represented by its own `BytesCData` type ([quick-xml#​311](tafia/quick-xml#311)) - feat: add `Reader::get_ref()` and `Reader::get_mut()`, rename `Reader::into_underlying_reader()` to `Reader::into_inner()` - refactor: now `Attributes::next()` returns a new type `AttrError` when attribute parsing failed ([#​4](Mingun/fast-xml#4)) - test: properly test all paths of attributes parsing ([#​4](Mingun/fast-xml#4)) - feat: attribute iterator now implements `FusedIterator` ([#​4](Mingun/fast-xml#4)) - fix: fixed many errors in attribute parsing using iterator, returned from `attributes()` or `html_attributes()` ([#​4](Mingun/fast-xml#4)) </details> --- ### Configuration 📅 **Schedule**: At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about these updates again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, click this checkbox. --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). Co-authored-by: cabr2-bot <cabr2.help@gmail.com> Co-authored-by: crapStone <crapstone01@gmail.com> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1377 Reviewed-by: crapStone <crapstone@noreply.codeberg.org> Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org> Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
It seems that CDATA parsing performed incorrectly. CDATA should contain logically unescaped content (i.e. you don't need to unescape it). So the string
<![CDATA[escaped string]]>
should be parsed into stringescaped string
in contrast to the ordinal text content where the result should beescaped string
.According to the https://www.w3.org/TR/xml/#sec-cdata-sect CDATA sections cannot contain any escape sequences. I've checked several online validators (for example https://jsonformatter.org/xml-parser) and they agree with that.
However, quick-xml has different opinion. This test seems invalid:
quick-xml/tests/unit_tests.rs
Lines 179 to 184 in 516f1ea
As a consequence, all CDATA processing in the
Deserializer
is broken because it is simply handled the same as theText
event, and in several places handling of the CDATA just absent.I preparing a PR with these fixes, but because I'm not working with XML very often, I'm not sure that I'm not wrong. So I open this issue to get feedback.
The text was updated successfully, but these errors were encountered: