Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix errors in sequence deserialization #387

Merged
merged 6 commits into from Jun 5, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
4 changes: 4 additions & 0 deletions .github/workflows/rust.yml
Expand Up @@ -40,6 +40,10 @@ jobs:
env:
LLVM_PROFILE_FILE: coverage/serialize-escape-html-%p-%m.profraw
run: cargo test --features serialize,escape-html
- name: Run tests (all features)
env:
LLVM_PROFILE_FILE: coverage/all-features-%p-%m.profraw
run: cargo test --all-features
Copy link
Collaborator

@dralley dralley Jun 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this should cover testing both with and without overlapped-lists which is a concern because of all the alternate code 👍

Copy link
Collaborator Author

@Mingun Mingun Jun 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is why I've added this.

- name: Prepare coverage information for upload
if: runner.os == 'Linux'
# --token is required by grcov, but not required by coveralls.io, so pass
Expand Down
46 changes: 46 additions & 0 deletions Cargo.toml
Expand Up @@ -42,6 +42,52 @@ default = []
## [standard compliant]: https://www.w3.org/TR/xml11/#charencoding
encoding = ["encoding_rs"]

## This feature enables support for deserializing lists where tags are overlapped
## with tags that do not correspond to the list.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious about the use case, is this basically for situations where you're parsing messy handwritten XML / HTML? Generally I would think machine-generated XML wouldn't have this problem.

It might be worth providing that as an example of a situation where you might want to enable it if that is the case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I cannot remember that would I worked with such XMLs, but all parsers, that I know, very tolerant to overlapped lists (even in xs:sequence, although XSD requires exact order of tags. I never seen using xs:all that allows arbitrary order). So I thought there should be an option to allow this behavior

Copy link

@k-bx k-bx Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dralley I'm trying to parse KML file (from Google Earth Pro) which can have elements of types Folder, Document and Point interleaved in a single list. I'd like to ideally get them in the same order, so I tried to create an enum covering all three cases, but it didn't work. So this at least gives me an option to parse it as three lists without the preservation of order in-between the items.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@k-bx, if you case still didn't work, could you describe it in more detail? That is what that definitely should work

Copy link

@k-bx k-bx Oct 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mingun it works good enough, but not perfectly. The desired way to handle things is:

input: <TagOne></TagOne><TagTwo></TagTwo><TagOne></TagOne>

parsed as:

struct TagOne {};
struct TagTwo {};
enum Tag {
  TagOne(TagOne),
  TagTwo(TagTwo)
}

-- result should be
vec![Tag::TagOne(TagOne {}), Tag::TagTwo(TagTwo {}), Tag::TagOne(TagOne {})]

What you get is a perfect preservation of the order in which you get your mixed tags.

I didn't see a way to do something like this via the current lib ^

Do I need to create a separate issue to make something like this possible?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is an error, I filled #500 for it

##
## When this feature is enabled, the XML:
## ```xml
## <any-name>
## <item/>
## <another-item/>
## <item/>
## <item/>
## </any-name>
## ```
## could be deserialized to a struct:
## ```ignore
## #[derive(Deserialize)]
## #[serde(rename_all = "kebab-case")]
## struct AnyName {
## item: Vec<()>,
## another_item: (),
## }
## ```
##
## When this feature is not enabled (default), only the first element will be
## associated with the field, and the deserialized type will report an error
## (duplicated field) when the deserializer encounters a second `<item/>`.
##
## Note, that enabling this feature can lead to high and even unlimited memory
## consumption, because deserializer should check all events up to the end of a
## container tag (`</any-name>` in that example) to figure out that there are no
## more items for a field. If `</any-name>` or even EOF is not encountered, the
## parsing will never end which can lead to a denial-of-service (DoS) scenario.
##
## Having several lists and overlapped elements for them in XML could also lead
## to quadratic parsing time, because the deserializer must check the list of
## events as many times as the number of sequence fields present in the schema.
##
## To reduce negative consequences, always [limit] the maximum number of events
## that [`Deserializer`] will buffer.
##
## This feature works only with `serialize` feature and has no effect if `serialize`
## is not enabled.
##
## [limit]: crate::de::Deserializer::event_buffer_size
## [`Deserializer`]: crate::de::Deserializer
overlapped-lists = []

## Enables support for [`serde`] serialization and deserialization
serialize = ["serde"]

Expand Down
11 changes: 11 additions & 0 deletions Changelog.md
Expand Up @@ -10,10 +10,19 @@

## Unreleased

### New Features

- [#387]: Allow overlapping between elements of sequence and other elements
(using new feature `overlapped-lists`)

### Bug Fixes

- [#9]: Deserialization erroneously was successful in some cases where error is expected.
This broke deserialization of untagged enums which rely on error if variant cannot be parsed
- [#387]: Allow to have an ordinary elements together with a `$value` field
- [#387]: Internal deserializer state can be broken when deserializing a map with
a sequence field (such as `Vec<T>`), where elements of this sequence contains
another sequence. This error affects only users with the `serialize` feature enabled

### Misc Changes

Expand All @@ -36,9 +45,11 @@
### New Tests

- [#9]: Added tests for incorrect nested tags in input
- [#387]: Added a bunch of tests for sequences deserialization

[#8]: https://github.com/Mingun/fast-xml/pull/8
[#9]: https://github.com/Mingun/fast-xml/pull/9
[#387]: https://github.com/tafia/quick-xml/pull/387
[#391]: https://github.com/tafia/quick-xml/pull/391

## 0.23.0 -- 2022-05-08
Expand Down