Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A way to get the row/column of a Reader #109

Open
jonas-schievink opened this issue Jan 23, 2018 · 17 comments · May be fixed by #552
Open

A way to get the row/column of a Reader #109

jonas-schievink opened this issue Jan 23, 2018 · 17 comments · May be fixed by #552

Comments

@jonas-schievink
Copy link
Contributor

xml-rs provides TextPosition for this, while quick-xml only has Reader::buffer_position() to get the byte position. It would be useful for error reporting if there was a way to compute the row and column from this (preferably without overhead if unused).

@jonas-schievink
Copy link
Contributor Author

This could be implemented somewhere along these lines:

impl<B: Seek + BufRead> Reader<B> {
    fn row_col(&self) -> (usize, usize) {
        // seek to 0, count lines until `buffer_position` is reached
    }
}

(add a proper return type and a better name to taste)

Of course, this is pretty slow as it re-reads the entire content.

@tafia
Copy link
Owner

tafia commented Jan 30, 2018

This looks like a good solution.
With probably a function renaming to show that this is very expensive.

@tafia
Copy link
Owner

tafia commented Mar 1, 2018

Out of curiousity, @jonas-schievink do you plan on doing a PR?

@jonas-schievink
Copy link
Contributor Author

Not right now, no. If anyone wants to implement this, feel free!

@DanielKeep
Copy link

I implemented a PositionRead wrapper type that builds a table for mapping byte offsets to line:column positions transparently, then added some extra methods to Reader for doing the lookup. That way, you don't pay anything if you don't use it.

That works fine; the problem is that buffer_position (which I was using to do the lookup) is rubbish for this. Depending on the exact state of the parser, it can end up pointing to before the event, at the start of the event, or even inside of it. That's to say nothing of errors where you just don't have a byte offset to do lookup on.

As such, I've had to abandon quick-xml for my current project. I need to be able to report back to users on where problems are, and I don't have time to do open-heart surgery on quick-xml to improve offset tracking.

I've uploaded my work-in-progress if anyone has a use for it.

@tafia
Copy link
Owner

tafia commented Mar 16, 2018

Thanks for the report.
The buffer_position may indeed need some rework. Do you have some particular tests that are failing at hand?

@DanielKeep
Copy link

test_multiline_pos in tests/unit_tests.rs was where I more or less threw up my hands. Before the second event, the Reader reports a buffer position of 10, which puts it after the opening < of the <?xml ...

The test_start_end_comment_pos test also has a fudge in it, where the position of the third event should be 1:27, but gets reported as 1:26 because the parser is still at the start of the whitespace before the <a /> element.

At a bare minimum, I think Reader would have to start tracking "last event position" separately from buffer (parsing) position. Maybe also use it for the position of errors. To do it properly, it would also have to start attaching correct positions to things that don't get distinct events (i.e. if there's an error in an attribute value, where is it?), but at that point, we're talking possibly significant API changes.

@tafia
Copy link
Owner

tafia commented Mar 16, 2018

Great. Thank you

@tafia
Copy link
Owner

tafia commented Mar 22, 2018

@DanielKeep I have created #123 which fixes an offset by one for buffer_position. On your branch, the 2 basic pos tests are passing but not the multiline_pos. I have tried to fix it but I think the errors come either from your test or from your changes.

I understand that you may not have time to spend on it but in case, I'll let it open for the moment and merge it later.

Thanks again for your feedback!

EDIT: All your comments about a proper way to manage buffer position are still valid. This is just a fix for the current situation. I'll try to find a way to explicitly manage positions (probably before managing row/columns).

@bbigras
Copy link

bbigras commented Oct 4, 2018

Any progress on this?

@Themayu
Copy link

Themayu commented Feb 11, 2023

I've begun work on something similar, which can possibly be used as a building block for this - adding Span objects to the different quick_xml::event::Bytes* structs to get where they are located in the underlying data source, if applicable. This might be usable as a base to compute the line and column of that event from.

@dralley
Copy link
Collaborator

dralley commented Feb 11, 2023

That sounds like a decent improvement. Separately on the question of calculating a line + column, it may not be terribly expensive to do SIMD-accelerated passes over the buffer continuously to keep track of how many lines have been seen previously. That would enable calculating the line + column without needing to "Seek" or have the entire document buffered, as you could start counting from the beginning of the current buffer and add it to what has been seen previously.

One issue to look out for is encodings. Byte positions may not perfectly match up depending on how they are calculated in relation to decoding of the contents,

@Mingun
Copy link
Collaborator

Mingun commented Feb 11, 2023

Actually, I have a prototype implementation for several mouths in my private branch which I've pushed right now. I've used approach with a feature that enables storing Spans in events. I would like to benchmark it to see if feature flag is needed at all. Also more tests needed and implementation for attributes.

Byte positions may not perfectly match up depending on how they are calculated in relation to decoding of the contents,

I do not think that this will be a problem because although we work with bytes, we ensure that our boundaries are correct (=not within a character). I think, that users of spans in any case will expect that this numbers are indexes in the original byte stream.

@Mingun Mingun linked a pull request Feb 11, 2023 that will close this issue
22 tasks
@Themayu
Copy link

Themayu commented Feb 11, 2023

Actually, I have a prototype implementation for several mouths in my private branch which I've pushed right now. I've used approach with a feature that enables storing Spans in events. I would like to benchmark it to see if feature flag is needed at all. Also more tests needed and implementation for attributes.

To clarify, does this mean that my work in #552 is unnecessary? Or would you like me to continue with it?

@Mingun
Copy link
Collaborator

Mingun commented Feb 11, 2023

Yes, as I can see, the parts that you've done in #552 I already implemented and I have feelings that they are worked right (but I didn't create tests for that yet, that's the one reason why I didn't make a PR).

I think, it would be better to use my approach, namely that parts:

  • span is used in PartialEq implementation but ignored in tests where it is not important
  • Option not needed, span 0..0 is used instead (this should decrease size of span field, and because we can expect that there would be many spans it is better to use as little space as we can)

You are free to finish my work if you like (you can rebase your branch over mine and update your PR if you wish). I'm not ready to finish it myself for now.

I do not insist on using feature-flag, I introduced it because spans can occupy some space and if you does not use them it is just waste of memory, but maybe the savings are not so great.

So the things that's should be implemented:

  • spans in attributes
  • spans in errors
  • tests
  • benchmark it to see if we need to hide it under feature-flag

@Themayu
Copy link

Themayu commented Feb 11, 2023

(you can rebase your branch over mine and update your PR if you wish).

How would I go about rebasing a branch on my fork over a branch on a different fork? Sorry, I'm not entirely well-versed in git usage; I've sort of been learning it as I go along.

  • span is used in PartialEq implementation but ignored in tests where it is not important

The primary reason for ignoring span in PartialEq was so that the addition of spans to the crate wouldn't break previously-working comparisons between Bytes* instances that came from different locations, such as comparing one from a reader with one constructed via new(), in transitive dependencies (if used in the dependent crate / other transitive dependencies) or even the dependent crate (if used in transitive dependencies). That it also made the tests work was a happy side-effect, but not the main intention.

@Mingun
Copy link
Collaborator

Mingun commented Feb 11, 2023

If you on Windows, then you're probably using TortoiseGit, then just use RMB-click on my commit in commit log and select Rebase "parser_position_tracking" onto this...(G). The other GUI clients should have the similar ways of doing that. If your prefer command-line, then I think you already known how to find required information.

wouldn't break previously-working comparisons between Bytes* instances that came from different locations,

I cannot imagine a reason why someone would be need to compare two events to compare their content or at all. I think we should not worry about that. PartialEq implemented mostly for our own purposes for tests. Of course, if you known the reason...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants