Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IllegalStateException on Closing Tags #113

Open
Josh-ES opened this issue Jul 15, 2020 · 2 comments
Open

IllegalStateException on Closing Tags #113

Josh-ES opened this issue Jul 15, 2020 · 2 comments
Labels
pr-welcome Issue for which progress most likely if someone submits a Pull Request

Comments

@Josh-ES
Copy link

Josh-ES commented Jul 15, 2020

When parsing an XML feed like the below, with an unbalanced close tag, an error gets thrown at the following line which can be caught, and parsing is able to continue:

throwParseError("Unbalanced close tag </"+name+">; no open start tag.");

<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>CBS Records</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
  <CD>
    <TITLE>Greatest Hits</TITLE>
    <ARTIST>Dolly Parton</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>RCA</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1982</YEAR>
  </CD>
  <CD>
    <TITLE>Still got the blues</TITLE>
    <ARTIST>Gary Moore</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Virgin records</COMPANY>
    <PRICE>10.20</PRICE>
    <YEAR>1990</YEAR>
  </CD>
</CATALOG>

If, however, an XML feed with an unbalanced close tag that has no preceding white space is encountered, a second exception is thrown which cannot be recovered from.

<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD></CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>CBS Records</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
  <CD>
    <TITLE>Greatest Hits</TITLE>
    <ARTIST>Dolly Parton</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>RCA</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1982</YEAR>
  </CD>
  <CD>
    <TITLE>Still got the blues</TITLE>
    <ARTIST>Gary Moore</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Virgin records</COMPANY>
    <PRICE>10.20</PRICE>
    <YEAR>1990</YEAR>
  </CD>
</CATALOG>

What's happening is that if such an unbalanced close tag error is caught, and then reader.next() is called subsequently, the stream reader's current event type continues to be set to whatever the previous event type was. If an unbalanced close tag was preceded by some whitespace characters, the current event type is set to characters i.e. mCurrToken == XMLEvent.CHARACTERS. Calling reader.next() reads the unbalanced close tag as if they were characters with no problem, and then parsing can continue from there.

If the unbalanced close tag was preceded by another close tag, it treats the unbalanced close tag as a close tag and ends up hitting the below line:

if (!mElementStack.pop()) { // false if root closed

This throws an IllegalStateException from the line below. If you try to call reader.next() after catching this error, it continues trying to parse the unbalanced close tag as a close tag, and continuously throws this error.

throw new IllegalStateException("Popping from empty stack");

Our use case is unusual, as we are trying to parse XML documents in fragments and so encounter this case quite regularly. Right now, our workaround is to extend from the Woodstox stream reader and manually set the event type to characters if we encounter this error. That doesn't sound like a solution that would work for all use cases, but it works for us. However, shouldn't it be possible to skip over a close tag if no corresponding opening tag has been seen by the parser, with the default Woodstox parser? If anyone has an idea on how to solve this, I'd be happy to contribute a PR.

try {
    return next();
} catch (IllegalStateException ex) {
    if (!ex.getMessage().equals("Popping from empty stack")) {
        throw ex;
    }
    mCurrToken = XMLEvent.CHARACTERS;
    return mCurrToken;
}
@cowtowncoder
Copy link
Member

One quick comment just to make sure I understand this correctly: XML parsers are required to verify well-formedness, so quietly skipping "unmatched" close tags is a no-no and should not be the default behavior. But it is fine to have ways to configure things to allow that for use cases where it is desired (either via new property or, probably better, through error handler).

So... It would be nice of course to make it easier to recover from issues like this, although in general it is not something that can be solved for every case (and cost of trying to keep state recoverable in all cases is pretty steep)
But in this particular case perhaps there could be something as simple as reordering state update or something -- and if so, I would accept a patch. And I think that first figuring out where handling would work makes sense, after which specific configuration mechanism can be decided on.

@cowtowncoder cowtowncoder added the pr-welcome Issue for which progress most likely if someone submits a Pull Request label Nov 20, 2021
@cowtowncoder
Copy link
Member

I don't have time or interest to work on this issue, but if someone wants to try come up with a change to allow better handling of non-well-formed content, would be happy to help get such feature merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-welcome Issue for which progress most likely if someone submits a Pull Request
Projects
None yet
Development

No branches or pull requests

2 participants