Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XMLEventReader always returns the full content in text events, even if coalescing disabled #142

Closed
johannesherr opened this issue Apr 5, 2022 · 5 comments
Milestone

Comments

@johannesherr
Copy link
Contributor

The event based Woodstox API (XMLEventReader) always returns the full text content of elements. This leads to OutOfMemoryErrors, when the strings are large.

The specification seems to intend to split large text contents into multiple text events. Thereby allowing the user to avoid unlimited memory usage. That is the behaviour one sees in the JDK-Implementation of XMLInputFactory, com.sun.xml.internal.stream.XMLInputFactoryImpl.

Woodstox seems to deliberately avoid this behaviour as seen here:

if (forER) {
    /* 30-Sep-2005, TSa: No point in returning runt segments for event readers
     *   (due to event object overhead, less convenient); let's just force
     *   returning of full length segments.
     */
    mShortestTextSegment = Integer.MAX_VALUE;

mShortestTextSegment = Integer.MAX_VALUE;

As far as I know the property javax.xml.stream.XMLInputFactory.IS_COALESCING should determine if a parser returns text contents as a single string. It has no effect on Woodstox behavior, unfortunately.

This makes it impossible for us to use the Woodstox parser.

Our use case is, to transfer XML documents and to modify some of them slightly. Because we mostly pass the documents through unchanged, the event based api seems a good fit for us, because we do not want to change most of the document. Therefore, we read the xml events and pass them to a XMLEventWriter. In some cases we need to make changes and modify the events, before writing them out.

Some documents we have to handle, contain up to 10 GB of data as text content of a single xml element. This leads to OOME, when the parser tries to return it as a single text event.

If this non-standard behavior of Woodstox is something you cannot fix, could you tell us what a workaround for this situation is? SAX or Stax base approaches seem a bad fit for our use case since you have to deliberately specify what parts of the xml you want to handle. This seems to increase the risk of inadvertently leaving out parts of the document and thereby causing unwanted changes, when processing it.

@cowtowncoder
Copy link
Member

I never use event-based API myself, and my understanding is that if you want more efficient access, you should use iterator-based approach instead.

However, I can see how ability to change this behavior could be useful. I would be open to -- for example -- addition of a new Woodstox-specific configuration setting that would allow non-default behavior of allowing split text segments.
Or, alternatively, as long as the default setting for Event API reader is that of coalescing on -- for backwards compatibility, users would absolutely be surprised if default behavior changed (vast majority prefering coalescing) -- I would also be fine with change that allows explicit change of coalescing to take effect.

I do not have time to work on this myself but would be happy to help if someone wanted to contribute improvements.

Another thought: maybe you can actually change the "min segment length" for reader? Settings available are explained here:

https://cowtowncoder.medium.com/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173

(see P_MIN_TEXT_SEGMENT)

@johannesherr
Copy link
Contributor Author

johannesherr commented Apr 7, 2022

I don't think changing P_MIN_TEXT_SEGMENT can make a difference. (I've also tested it.) The value configured with this property is only read, when the "forER" (for Event Reader) flag is not set:

if (forER) {
    /* 30-Sep-2005, TSa: No point in returning runt segments for event readers
     *   (due to event object overhead, less convenient); let's just force
     *   returning of full length segments.
     */
    mShortestTextSegment = Integer.MAX_VALUE;
} else {
    mShortestTextSegment = cfg.getShortestReportedTextSegment();    <-- the configured value is only applied, if not an event reader
}

I was under the impression, that the "forER" branch above was introduced to fix some other, larger problem. That does not seem to be the case? I have changed the conditional and all tests seem to pass (except wstxtest.evt.TestEventReader#testEventReaderLongSegments, which explicitly tests this behaviour). Our own tests also pass with a thus modified BasicStreamReader, including the ones that test large text sections.

So if the "forER" branch is only a convenience as described in the comment, the necessary change would only be, to make the behaviour configurable. That seems to be quite straight forward. Should I prepare a pull request?

Regarding the question how this would be configured:

  1. I assume a new property would be introduced similar to PROP_MIN_TEXT_SEGMENT.
  2. If an explicit setting of xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, false); should be respected, I guess I could use the field mConfigFlagMods and a method similar to _hasExplicitConfigFlag(..) to introduce a method that checks if coalescing was explicitly disabled. (Since the global default already means, that coalescing is disabled.)

You are probably the better judge, which of these options would be better. Both would work for us. (2) is probably more elegant, but has the bigger risk of surprising users with a change of existing behaviour. (If they already explicitly set coalescing to false and depend on full text segments nevertheless.)

Let me know if you would be interested in a pull request, which version of configuration (1 or 2) you would prefer and optionally how a possible new configuration setting should be named.

@cowtowncoder
Copy link
Member

forER indicates "for [when constructing] an EventReader", to specify differences from XMLStreamReader.
And the whole logic here was meant to cover a problem between Stax defaults and what (I think) users expect:

  1. As you point out, Stax has IS_COALESCING set to false, but
  2. Most users are REALLY surprised when "some of XML text is missing!!!" (I still get such question for streaming reader) -- until pointed out that content may be split, after which point many still think behavior is bad :)

As to the solution... hmmh. I was about to say that (1) seems safer to me. But if you can make (2) work reliably (and it sounds like you should be able to), why not?

So I'd be happy with either one.

@johannesherr
Copy link
Contributor Author

I assume you haven't had time to look at the pull request yet. Just to avoid a misunderstanding, I would like to make sure, you are not waiting for something on my part?

@cowtowncoder
Copy link
Member

Hi @johannesherr! Apologies for slowness, I was on vacation and then did not circle back here.

No, I think PR itself is something I just need to read through; seems to be along the lines we discussed.

Since it has slight potential for surprises, thinking that if and when merging, should bump minor version, not just patch.

@cowtowncoder cowtowncoder changed the title XMLEventReader always returns the full content in text events XMLEventReader always returns the full content in text events, even if coalescing disabled Apr 30, 2022
@cowtowncoder cowtowncoder added this to the 6.3.0 milestone Apr 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants