`XMLEventReader` always returns the full content in text events, even if coalescing disabled #142

johannesherr · 2022-04-05T11:10:12Z

The event based Woodstox API (XMLEventReader) always returns the full text content of elements. This leads to OutOfMemoryErrors, when the strings are large.

The specification seems to intend to split large text contents into multiple text events. Thereby allowing the user to avoid unlimited memory usage. That is the behaviour one sees in the JDK-Implementation of XMLInputFactory, com.sun.xml.internal.stream.XMLInputFactoryImpl.

Woodstox seems to deliberately avoid this behaviour as seen here:

if (forER) {
    /* 30-Sep-2005, TSa: No point in returning runt segments for event readers
     *   (due to event object overhead, less convenient); let's just force
     *   returning of full length segments.
     */
    mShortestTextSegment = Integer.MAX_VALUE;

woodstox/src/main/java/com/ctc/wstx/sr/BasicStreamReader.java

Line 442 in e313616

mShortestTextSegment = Integer.MAX_VALUE;

As far as I know the property javax.xml.stream.XMLInputFactory.IS_COALESCING should determine if a parser returns text contents as a single string. It has no effect on Woodstox behavior, unfortunately.

This makes it impossible for us to use the Woodstox parser.

Our use case is, to transfer XML documents and to modify some of them slightly. Because we mostly pass the documents through unchanged, the event based api seems a good fit for us, because we do not want to change most of the document. Therefore, we read the xml events and pass them to a XMLEventWriter. In some cases we need to make changes and modify the events, before writing them out.

Some documents we have to handle, contain up to 10 GB of data as text content of a single xml element. This leads to OOME, when the parser tries to return it as a single text event.

If this non-standard behavior of Woodstox is something you cannot fix, could you tell us what a workaround for this situation is? SAX or Stax base approaches seem a bad fit for our use case since you have to deliberately specify what parts of the xml you want to handle. This seems to increase the risk of inadvertently leaving out parts of the document and thereby causing unwanted changes, when processing it.

The text was updated successfully, but these errors were encountered:

cowtowncoder · 2022-04-05T19:13:59Z

I never use event-based API myself, and my understanding is that if you want more efficient access, you should use iterator-based approach instead.

However, I can see how ability to change this behavior could be useful. I would be open to -- for example -- addition of a new Woodstox-specific configuration setting that would allow non-default behavior of allowing split text segments.
Or, alternatively, as long as the default setting for Event API reader is that of coalescing on -- for backwards compatibility, users would absolutely be surprised if default behavior changed (vast majority prefering coalescing) -- I would also be fine with change that allows explicit change of coalescing to take effect.

I do not have time to work on this myself but would be happy to help if someone wanted to contribute improvements.

Another thought: maybe you can actually change the "min segment length" for reader? Settings available are explained here:

https://cowtowncoder.medium.com/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173

(see P_MIN_TEXT_SEGMENT)

johannesherr · 2022-04-07T15:46:04Z

I don't think changing P_MIN_TEXT_SEGMENT can make a difference. (I've also tested it.) The value configured with this property is only read, when the "forER" (for Event Reader) flag is not set:

if (forER) {
    /* 30-Sep-2005, TSa: No point in returning runt segments for event readers
     *   (due to event object overhead, less convenient); let's just force
     *   returning of full length segments.
     */
    mShortestTextSegment = Integer.MAX_VALUE;
} else {
    mShortestTextSegment = cfg.getShortestReportedTextSegment();    <-- the configured value is only applied, if not an event reader
}

I was under the impression, that the "forER" branch above was introduced to fix some other, larger problem. That does not seem to be the case? I have changed the conditional and all tests seem to pass (except wstxtest.evt.TestEventReader#testEventReaderLongSegments, which explicitly tests this behaviour). Our own tests also pass with a thus modified BasicStreamReader, including the ones that test large text sections.

So if the "forER" branch is only a convenience as described in the comment, the necessary change would only be, to make the behaviour configurable. That seems to be quite straight forward. Should I prepare a pull request?

Regarding the question how this would be configured:

I assume a new property would be introduced similar to PROP_MIN_TEXT_SEGMENT.
If an explicit setting of xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, false); should be respected, I guess I could use the field mConfigFlagMods and a method similar to _hasExplicitConfigFlag(..) to introduce a method that checks if coalescing was explicitly disabled. (Since the global default already means, that coalescing is disabled.)

You are probably the better judge, which of these options would be better. Both would work for us. (2) is probably more elegant, but has the bigger risk of surprising users with a change of existing behaviour. (If they already explicitly set coalescing to false and depend on full text segments nevertheless.)

Let me know if you would be interested in a pull request, which version of configuration (1 or 2) you would prefer and optionally how a possible new configuration setting should be named.

cowtowncoder · 2022-04-07T22:00:03Z

forER indicates "for [when constructing] an EventReader", to specify differences from XMLStreamReader.
And the whole logic here was meant to cover a problem between Stax defaults and what (I think) users expect:

As you point out, Stax has IS_COALESCING set to false, but
Most users are REALLY surprised when "some of XML text is missing!!!" (I still get such question for streaming reader) -- until pointed out that content may be split, after which point many still think behavior is bad :)

As to the solution... hmmh. I was about to say that (1) seems safer to me. But if you can make (2) work reliably (and it sounds like you should be able to), why not?

So I'd be happy with either one.

johannesherr · 2022-04-25T09:03:47Z

I assume you haven't had time to look at the pull request yet. Just to avoid a misunderstanding, I would like to make sure, you are not waiting for something on my part?

cowtowncoder · 2022-04-28T23:39:07Z

Hi @johannesherr! Apologies for slowness, I was on vacation and then did not circle back here.

No, I think PR itself is something I just need to read through; seems to be along the lines we discussed.

Since it has slight potential for surprises, thinking that if and when merging, should bump minor version, not just patch.

johannesherr added a commit to johannesherr/woodstox that referenced this issue Apr 8, 2022

Fix FasterXML#142 (multiple text events for long segments if requested)

2f5b7c7

johannesherr mentioned this issue Apr 8, 2022

Fix #142 (multiple text events for long segments if requested) #146

Merged

cowtowncoder closed this as completed in 22b17c1 Apr 30, 2022

cowtowncoder changed the title ~~XMLEventReader always returns the full content in text events~~ XMLEventReader always returns the full content in text events, even if coalescing disabled Apr 30, 2022

cowtowncoder added this to the 6.3.0 milestone Apr 30, 2022

cowtowncoder added a commit that referenced this issue Apr 30, 2022

Update release notes wrt #142, bump version to 6.3.0-SNAPSHOT

7a3e4c5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`XMLEventReader` always returns the full content in text events, even if coalescing disabled #142

`XMLEventReader` always returns the full content in text events, even if coalescing disabled #142

johannesherr commented Apr 5, 2022

cowtowncoder commented Apr 5, 2022

johannesherr commented Apr 7, 2022 •

edited

cowtowncoder commented Apr 7, 2022

johannesherr commented Apr 25, 2022

cowtowncoder commented Apr 28, 2022

XMLEventReader always returns the full content in text events, even if coalescing disabled #142

XMLEventReader always returns the full content in text events, even if coalescing disabled #142

Comments

johannesherr commented Apr 5, 2022

cowtowncoder commented Apr 5, 2022

johannesherr commented Apr 7, 2022 • edited

cowtowncoder commented Apr 7, 2022

johannesherr commented Apr 25, 2022

cowtowncoder commented Apr 28, 2022

`XMLEventReader` always returns the full content in text events, even if coalescing disabled #142

`XMLEventReader` always returns the full content in text events, even if coalescing disabled #142

johannesherr commented Apr 7, 2022 •

edited