XMLEventReader always returns the full content in text events, even if coalescing disabled #142
Comments
I never use the event-based API myself, and my understanding is that if you want more efficient access, you should use the iterator-based approach instead. However, I can see how the ability to change this behavior could be useful. I would be open to, for example, the addition of a new Woodstox-specific configuration setting that enables the non-default behavior of allowing split text segments. I do not have time to work on this myself but would be happy to help if someone wanted to contribute improvements. Another thought: maybe you can actually change the "min segment length" for the reader? Settings available are explained here: (see P_MIN_TEXT_SEGMENT)
I don't think changing P_MIN_TEXT_SEGMENT can make a difference. (I've also tested it.) The value configured with this property is only read when the "forER" (for Event Reader) flag is not set:
I was under the impression that the "forER" branch above was introduced to fix some other, larger problem. That does not seem to be the case? I have changed the conditional and all tests seem to pass (except wstxtest.evt.TestEventReader#testEventReaderLongSegments, which explicitly tests this behaviour). Our own tests also pass with the modified BasicStreamReader, including the ones that test large text sections. So if the "forER" branch is only a convenience, as described in the comment, the only necessary change would be to make the behaviour configurable. That seems to be quite straightforward. Should I prepare a pull request? Regarding the question of how this would be configured:
You are probably the better judge of which of these options would be better. Both would work for us. (2) is probably more elegant, but carries the bigger risk of surprising users with a change to existing behaviour (if they already explicitly set coalescing to false and nevertheless depend on full text segments). Let me know if you would be interested in a pull request, which version of the configuration (1 or 2) you would prefer, and optionally how a possible new configuration setting should be named.
As to the solution... hmmh. I was about to say that (1) seems safer to me. But if you can make (2) work reliably (and it sounds like you should be able to), why not? So I'd be happy with either one.
I assume you haven't had time to look at the pull request yet. Just to avoid a misunderstanding, I would like to make sure you are not waiting for something on my part?
Hi @johannesherr! Apologies for the slowness, I was on vacation and then did not circle back here. No, I think the PR itself is something I just need to read through; it seems to be along the lines we discussed. Since it has slight potential for surprises, I am thinking that if and when it is merged, we should bump the minor version, not just the patch version.
The event-based Woodstox API (XMLEventReader) always returns the full text content of elements. This leads to OutOfMemoryErrors when the strings are large.
The specification seems to intend that large text contents are split into multiple text events, thereby allowing the user to avoid unbounded memory usage. That is the behaviour one sees in the JDK implementation of XMLInputFactory, com.sun.xml.internal.stream.XMLInputFactoryImpl.
Woodstox seems to deliberately avoid this behaviour as seen here:
woodstox/src/main/java/com/ctc/wstx/sr/BasicStreamReader.java
Line 442 in e313616
As far as I know, the property javax.xml.stream.XMLInputFactory.IS_COALESCING should determine whether a parser returns text content as a single string. Unfortunately, it has no effect on Woodstox behavior.
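To make the expectation concrete, here is a minimal sketch of how one might probe this using only the standard javax.xml.stream API. The class and method names (CoalescingProbe, countTextEvents) are made up for illustration. It counts the Characters events delivered for a document and sums their lengths; whether a large text node actually arrives as several events depends on which StAX implementation XMLInputFactory.newInstance() picks up.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of Woodstox or the JDK.
final class CoalescingProbe {

    /**
     * Parses the document and returns a two-element array:
     * [0] = number of Characters events seen, [1] = total length of their text.
     */
    static long[] countTextEvents(String xml, boolean coalescing) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Standard StAX property; false should permit split text events.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.valueOf(coalescing));

        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        long events = 0;
        long chars = 0;
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    events++;
                    chars += event.asCharacters().getData().length();
                }
            }
        } finally {
            reader.close();
        }
        return new long[] {events, chars};
    }
}
```

Running this against a document with one very large text node, once with each implementation on the classpath, should show whether the text is delivered in one piece or in several.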
This makes it impossible for us to use the Woodstox parser.
Our use case is to transfer XML documents and to modify some of them slightly. Because we pass most of each document through unchanged, the event-based API seems a good fit for us. We read the XML events and pass them to an XMLEventWriter. In some cases we need to modify the events before writing them out.
Some documents we have to handle contain up to 10 GB of data as the text content of a single XML element. This leads to an OOME when the parser tries to return it as a single text event.
If this non-standard behavior of Woodstox is something you cannot fix, could you tell us what a workaround for this situation would be? SAX- or StAX-based approaches seem a bad fit for our use case, since you have to deliberately specify which parts of the XML you want to handle. This seems to increase the risk of inadvertently leaving out parts of the document, and thereby causing unwanted changes, when processing it.
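A rough sketch of the kind of pass-through we mean, using only the standard javax.xml.stream API. EventPassThrough, copy and upperCaseText are illustrative names, and the upper-casing edit is just a stand-in for our real modifications. Every event is handed to the writer unless the edit function replaces it, so memory use stays bounded only if the reader delivers large text in several Characters events.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Locale;
import java.util.function.UnaryOperator;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper illustrating the use case; not our production code.
final class EventPassThrough {

    /** Copies all events from in to out, applying edit to each event first. */
    static void copy(Reader in, Writer out, UnaryOperator<XMLEvent> edit)
            throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // Ask for non-coalesced text so large content may arrive in pieces.
        inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);

        XMLEventReader reader = inputFactory.createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        try {
            while (reader.hasNext()) {
                // Unmodified events are written exactly as they were read.
                writer.add(edit.apply(reader.nextEvent()));
            }
            writer.flush();
        } finally {
            writer.close();
            reader.close();
        }
    }

    /** Example edit: upper-case all text, leave every other event untouched. */
    static String upperCaseText(String xml) throws XMLStreamException {
        XMLEventFactory eventFactory = XMLEventFactory.newInstance();
        StringWriter out = new StringWriter();
        copy(new StringReader(xml), out, event -> event.isCharacters()
                ? eventFactory.createCharacters(
                        event.asCharacters().getData().toUpperCase(Locale.ROOT))
                : event);
        return out.toString();
    }
}
```

Because the loop never inspects element names, nothing in the document is dropped by omission, which is the property we want to keep.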