Code sample: Async parsing

James Kleeh edited this page May 31, 2019 · 4 revisions

Sample code: Async parsing with Aalto

Non-blocking parsing (also known as "async(hronous) parsing") means that when no input is yet available (not yet sent by the server, for example), the parser does not block the thread of execution and transparently wait for more input; instead it returns a special marker (EVENT_INCOMPLETE) and lets the caller decide what to do. To support this in Java, a different sort of input access mechanism is also needed (in other languages the runtime or the language itself may offer other mechanisms). In Aalto's case, this means the addition of a simple AsyncInputFeeder interface, through which the caller "feeds" more input as needed.
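To make the "marker plus feeder" idea concrete before turning to Aalto itself, here is a deliberately tiny sketch in plain Java. It is not Aalto code: ToyAsyncTokenizer, its Event enum and its method names are all hypothetical, and it splits comma-separated fields rather than XML. Unlike Aalto, which can hand back partial character data, this toy simply buffers an unfinished field until more input arrives.

```java
// Toy illustration of non-blocking parsing; NOT Aalto's implementation.
final class ToyAsyncTokenizer {
    public enum Event { FIELD, INCOMPLETE, END }

    private final StringBuilder pending = new StringBuilder();
    private boolean endOfInput;
    private String field;

    // "Feeder" side: the caller pushes input whenever it has some.
    public void feedInput(String chunk) { pending.append(chunk); }

    // The caller signals that no more input will ever arrive.
    public void endOfInput() { endOfInput = true; }

    // "Reader" side: never blocks; reports INCOMPLETE when it cannot decide yet.
    public Event next() {
        int comma = pending.indexOf(",");
        if (comma >= 0) {
            field = pending.substring(0, comma);
            pending.delete(0, comma + 1);
            return Event.FIELD;
        }
        if (!endOfInput) {
            return Event.INCOMPLETE; // the field might continue in the next chunk
        }
        if (pending.length() > 0) {
            field = pending.toString();
            pending.setLength(0);
            return Event.FIELD;
        }
        return Event.END;
    }

    // Text of the most recently returned FIELD.
    public String getText() { return field; }
}
```

The caller alternates between next() and feedInput() in the same way the Aalto sample further down alternates between parser.next() and the input feeder.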

So what does this mean in practice? We can divide the differences into two parts, as per above: the difference in reading and the difference in feeding input. But let's just look at an example unit test that shows the expected behavior.

Input data

The document we use is very simple:

<root>value</root>

Code

With such simple content, here is code adapted from the unit tests (assertTokenType is a helper from those tests that compares event types). It is not minimal in any way; it is intended to show the specific mechanisms used. Real production code would look somewhat different, because input would be fed from some external source (network socket, file), most likely using non-blocking input handling (NIO) and/or callbacks.

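import javax.xml.stream.XMLStreamConstants;
import com.fasterxml.aalto.AsyncByteArrayFeeder;
import com.fasterxml.aalto.AsyncXMLInputFactory;
import com.fasterxml.aalto.AsyncXMLStreamReader;
import com.fasterxml.aalto.stax.InputFactoryImpl;

AsyncXMLInputFactory inputF = new InputFactoryImpl(); // sub-class of XMLInputFactory2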
// two choices for input feeding: byte[] or ByteBuffer. Here we use former:
byte[] input_part1 = "<root>val".getBytes("UTF-8"); // would come from File, over the net etc
// can construct with initial data, or without; here we initialize with it
AsyncXMLStreamReader<AsyncByteArrayFeeder> parser = inputF.createAsyncFor(input_part1);

// now can access couple of events
assertTokenType(XMLStreamConstants.START_DOCUMENT, parser.next());
assertTokenType(XMLStreamConstants.START_ELEMENT, parser.next());
assertEquals("root", parser.getLocalName());
// since we have parts of CHARACTERS, we'll still get that first:
assertTokenType(XMLStreamConstants.CHARACTERS, parser.next());
assertEquals("val", parser.getText());
// but that's all data we had so:
assertTokenType(AsyncXMLStreamReader.EVENT_INCOMPLETE, parser.next());

// at this point, must feed more data:
byte[] input_part2 = "ue</root>".getBytes("UTF-8");
parser.getInputFeeder().feedInput(input_part2, 0, input_part2.length);

// and can parse that
assertTokenType(XMLStreamConstants.CHARACTERS, parser.next());
assertEquals("ue", parser.getText());
assertTokenType(XMLStreamConstants.END_ELEMENT, parser.next());
assertEquals("root", parser.getLocalName());
assertTokenType(AsyncXMLStreamReader.EVENT_INCOMPLETE, parser.next());

// and if we now ran out of data need to indicate that too
parser.getInputFeeder().endOfInput();
// which lets us conclude parsing
assertTokenType(XMLStreamConstants.END_DOCUMENT, parser.next());
parser.close();
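As a rough idea of what the feeding side could look like when input arrives in chunks, the sketch below pulls from an InputStream only when the parser reports that it has run out of data. To stay runnable without Aalto on the classpath it is written against a hypothetical IncrementalParser interface that mirrors the three calls used in the sample above (parser.next(), getInputFeeder().feedInput(...), getInputFeeder().endOfInput()). The interface, the pump() method and the INCOMPLETE placeholder value are assumptions made for illustration. A blocking read() is used for brevity; a real non-blocking setup would return to its event loop at that point instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;

final class ChunkedFeedLoop {
    /** Hypothetical stand-in for an async parser plus its input feeder. */
    public interface IncrementalParser {
        int next() throws XMLStreamException;
        void feedInput(byte[] buf, int offset, int len) throws XMLStreamException;
        void endOfInput();
    }

    // Placeholder value; with Aalto, compare against
    // AsyncXMLStreamReader.EVENT_INCOMPLETE instead.
    public static final int INCOMPLETE = 257;

    /** Drives the parser to END_DOCUMENT and returns the events seen. */
    public static List<Integer> pump(InputStream in, IncrementalParser parser, int chunkSize)
            throws IOException, XMLStreamException {
        List<Integer> events = new ArrayList<>();
        byte[] buf = new byte[chunkSize];
        boolean ended = false;
        while (true) {
            int event = parser.next();
            if (event == INCOMPLETE) {
                if (ended) {
                    throw new IllegalStateException("parser wants input after end of input");
                }
                // Feed only after INCOMPLETE, so the previous chunk is fully consumed.
                int n = in.read(buf);
                if (n < 0) {
                    parser.endOfInput();
                    ended = true;
                } else if (n > 0) {
                    parser.feedInput(buf, 0, n);
                }
                continue;
            }
            events.add(event); // real code would handle the event here
            if (event == XMLStreamConstants.END_DOCUMENT) {
                return events;
            }
        }
    }
}
```

Feeding only in response to the incomplete marker keeps a single buffer safe to reuse, at the cost of one extra next() call per chunk.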

Limitations

And this is pretty much it: parsing differs only in that code has to assume that EVENT_INCOMPLETE may be received at any point. But there is one important difference from general blocking Stax parsing: not all functionality is available, due to differences between blocking and non-blocking input access. Specifically, any method that must produce a complete answer, but has no means of communicating a "not yet enough data to do that" state, will not work. The most notable limitations, including the various "getElementXxx()" methods, are:

  • Coalescing mode is not implemented: if you must join adjacent character data (cdata) sections, you will need to handle it yourself
    • All character data available within the buffer WILL be coalesced; the parser will only return "partial" sections if the input buffer contains only part of such cdata.
  • XMLStreamReader.getElementText() is not supported: trying to call it will result in an exception
  • Typed element methods from TypedXMLStreamReader (implemented by XMLStreamReader2) will similarly fail:
    • getElementAsInt() (and all other getElementAsXxx() variants)
    • readElementAsIntArray() (and all other readElementAsXxx() variants)

One thing to note is that advanced methods for accessing XML attributes will work without problems: this is because ALL information regarding one complete XML (start) element will be read before the element is considered fully parsed. The problem with the element methods listed above is that they would require parsing a sequence of tokens and, in case of incomplete input, keeping track of all such incomplete state (instead of state pertaining to a single token).
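Since getElementText() and coalescing are unavailable, the caller has to carry the multi-token state itself. One possible shape for that, offered as a sketch and not as anything Aalto ships, is a small collector object that is handed each event as it arrives and remembers the text gathered so far. ElementTextCollector and its methods are hypothetical names; it relies only on the standard javax.xml.stream.XMLStreamReader interface, and it treats any event type it does not recognize (which would include an "incomplete" marker) as "nothing to add yet".

```java
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: gathers the text of a text-only element across events.
final class ElementTextCollector {
    private final StringBuilder text = new StringBuilder();
    private boolean done;

    /**
     * Handles one event already returned by reader.next(), called after the
     * element's START_ELEMENT. Returns true once END_ELEMENT has been seen.
     */
    public boolean accept(int event, XMLStreamReader reader) {
        switch (event) {
            case XMLStreamConstants.CHARACTERS:
            case XMLStreamConstants.CDATA:
            case XMLStreamConstants.SPACE:
                text.append(reader.getText()); // may be only a partial segment
                break;
            case XMLStreamConstants.END_ELEMENT:
                done = true;
                break;
            case XMLStreamConstants.START_ELEMENT:
                throw new IllegalStateException("child element inside text-only element");
            default:
                // Comments, processing instructions and any "incomplete" marker:
                // nothing to append; the caller feeds more input and calls again.
                break;
        }
        return done;
    }

    /** Text collected so far; complete once accept() has returned true. */
    public String getText() {
        return text.toString();
    }
}
```

Because the collector keeps its own state, the caller can leave its parsing loop when input runs out and resume later with the same instance, passing the parser itself as the reader argument.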

Benefits

Aside from the "how", it is useful to consider the "why" as well, and also the "why not". Briefly, the reasons for use include:

  1. Need to limit the number of concurrent threads: with blocking I/O, each concurrent parsing operation requires one dedicated thread. Non-blocking parsing decouples the two; the caller may use any number of threads it wants to.
  2. Desire to limit the amount of buffered input, both to limit memory use and to reduce exposure to denial-of-service attacks. Since the amount of state kept is minimal, parser state is generally smaller than with blocking input: character data, for example, is not accumulated.

Drawbacks

Aside from the limitations listed earlier, there is some performance overhead from keeping additional state (required to track parsing state and location at byte-accurate offsets). This should not be significant even for XML-heavy use cases; but where maximal throughput is strictly required, blocking parsing can be slightly more efficient (throughput in the range of 5-10% higher).