Ovidiu Predescu's Weblog: JavaOne - XML parsing (with StAX)

June 11, 2003

Even though I've heard about XML pull parsers, I haven't had time to play with them. From the abstract, this session seems to be a good introduction to the technology. The presenters, Ping Guo, Mark Scardina and K Karun are from Oracle.

I'm actually interested in higher-level XML processors, similar to XSLT, that don't require a DOM tree representation of the input XML document. I looked at STX, Streaming Transformations for XML, and Joost, an open-source Java implementation of STX. The idea is intriguing and this particular implementation seems to be very clean. The CVS version also seems to support hooking up with Java classes running outside the processor.

The Joost STX implementation uses a SAX parser to do the work, which makes it difficult to abort the processing of large XML documents. Aborting would come in handy when, in the middle of processing, the program discovers it has nothing else to do, and it would be a very useful feature for a streaming XML processor. StAX could solve this problem.
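To make the early-exit point concrete, here is a minimal sketch using the javax.xml.stream API as it later shipped with the JDK. The class name EarlyExitDemo, the method firstTitle and the sample document are made up for illustration. Because the application drives the loop, stopping is just a return statement.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of the talk or of Joost.
class EarlyExitDemo {
    // Returns the text of the first <title> element, or null if there is none.
    static String firstTitle(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "title".equals(reader.getLocalName())) {
                        // We have what we need: return without pulling
                        // any further events from the parser.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<book><title>Pull parsing</title><body>a very long body</body></book>";
        System.out.println(firstTitle(xml));
    }
}
```

With SAX, the usual workaround for the same thing is to throw an exception from a handler callback, which is the awkwardness described above.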

14:45 It started. Ping is the lead developer and the presenter of this session, K is an expert group member of JSR 173, and Mark is a manager. StAX is the subject of JSR 173.

Why do such presentations have to start with what XML is, or with examples of what an XML document looks like? I find this a waste of time, especially for a session marked as "intermediary" complexity. You would expect the audience to know these things.

Next is an introduction to different XML parsing methodologies. She introduces DOM, then jumps to JAXP, the Java API for accessing a DOM parser. She now presents a DOM parser example, which I found flawed because it uses getElementsByTagName(). This is bad, especially when the input document conforms to a well-known DTD, because the method traverses all the nodes in the document. She then goes on to explain the cons of using DOM.
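The slide itself is not reproduced here, so the following is only a sketch of the contrast, with a made-up class DomLookupDemo and a made-up document shape in which title is a direct child of the root. One method searches the tree by tag name; the other uses the known structure and looks only at the root's children. How much of the tree a by-name search really visits depends on the DOM implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example class; the document layout is an assumption.
class DomLookupDemo {
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Searches the tree in document order for any element named "title".
    static String titleBySearch(Document doc) {
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    // Uses knowledge of the structure: only the root's children are inspected.
    static String titleByPath(Document doc) {
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && "title".equals(n.getNodeName())) {
                return n.getTextContent();
            }
        }
        return null;
    }
}
```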

15:01 She now introduces SAX, explaining the model and giving a small example.
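For comparison with what follows, a small SAX sketch (not the one from the slides; the class SaxTitlesDemo and the title element are assumptions) shows the push model: the parser calls the handler, and the handler has to keep its own state between callbacks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class illustrating SAX callbacks.
class SaxTitlesDemo {
    // Collects the text of every <title> element (assumes titles are not nested).
    static List<String> titles(String xml) {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Character data may arrive in several chunks, so append.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    found.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }
}
```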

15:09 Finally, the meat of the talk: StAX. This new parsing technology uses a different event model. It has an abstract XMLEvent event, with concrete events such as StartDocument and StartElement. The interface with the parser is very similar to JAXP. The parsing model is different from SAX: the application tells the parser to get the next event, and the parser returns the concrete event objects.

The cursor style example

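while (reader.hasNext()) {
  // Pull the next event; the cursor API reports its type as an int constant
  int eventType = reader.next();
  if (eventType == XMLEvent.START_ELEMENT && reader.getLocalName().equals("title")) {
    // Advance to the event after the start tag, assumed here to be the title's text
    reader.next();
    System.out.println(reader.getText());
  }
}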

The next example (iterator style) she presents asks the parser to return the actual event object, which may be easier to program but uses more memory because the parser has to create the event object (an actual XMLEvent instance).
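The iterator-style slide is not reproduced here; a rough equivalent, written against the XMLEventReader interface of the javax.xml.stream API as it later shipped with the JDK, might look like the sketch below. The class IteratorStyleDemo and the title element are assumptions. Each call to nextEvent() hands back an allocated XMLEvent object, which is where the extra memory goes.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical example class illustrating the iterator (event object) style.
class IteratorStyleDemo {
    // Returns the text of the first <title> element, or null if there is none.
    static String firstTitle(String xml) {
        try {
            XMLEventReader events =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            while (events.hasNext()) {
                // Each event is a real object that can be stored or passed around.
                XMLEvent event = events.nextEvent();
                if (event.isStartElement()
                        && "title".equals(event.asStartElement().getName().getLocalPart())) {
                    return events.getElementText();
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```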

The nice part about StAX is that you can process multiple XML documents at the same time, by simply sucking data from multiple parsers.
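As a sketch of that idea, the following pulls from two cursor-style readers in lockstep and stops at the first difference in their sequence of element names. The class ParallelPullDemo and both method names are invented for illustration, and the comparison deliberately ignores text, attributes and nesting depth.

```java
import java.io.StringReader;
import java.util.Objects;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: two pull parsers advanced side by side.
class ParallelPullDemo {
    // Advances to the next start tag and returns its local name, or null at end of input.
    static String nextElementName(XMLStreamReader r) throws XMLStreamException {
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                return r.getLocalName();
            }
        }
        return null;
    }

    // True if both documents contain the same element names in the same order.
    static boolean sameElementSequence(String xmlA, String xmlB) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader a = factory.createXMLStreamReader(new StringReader(xmlA));
            XMLStreamReader b = factory.createXMLStreamReader(new StringReader(xmlB));
            while (true) {
                String nameA = nextElementName(a);
                String nameB = nextElementName(b);
                if (!Objects.equals(nameA, nameB)) {
                    return false; // first difference: stop reading both inputs
                }
                if (nameA == null) {
                    return true;  // both documents ended together
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Doing the same with SAX would need two threads or some buffering, since each SAX parser wants to own the control flow.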

She now jumps into Web Services and JAX-RPC. This seems to be going beyond the scope of this talk, so I'll head out. StAX and XML pull parsers seem to be a really good idea. I need to check them out.

Posted by ovidiu at June 11, 2003 03:23 PM