An alternative parser integration

This note proposes an alternative method of integrating the output of the SAX parsing of the Formatting Object (FO) tree into FOP processing. The purpose of the proposed changes is to provide for better decomposition of the process of analysing and rendering an FO tree such as is represented in the output from initial (XSLT) processing of an XML source document.

Structure of SAX parsing

Figure 1 is a schematic representation of the process of SAX parsing of an input source. SAX parsing involves the registration, with an object implementing the XMLReader interface, of a ContentHandler which contains a callback routine for each of the event types encountered by the parser, e.g., startDocument(), startElement(), characters(), endElement() and endDocument(). Parsing is initiated by a call to the parse() method of the XMLReader. Note that the call to parse() and the calls to individual callback methods are synchronous: parse() will only return when the last callback method returns, and each callback must complete before the next is called.

Figure 1

SAX parsing schematic
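
The synchronous behaviour described above can be seen in a small sketch that uses the standard JAXP and SAX classes. The class name SaxTrace and its trace() method are invented for this illustration and are not part of FOP. The handler simply records the name of each callback as it fires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of FOP.
class SaxTrace {
    /** Parses the XML text and returns the callbacks in the order they fired. */
    static List<String> trace(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                log.add("startElement " + local);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                log.add("characters " + new String(ch, start, length));
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("endElement " + local);
            }
            @Override public void endDocument() { log.add("endDocument"); }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        // parse() has returned, so every callback has already completed.
        return log;
    }
}
```

By the time trace() returns, the log is complete: nothing outside the callbacks has had an opportunity to act on an event while parsing was in progress.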

In the process of parsing, the hierarchical structure of the original FO tree is flattened into a number of streams of events of the same type which are reported in the sequence in which they are encountered. Apart from that, the API imposes no structure or constraint which expresses the relationship between, e.g., a startElement event and the endElement event for the same element. To the extent that such relationship information is required, it must be managed by the callback routines.

The most direct approach here is to build the tree "invisibly"; to bury within the callback routines the necessary code to construct the tree. In the simplest case, the whole of the FO tree is built within the call to parse(), and that in-memory tree is subsequently processed to (a) validate the FO structure, and (b) construct the Area tree. The problem with this approach is the potential size of the FO tree in memory. FOP has suffered from this problem in the past.


Cluttered callbacks

On the other hand, the callback code may become increasingly complex as tree validation and the triggering of the Area tree processing and subsequent rendering are moved into the callbacks, typically the endElement() method. In order to overcome acute memory problems, the FOP code was recently modified in this way, to trigger Area tree building and rendering in the endElement() method when the end of a page-sequence was detected.

The drawback with such a method is that it becomes difficult to determine the order of events and the circumstances in which any particular processing events are triggered. When the processing events are inherently self-contained, this is irrelevant. But the more complex and context-dependent the relationships are among the processing elements, the more obscurity is engendered in the code by such "side-effect" processing.


From passive to active parsing

In order to solve the simultaneous problems of exposing the structure of the processing and minimising in-memory requirements, the experimental code separates the parsing of the input source from the building of the FO tree and all downstream processing. The callback routines become minimal: acting as a producer, they do no more than create and buffer XMLEvent objects. All of these objects are effectively merged into a single event stream, in strict event order, for subsequent access by the FO tree building process, acting as a consumer. In itself, this does not reduce the memory footprint. That reduction comes when the approach is generalised to modularise FOP processing.

Figure 2

XML event buffer

The most useful change that this brings about is the switch from passive to active XML element processing. The process of parsing now becomes visible to the controlling process. All local validation and all building of objects and data structures are initiated by the process(es) getting events from the queue - in the case above, the FO tree builder.
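
A minimal sketch of this producer-consumer arrangement follows. It is not the experimental FOP code: the nested Event class is a simplified stand-in for XMLEvent, java.util.concurrent.ArrayBlockingQueue is assumed in place of SyncedCircularBuffer, and the names BufferedParse, QueueingHandler and consume() are invented for the illustration. CHARACTERS events are omitted for brevity.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; the names here are not taken from FOP.
class BufferedParse {
    enum Type { STARTDOCUMENT, STARTELEMENT, ENDELEMENT, ENDDOCUMENT }

    /** Simplified stand-in for the XMLEvent class described in the text. */
    static final class Event {
        final Type type;
        final String name;
        Event(Type type, String name) { this.type = type; this.name = name; }
        @Override public String toString() {
            return name == null ? type.toString() : type + " " + name;
        }
    }

    /** Producer: each callback only wraps the SAX event and queues it. */
    static final class QueueingHandler extends DefaultHandler {
        private final BlockingQueue<Event> events;
        QueueingHandler(BlockingQueue<Event> events) { this.events = events; }
        private void put(Type type, String name) {
            try {
                events.put(new Event(type, name)); // blocks while the buffer is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        @Override public void startDocument() { put(Type.STARTDOCUMENT, null); }
        @Override public void startElement(String uri, String local,
                                           String qName, Attributes atts) {
            put(Type.STARTELEMENT, local);
        }
        @Override public void endElement(String uri, String local, String qName) {
            put(Type.ENDELEMENT, local);
        }
        @Override public void endDocument() { put(Type.ENDDOCUMENT, null); }
    }

    /** Consumer: pulls events in strict document order until ENDDOCUMENT. */
    static List<String> consume(String xml, int capacity) {
        BlockingQueue<Event> events = new ArrayBlockingQueue<>(capacity);
        Thread parser = new Thread(() -> {
            try {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                XMLReader reader = factory.newSAXParser().getXMLReader();
                reader.setContentHandler(new QueueingHandler(events));
                reader.parse(new InputSource(new StringReader(xml)));
            } catch (Exception e) {
                throw new IllegalStateException("parse failed", e);
            }
        });
        parser.setDaemon(true);
        parser.start();
        List<String> seen = new ArrayList<>();
        try {
            Event e;
            do {
                e = events.take(); // the consumer decides when to ask for more
                seen.add(e.toString());
            } while (e.type != Type.ENDDOCUMENT);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
        return seen;
    }
}
```

With a small capacity the parser thread stalls whenever the buffer is full, so it can never run far ahead of the consumer. The sketch assumes well-formed input; a fuller version would also need to signal a parse failure to the consumer so that it does not wait indefinitely.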


XMLEvent methods

The experimental code uses a class XMLEvent to provide the objects which are placed in the queue. XMLEvent includes a variety of methods to access elements in the queue. Namespace URIs encountered in parsing are maintained in a static HashMap where they are associated with a unique integer index. This integer value is used in the signature of some of the access methods.

  • XMLEvent getEvent(SyncedCircularBuffer events)
    This is the basis of all of the queue access methods. It returns the next element from the queue, which may be a pushback element.
  • XMLEvent getEndDocument(events)
    Get and discard elements from the queue until an ENDDOCUMENT element is found and returned.
  • XMLEvent expectEndDocument(events)
    If the next element on the queue is an ENDDOCUMENT event, return it. Otherwise, push the element back and throw an exception. Each of the get methods (except getEvent() itself) has a corresponding expect method.
  • XMLEvent get/expectStartElement(events)
    Return the next STARTELEMENT event from the queue.
  • XMLEvent get/expectStartElement(events, String qName)
    Return the next STARTELEMENT with a QName matching qName.
  • XMLEvent get/expectStartElement(events, int uriIndex, String localName)
    Return the next STARTELEMENT with a URI indicated by the uriIndex and a local name matching localName.
  • XMLEvent get/expectStartElement(events, LinkedList list)
    list contains instances of the nested class UriLocalName, which hold a uriIndex and a localName. Return the next STARTELEMENT with a URI indicated by the uriIndex and a local name matching localName from any element of list.
  • XMLEvent get/expectEndElement(events)
    Return the next ENDELEMENT.
  • XMLEvent get/expectEndElement(events, qName)
    Return the next ENDELEMENT with a QName matching qName.
  • XMLEvent get/expectEndElement(events, uriIndex, localName)
    Return the next ENDELEMENT with a URI indicated by the uriIndex and a local name matching localName.
  • XMLEvent get/expectEndElement(events, XMLEvent event)
    Return the next ENDELEMENT whose uriIndex and localName match those of the event argument. This is intended as a quick way to find the ENDELEMENT matching a previously returned STARTELEMENT.
  • XMLEvent get/expectCharacters(events)
    Return the next CHARACTERS event.
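
The difference between the get and expect forms can be sketched as follows. This is an illustration only, not the XMLEvent source: the class EventQueueSketch, its nested Event and Type, and the use of an in-memory ArrayDeque in place of the synchronised buffer are all assumptions, and matching is by QName alone. It reads "get" as discarding events until a match is found, as described for getEndDocument above.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical sketch of get/expect semantics; not the FOP XMLEvent class.
class EventQueueSketch {
    enum Type { STARTELEMENT, ENDELEMENT, CHARACTERS, ENDDOCUMENT }

    static final class Event {
        final Type type;
        final String qName; // null for events that carry no element name
        Event(Type type, String qName) { this.type = type; this.qName = qName; }
    }

    private final Deque<Event> queue = new ArrayDeque<>();

    EventQueueSketch(Event... events) { queue.addAll(Arrays.asList(events)); }

    int size() { return queue.size(); }

    /** Basis of all access: the next event, which may be one that was pushed back. */
    Event getEvent() {
        Event e = queue.pollFirst();
        if (e == null) throw new IllegalStateException("queue exhausted");
        return e;
    }

    /** Return an event to the head of the queue so that it is read again next. */
    void pushBack(Event e) { queue.addFirst(e); }

    private static boolean isStart(Event e, String qName) {
        return e.type == Type.STARTELEMENT && qName.equals(e.qName);
    }

    /** get: discard events until a STARTELEMENT with the given QName is found. */
    Event getStartElement(String qName) {
        Event e = getEvent();
        while (!isStart(e, qName)) e = getEvent();
        return e;
    }

    /** expect: the very next event must match; otherwise push it back and throw. */
    Event expectStartElement(String qName) {
        Event e = getEvent();
        if (isStart(e, qName)) return e;
        pushBack(e);
        throw new IllegalStateException(
            "expected start of " + qName + " but found " + e.type);
    }
}
```

Because a failed expect leaves the queue exactly as it found it, a caller can try one alternative, catch the exception, and then try another. That is what makes this style usable for validating content models.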

FOP modularisation

This same principle can be extended to the other major sub-systems of FOP processing. In each case, while it is possible to hold a complete intermediate result in memory, the memory costs of that approach are too high. The sub-systems - XML parsing, FO tree construction, Area tree construction and rendering - must run in parallel if the footprint is to be kept manageable. By creating a series of producer-consumer pairs linked by synchronized buffers, logical isolation can be achieved while rates of processing remain coupled. By introducing feedback loops conveying information about the completion of processing of the elements, sub-systems can dispose of or summarise those elements without having to be tightly coupled to downstream processes.

Figure 3

FOP modularisation
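
The chain of producer-consumer pairs can be sketched with bounded queues. Everything in this example is hypothetical: the class PipelineSketch, the string payloads standing in for events, FO nodes and areas, and the END marker are placeholders, and the feedback loops mentioned above are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical sketch of sub-systems linked by bounded buffers; not FOP code.
class PipelineSketch {
    private static final String END = "<end-of-input>"; // marks the end of the stream

    /** Starts a thread that takes from in, applies work, and puts to out. */
    static void stage(BlockingQueue<String> in, BlockingQueue<String> out,
                      Function<String, String> work) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String item = in.take();
                    if (END.equals(item)) { out.put(END); return; }
                    out.put(work.apply(item)); // blocks while downstream is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
    }

    /** Runs parsing, FO tree building and Area tree building as stages,
     *  with rendering performed by the calling thread as the final consumer. */
    static List<String> run(List<String> input, int capacity) {
        BlockingQueue<String> events = new ArrayBlockingQueue<>(capacity);
        BlockingQueue<String> foNodes = new ArrayBlockingQueue<>(capacity);
        BlockingQueue<String> areas = new ArrayBlockingQueue<>(capacity);

        Thread source = new Thread(() -> { // stands in for the XML parser
            try {
                for (String s : input) events.put(s);
                events.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        source.setDaemon(true);
        source.start();

        stage(events, foNodes, s -> "fo(" + s + ")");   // FO tree construction
        stage(foNodes, areas, s -> "area(" + s + ")");  // Area tree construction

        List<String> rendered = new ArrayList<>();
        try {
            for (String a = areas.take(); !END.equals(a); a = areas.take()) {
                rendered.add("page(" + a + ")");        // rendering
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return rendered;
    }
}
```

Each stage knows only its input and output buffers, which gives the logical isolation described above, while the bounded capacity keeps the rates of the stages coupled: an upstream stage blocks as soon as the stage after it falls behind.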




Copyright © 2001-2002 The Apache Software Foundation. All Rights Reserved.