[antlr-interest] Somewhat off topic: XML/XSD parser generators and processing

Thu Feb 20 12:01:26 PST 2003

  I've been unsuccessfully trying to find an answer to a question
  about XML Schema based parser generators.  Since a number of people
  in the ANTLR community are not only parser and compiler experts, but
  are also Java and XML gurus, I am taking the liberty of asking the
  question here, although it is not really related to ANTLR.

  An XML Schema (an "XSD") is a grammar that describes a class of XML
  documents.  Processors like Xerces will read an XSD and use it to
  validate an XML document.  Here, "validation" means checking for
  syntactic correctness, with a little semantic checking thrown in.
  XSD processors like Xerces act like parser generators in that they
  parse on the basis of a grammar definition.  Oddly, the XML community
  does not explicitly describe the process in the language of parsers.

  In some cases the XSD prefaces the XML document that the XSD is
  intended to validate.  This structure means that XSD processors must
  operate "on the fly", since a new XSD (grammar) may be presented for
  every XML document that is to be validated.

  XML processing in general tends to be slow and validation is even
  slower.  I saw an article somewhere about a company that was
  proposing dedicated hardware to speed up XML validation and
  processing.  But what I've been wondering is whether one could use
  parser technology to speed up the task.  Given the speed of general purpose
  processors it is not clear to me that dedicated hardware is really
  needed.

  There is not much hope for significantly speeding up the "on the
  fly" validation task (where a new schema is presented for each XML
  document).  In many cases, however, there is a rarely changed set of
  XSDs that are used to validate a much larger set of XML documents.
  In this case, parser technology might be used to improve
  performance.

  The parser architecture which seems most appropriate in this case
  would not actually be an ANTLR-style recursive descent parser, but
  rather a YACC-style, state-table-driven parser.

  In this approach a state table would be generated from the XSD
  (grammar).  The state table can then be cached.  If an XML document
  is received that references an XSD with a cached state table, the
  state table can be used to process the XML document, rather than
  referencing the XML schema again.  Given the speed of parsers
  generated by tools like YACC, processing should be fairly fast.
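
  The caching idea above can be sketched roughly as follows.  This is
  a minimal illustration in Python, not a YACC-style LALR generator
  and not a real XSD processor: it assumes the content model of one
  element has already been extracted from the schema as a simple
  sequence of (name, minOccurs, maxOccurs) particles with distinct
  names.  The names StateTable, compile_content_model and TableCache
  are hypothetical and exist only for this example.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

# (element name, minOccurs, maxOccurs); maxOccurs of None means "unbounded".
Particle = Tuple[str, int, Optional[int]]
State = Tuple[int, int]  # (index of current particle, occurrences seen so far)


class StateTable:
    """A precomputed transition table for one content model."""

    def __init__(self, transitions: Dict[Tuple[int, str], int], accepting: Set[int]):
        self.transitions = transitions
        self.accepting = accepting

    def accepts(self, names: List[str]) -> bool:
        """Validate a list of child element names with table lookups only."""
        state: Optional[int] = 0
        for name in names:
            state = self.transitions.get((state, name))
            if state is None:
                return False
        return state in self.accepting


def _step(particles: List[Particle], state: State, name: str) -> Optional[State]:
    """Return the state reached on `name`, or None if `name` is not allowed."""
    i, count = state
    while i < len(particles):
        pname, lo, hi = particles[i]
        if pname == name and (hi is None or count < hi):
            new_count = count + 1
            if hi is None:
                # Counts past minOccurs are equivalent when unbounded; cap them
                # so the set of states stays finite.
                new_count = min(new_count, max(lo, 1))
            return (i, new_count)
        if count < lo:
            return None  # a required particle has not been satisfied yet
        i, count = i + 1, 0  # current particle is satisfied; try the next one
    return None


def _is_accepting(particles: List[Particle], state: State) -> bool:
    i, count = state
    if i < len(particles) and count < particles[i][1]:
        return False
    return all(lo == 0 for _, lo, _ in particles[i + 1:])


def compile_content_model(particles: List[Particle]) -> StateTable:
    """Build the table once by exploring every reachable state."""
    alphabet = {name for name, _, _ in particles}
    ids: Dict[State, int] = {(0, 0): 0}
    transitions: Dict[Tuple[int, str], int] = {}
    accepting: Set[int] = set()
    queue = deque([(0, 0)])
    while queue:
        state = queue.popleft()
        if _is_accepting(particles, state):
            accepting.add(ids[state])
        for name in alphabet:
            nxt = _step(particles, state, name)
            if nxt is None:
                continue
            if nxt not in ids:
                ids[nxt] = len(ids)
                queue.append(nxt)
            transitions[(ids[state], name)] = ids[nxt]
    return StateTable(transitions, accepting)


class TableCache:
    """Cache of compiled tables, keyed by a schema identifier (e.g. its URI)."""

    def __init__(self) -> None:
        self._tables: Dict[str, StateTable] = {}
        self.compiles = 0  # how many times a table actually had to be built

    def get(self, schema_id: str, particles: List[Particle]) -> StateTable:
        table = self._tables.get(schema_id)
        if table is None:
            self.compiles += 1
            table = compile_content_model(particles)
            self._tables[schema_id] = table
        return table
```

  With a cache like this, the schema is compiled the first time its
  identifier is seen, and each later document that references the
  same identifier is checked with dictionary lookups alone.  A real
  XSD has nested elements, choices, attributes and datatypes, so a
  full implementation would need considerably more than this.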

  This seems like an obvious approach to validating XML documents
  against XSDs.  However, from looking at the documentation on validating
  XML processors, I've been unable to determine whether any of the
  tools do this.  Do you know of any tools that follow this approach?
  If not, is there something I'm missing that makes this approach
  unworkable, even in the case where XSDs change infrequently?

  Thanks for your patience with this rather off topic query.

  Ian Kaplan
  iank at bearcave.com
  www.bearcave.com
