[antlr-interest] Somewhat off topic: XML/XSD parser generators and processing
iank at bearcave.com
Thu Feb 20 12:01:26 PST 2003
I've been unsuccessfully trying to find an answer to a question
about XML Schema based parser generators. Since a number of people
in the ANTLR community are not only parser and compiler experts, but
are also Java and XML gurus, I am taking the liberty of asking the
question here, although it is not really related to ANTLR.
An XML Schema (an "XSD") is a grammar that describes an XML
document. Processors like Xerces will read an XSD and use it to
validate an XML document. Here, "validation" means checking for
syntactic correctness, with a little semantic checking thrown in.
XSD processors like Xerces act like parser generators in that they
parse on the basis of a grammar definition. Oddly, the XML community
does not explicitly describe the process in the language of parsers.
In some cases the XSD prefaces the XML document that the XSD is
intended to validate. This structure means that XSD processors must
operate "on the fly", since a new XSD (grammar) may be presented for
every XML document that is to be validated.
XML processing in general tends to be slow and validation is even
slower. I saw an article somewhere about a company that was
proposing dedicated hardware to speed up XML validation and
processing. But what I've been wondering is if one could use parser
technology to speed the task. Given the speed of general purpose
processors it is not clear to me that dedicated hardware is really
needed.
There is not much hope for significantly speeding up the "on the
fly" validation task (where a new schema is presented for each XML
document). In many cases, however, there is a rarely changed set of
XSDs that are used to validate a much larger set of XML documents.
In this case, parser technology might be used to improve
performance.
The parser architecture which seems most appropriate in this case
would not actually be an ANTLR-style recursive descent parser, but
rather a YACC-style, state-table-driven parser.
In this approach a state table would be generated from the XSD
(grammar). The state table can then be cached. If an XML document
is received that references an XSD with a cached state table, the
state table can be used to process the XML document, rather than
referencing the XML schema again. Given the speed of parsers
generated by tools like YACC, processing should be fairly fast.
This seems like an obvious approach to validating XML against
XSDs. However, from looking at the documentation for validating
XML processors, I've been unable to determine whether any of the
tools do this. Do you know of any tools that follow this approach?
If not, is there something I'm missing that makes this approach
unworkable, even in the case where XSDs change infrequently?
Thanks for your patience with this rather off topic query.
Ian Kaplan
iank at bearcave.com
www.bearcave.com