[antlr-interest] XML QName Character Validation
Martin Probst
mail at martin-probst.com
Fri Apr 4 02:24:54 PDT 2008
Hi all,
I'm making really nice progress with my XQuery grammar, thanks to the
help of Jim Idle and Ter's awesome LL(*) algorithm.
I'm facing a single last problem: in XQuery, QNames play an important
role. QNames and keywords overlap, so it's a keyword-free grammar. The
rules on what makes a legal QName from the XML spec are quite complex
in their selection of Unicode characters, see here: http://www.w3.org/TR/REC-xml/#NT-Letter
If I naively translate that to ANTLR fragment rules, ANTLR fails to
analyze those rules:
warning(205): XQuery.g:1:8: ANTLR could not analyze this decision in
rule Tokens; often this is because of recursive rule references
visible from the left edge of alternatives. ANTLR will re-analyze the
decision with a fixed lookahead of k=1. Consider using "options
{k=1;}" for that decision and possibly adding a syntactic predicate.
error(10): internal error:
org.antlr.tool.Grammar.createLookaheadDFA(Grammar.java:1152): could
not even do k=1 for decision 24; reason: timed out (>1000ms)
No blame here, those rules are probably better handled in a different
way. The question is: which different way? I tried taking those rules
out into a Java file called CharHelper, and having these rules:
... lots of tokens, snip ...
UNION : 'union';
UNORDERED
: 'unordered';
... snip ...
QName : NCName (':' NCName)?;
fragment NCName : NCNameStartChar NCNameChar*;
fragment NCNameStartChar
: Letter | '_';
fragment NCNameStartChar
: Letter | '_';
fragment NCNameChar
: Letter | XMLDigit | '.' | '-' | '_' | CombiningChar | Extender;
fragment Letter
: { CharHelper.isLetter(LA(1) }? => .;
fragment BaseChar
: { CharHelper.isBaseChar(LA(1) }? => .;
fragment Ideographic
: { CharHelper.isIdeographic(LA(1)) }? => .;
fragment XMLDigit
: { CharHelper.isXMLDigit(LA(1)) }? => .;
fragment CombiningChar
: { CharHelper.isCombiningChar(LA(1)) }? => .;
fragment Extender
: { CharHelper.isExtender(LA(1)) }? => .;
But this makes ANTLR complain about ambiguities:
warning(209): XQuery.g:319:1: Multiple token rules can match input
such as "'u'": UNION, UNORDERED, QName, Wildcard
As a result, tokens(s) UNORDERED,QName,Wildcard were disabled for that
input
So apparently the lexical analysis is now behaving quite differently.
Before all this, I just specified NCName to be ('a'..'z' | 'A'..'Z')+
and it worked like a charm. I somehow fail to see how (effectivly)
changing that to '\u0000'..'\uFFFE' with a gating predicate changes
this to be ambiguous.
Any ideas?
BTW: I've implemented a nice technique of lexer switching based on
parser context. XQuery is a language who's lexical structure changes
quite radically between normal expressions and the embedded XML
literals.
This is somewhat similar to the example at http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control
. However I don't need to switch the full parser, as my grammatical
rules for the whole language fit into one grammar - I just needed to
change the way tokens are generated by the lexer. I'm going to write
up my technique as an addendum to that page once it's done.
I'd also like to make my grammar freely available on antlr.org once
it's done, if there is interest. Do I just send it to Ter, or how does
that work?
Thanks,
Martin
More information about the antlr-interest
mailing list