[antlr-interest] XML QName Character Validation
Tom Moog
tmoog at polhode.com
Fri Apr 4 10:19:10 PDT 2008
For NCName, I suggest you look only at the first character, then
accept anything which is not a delimiter (e.g. ":", space, angle
bracket, etc.). After the match, call a routine to check that the
match is a valid name. This has two advantages:
a. It allows you to use the same grammar for both xml 1.0 and
xml 1.1 names.
b. The error messages are much better. Compare:
The element tag "foo@bar" is not a valid NCName
vs.
Unexpected character "@" ...
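A minimal sketch of such a post-match check, written in Java since the
question below already keeps its character tests in a Java helper. The
class and method names here are hypothetical. The ranges assume the
XML 1.1 NameStartChar/NameChar productions with ":" excluded; a check
against the older XML 1.0 tables would swap in different predicates
while the lexer rule stays the same.

```java
// Hypothetical helper, not from the original thread: validates the text
// of a loosely matched token after the lexer has accepted it.
final class NCNameCheck {
    private NCNameCheck() {}

    // Assumed ranges: XML 1.1 NameStartChar without ':'.
    static boolean isNCNameStart(int cp) {
        return (cp >= 'A' && cp <= 'Z') || cp == '_' || (cp >= 'a' && cp <= 'z')
            || (cp >= 0xC0 && cp <= 0xD6) || (cp >= 0xD8 && cp <= 0xF6)
            || (cp >= 0xF8 && cp <= 0x2FF) || (cp >= 0x370 && cp <= 0x37D)
            || (cp >= 0x37F && cp <= 0x1FFF) || (cp >= 0x200C && cp <= 0x200D)
            || (cp >= 0x2070 && cp <= 0x218F) || (cp >= 0x2C00 && cp <= 0x2FEF)
            || (cp >= 0x3001 && cp <= 0xD7FF) || (cp >= 0xF900 && cp <= 0xFDCF)
            || (cp >= 0xFDF0 && cp <= 0xFFFD) || (cp >= 0x10000 && cp <= 0xEFFFF);
    }

    // Assumed ranges: XML 1.1 NameChar without ':'.
    static boolean isNCNameChar(int cp) {
        return isNCNameStart(cp) || cp == '-' || cp == '.'
            || (cp >= '0' && cp <= '9') || cp == 0xB7
            || (cp >= 0x300 && cp <= 0x36F) || (cp >= 0x203F && cp <= 0x2040);
    }

    // True if the whole string is a start character followed by name characters.
    static boolean isValidNCName(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!isNCNameStart(first)) {
            return false;
        }
        int i = Character.charCount(first);
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!isNCNameChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Returns null for a valid name, otherwise a message naming the whole token.
    static String describe(String tokenText) {
        return isValidNCName(tokenText)
            ? null
            : "The element tag \"" + tokenText + "\" is not a valid NCName";
    }
}
```

A lexer action or the parser could call NCNameCheck.describe(...) on the
matched text and report the returned message when it is not null.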
On Fri, 4 Apr 2008, Martin Probst wrote:
> Hi all,
>
> I'm making really nice progress with my XQuery grammar, thanks to the help of
> Jim Idle and Ter's awesome LL(*) algorithm.
>
> I'm facing a single last problem: in XQuery, QNames play an important role.
> QNames and keywords overlap, so it's a keyword-free grammar. The rules on
> what makes a legal QName from the XML spec are quite complex in their
> selection of Unicode characters, see here:
> http://www.w3.org/TR/REC-xml/#NT-Letter
>
> If I naively translate that to ANTLR fragment rules, ANTLR fails to analyze
> those rules:
> warning(205): XQuery.g:1:8: ANTLR could not analyze this decision in rule
> Tokens; often this is because of recursive rule references visible from the
> left edge of alternatives. ANTLR will re-analyze the decision with a fixed
> lookahead of k=1. Consider using "options {k=1;}" for that decision and
> possibly adding a syntactic predicate.
> error(10): internal error:
> org.antlr.tool.Grammar.createLookaheadDFA(Grammar.java:1152): could not even
> do k=1 for decision 24; reason: timed out (>1000ms)
>
> No blame here, those rules are probably better handled in a different way.
> The question is: which different way? I tried taking those rules out into a
> Java file called CharHelper, and having these rules:
>
> ... lots of tokens, snip ...
> UNION : 'union';
> UNORDERED
> : 'unordered';
> ... snip ...
> QName : NCName (':' NCName)?;
> fragment NCName : NCNameStartChar NCNameChar*;
> fragment NCNameStartChar
> : Letter | '_';
> fragment NCNameChar
> : Letter | XMLDigit | '.' | '-' | '_' | CombiningChar |
> Extender;
> fragment Letter
> : { CharHelper.isLetter(LA(1)) }? => .;
> fragment BaseChar
> : { CharHelper.isBaseChar(LA(1)) }? => .;
> fragment Ideographic
> : { CharHelper.isIdeographic(LA(1)) }? => .;
> fragment XMLDigit
> : { CharHelper.isXMLDigit(LA(1)) }? => .;
> fragment CombiningChar
> : { CharHelper.isCombiningChar(LA(1)) }? => .;
> fragment Extender
> : { CharHelper.isExtender(LA(1)) }? => .;
>
> But this makes ANTLR complain about ambiguities:
> warning(209): XQuery.g:319:1: Multiple token rules can match input such as
> "'u'": UNION, UNORDERED, QName, Wildcard
> As a result, tokens(s) UNORDERED,QName,Wildcard were disabled for that input
>
> So apparently the lexical analysis is now behaving quite differently. Before
> all this, I just specified NCName to be ('a'..'z' | 'A'..'Z')+ and it worked
> like a charm. I somehow fail to see how (effectively) changing that to
> '\u0000'..'\uFFFE' with a gating predicate changes this to be ambiguous.
>
> Any ideas?
>
> BTW: I've implemented a nice technique of lexer switching based on parser
> context. XQuery is a language whose lexical structure changes quite radically
> between normal expressions and the embedded XML literals.
>
> This is somewhat similar to the example at
> http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control.
> However I don't need to switch the full parser, as my grammatical rules for
> the whole language fit into one grammar - I just needed to change the way
> tokens are generated by the lexer. I'm going to write up my technique as an
> addendum to that page once it's done.
>
> I'd also like to make my grammar freely available on antlr.org once it's
> done, if there is interest. Do I just send it to Ter, or how does that work?
>
> Thanks,
> Martin
>