[antlr-interest] XML QName Character Validation

Fri Apr 4 10:19:10 PDT 2008

For NCName, I suggest you look only at the first character, then 
accept anything which is not a delimiter (e.g. ":", space, angle 
bracket, etc.).  After the match, call a routine to check that the 
match is a valid name.  This has two advantages:

a. It allows you to use the same grammar for both xml 1.0 and
xml 1.1 names.

b. The error messages are much better.  Compare:

    The element tag "foo@bar" is not a valid NCName
vs.
    Unexpected character "@" ...
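
A minimal sketch of such a check routine, in Java.  The class and
method names (NCNameCheck, isValidNCName, check) are made up for
illustration and are not part of ANTLR or of the original grammar.
The character ranges are the XML 1.1 NameStartChar and NameChar
productions with ":" left out; for XML 1.0 names you would swap in a
different table and leave the lexer rule alone.

```java
// Hypothetical helper (not from the original post): validates a loosely
// lexed token against the XML 1.1 Name productions, minus ':' for NCName.
final class NCNameCheck {
    // Inclusive code point ranges for NameStartChar in XML 1.1, without ':'.
    private static final int[][] START = {
        {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF},
        {0x370, 0x37D}, {0x37F, 0x1FFF}, {0x200C, 0x200D},
        {0x2070, 0x218F}, {0x2C00, 0x2FEF}, {0x3001, 0xD7FF},
        {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };
    // Additional ranges allowed after the first character (NameChar).
    private static final int[][] REST = {
        {'-', '-'}, {'.', '.'}, {'0', '9'}, {0xB7, 0xB7},
        {0x300, 0x36F}, {0x203F, 0x2040}
    };

    private static boolean inRanges(int[][] ranges, int cp) {
        for (int[] r : ranges) {
            if (cp >= r[0] && cp <= r[1]) return true;
        }
        return false;
    }

    // True if s is a non-empty NCName under the tables above.
    static boolean isValidNCName(String s) {
        if (s == null || s.isEmpty()) return false;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);  // handles surrogate pairs
            boolean ok = inRanges(START, cp) || (i > 0 && inRanges(REST, cp));
            if (!ok) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    // Returns null for a valid name, otherwise a whole-token error message.
    static String check(String tag) {
        return isValidNCName(tag) ? null
            : "The element tag \"" + tag + "\" is not a valid NCName";
    }
}
```

The lexer rule would then only need to match "anything up to the next
delimiter" and hand the matched text to check(), reporting the
returned message if it is not null.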

On Fri, 4 Apr 2008, Martin Probst wrote:

> Hi all,
>
> I'm making really nice progress with my XQuery grammar, thanks to the help of 
> Jim Idle and Ter's awesome LL(*) algorithm.
>
> I'm facing a single last problem: in XQuery, QNames play an important role. 
> QNames and keywords overlap, so it's a keyword-free grammar. The rules on 
> what makes a legal QName from the XML spec are quite complex in their 
> selection of Unicode characters, see here: 
> http://www.w3.org/TR/REC-xml/#NT-Letter
>
> If I naively translate that to ANTLR fragment rules, ANTLR fails to analyze 
> those rules:
> warning(205): XQuery.g:1:8: ANTLR could not analyze this decision in rule 
> Tokens; often this is because of recursive rule references visible from the 
> left edge of alternatives.  ANTLR will re-analyze the decision with a fixed 
> lookahead of k=1.  Consider using "options {k=1;}" for that decision and 
> possibly adding a syntactic predicate.
> error(10):  internal error: 
> org.antlr.tool.Grammar.createLookaheadDFA(Grammar.java:1152): could not even 
> do k=1 for decision 24; reason: timed out (>1000ms)
>
> No blame here, those rules are probably better handled in a different way. 
> The question is: which different way? I tried taking those rules out into a 
> Java file called CharHelper, and having these rules:
>
> ... lots of tokens, snip ...
> UNION	:	'union';
> UNORDERED
> 		:	'unordered';
> ... snip ...
> QName	:	NCName (':' NCName)?;
> fragment NCName	:	NCNameStartChar NCNameChar*;
> fragment NCNameStartChar
> 		:	Letter | '_';
> fragment NCNameChar
> 		:	Letter | XMLDigit | '.' | '-' | '_' | CombiningChar | Extender;
> fragment Letter
> 		:	{ CharHelper.isLetter(LA(1)) }? =>  .;
> fragment BaseChar
> 		:	{ CharHelper.isBaseChar(LA(1)) }? =>  .;
> fragment Ideographic
> 		:	{ CharHelper.isIdeographic(LA(1)) }? =>  .;
> fragment XMLDigit
> 		:	{ CharHelper.isXMLDigit(LA(1)) }? =>  .;
> fragment CombiningChar
> 		:	{ CharHelper.isCombiningChar(LA(1)) }? =>  .;
> fragment Extender
> 		:	{ CharHelper.isExtender(LA(1)) }? =>  .;
>
> But this makes ANTLR complain about ambiguities:
> warning(209): XQuery.g:319:1: Multiple token rules can match input such as 
> "'u'": UNION, UNORDERED, QName, Wildcard
> As a result, tokens(s) UNORDERED,QName,Wildcard were disabled for that input
>
> So apparently the lexical analysis is now behaving quite differently. Before 
> all this, I just specified NCName to be ('a'..'z' | 'A'..'Z')+ and it worked 
> like a charm. I somehow fail to see how (effectively) changing that to 
> '\u0000'..'\uFFFE' with a gating predicate changes this to be ambiguous.
>
> Any ideas?
>
> BTW: I've implemented a nice technique of lexer switching based on parser 
> context. XQuery is a language whose lexical structure changes quite radically 
> between normal expressions and the embedded XML literals.
>
> This is somewhat similar to the example at 
> http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control. 
> However I don't need to switch the full parser, as my grammatical rules for 
> the whole language fit into one grammar - I just needed to change the way 
> tokens are generated by the lexer. I'm going to write up my technique as an 
> addendum to that page once it's done.
>
> I'd also like to make my grammar freely available on antlr.org once it's 
> done, if there is interest. Do I just send it to Ter, or how does that work?
>
> Thanks,
> Martin
>