[antlr-interest] XML QName Character Validation

Fri Apr 4 02:24:54 PDT 2008

Hi all,

I'm making really nice progress with my XQuery grammar, thanks to the  
help of Jim Idle and Ter's awesome LL(*) algorithm.

I'm facing a single last problem: in XQuery, QNames play an important  
role. QNames and keywords overlap, so it's a keyword-free grammar. The  
rules on what makes a legal QName from the XML spec are quite complex  
in their selection of Unicode characters, see here: http://www.w3.org/TR/REC-xml/#NT-Letter

If I naively translate that to ANTLR fragment rules, ANTLR fails to  
analyze those rules:
warning(205): XQuery.g:1:8: ANTLR could not analyze this decision in  
rule Tokens; often this is because of recursive rule references  
visible from the left edge of alternatives.  ANTLR will re-analyze the  
decision with a fixed lookahead of k=1.  Consider using "options  
{k=1;}" for that decision and possibly adding a syntactic predicate.
error(10):  internal error:  
org.antlr.tool.Grammar.createLookaheadDFA(Grammar.java:1152): could  
not even do k=1 for decision 24; reason: timed out (>1000ms)

No blame here, those rules are probably better handled in a different  
way. The question is: which different way? I tried taking those rules  
out into a Java file called CharHelper, and having these rules:

... lots of tokens, snip ...
UNION	:	'union';
UNORDERED
		:	'unordered';
... snip ...
QName	:	NCName (':' NCName)?;
fragment NCName	:	NCNameStartChar NCNameChar*;
fragment NCNameStartChar
		:	Letter | '_';
fragment NCNameChar
		:	Letter | XMLDigit | '.' | '-' | '_' | CombiningChar | Extender;
fragment Letter
		:	{ CharHelper.isLetter(LA(1)) }? =>  .;
fragment BaseChar
		:	{ CharHelper.isBaseChar(LA(1)) }? =>  .;
fragment Ideographic	
		:	{ CharHelper.isIdeographic(LA(1)) }? =>  .;
fragment XMLDigit
		:	{ CharHelper.isXMLDigit(LA(1)) }? =>  .;
fragment CombiningChar
		:	{ CharHelper.isCombiningChar(LA(1)) }? =>  .;
fragment Extender
		:	{ CharHelper.isExtender(LA(1)) }? =>  .;
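
The CharHelper class referenced by the predicates above is not shown in this message. A minimal sketch of what such a helper could look like, assuming sorted range tables searched by binary search: the method names follow the predicates above, but the table layout and the inRanges helper are hypothetical. The Ideographic and Extender tables follow the XML 1.0 productions; the BaseChar table is deliberately truncated to its first five ranges and would have to be completed from the spec. isXMLDigit and isCombiningChar would follow the same pattern and are omitted here.

```java
/** Hypothetical sketch of a CharHelper; not the class used in the original grammar. */
final class CharHelper {
    private CharHelper() {}

    // Each table is a flat list of inclusive {low, high} pairs, sorted and non-overlapping.

    // ASSUMPTION: truncated. Only the first five BaseChar ranges of XML 1.0 are listed.
    private static final int[] BASE_CHAR = {
        0x0041, 0x005A,
        0x0061, 0x007A,
        0x00C0, 0x00D6,
        0x00D8, 0x00F6,
        0x00F8, 0x00FF,
    };

    // Ideographic ::= [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]
    private static final int[] IDEOGRAPHIC = {
        0x3007, 0x3007,
        0x3021, 0x3029,
        0x4E00, 0x9FA5,
    };

    // Extender production of XML 1.0, sorted by code point.
    private static final int[] EXTENDER = {
        0x00B7, 0x00B7,
        0x02D0, 0x02D1,
        0x0387, 0x0387,
        0x0640, 0x0640,
        0x0E46, 0x0E46,
        0x0EC6, 0x0EC6,
        0x3005, 0x3005,
        0x3031, 0x3035,
        0x309D, 0x309E,
        0x30FC, 0x30FE,
    };

    /** Binary search for c in a flat table of inclusive ranges. */
    private static boolean inRanges(int[] ranges, int c) {
        int lo = 0;
        int hi = ranges.length / 2 - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (c < ranges[2 * mid]) {
                hi = mid - 1;
            } else if (c > ranges[2 * mid + 1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false; // also covers EOF (-1), which is below every range
    }

    static boolean isBaseChar(int c)    { return inRanges(BASE_CHAR, c); }
    static boolean isIdeographic(int c) { return inRanges(IDEOGRAPHIC, c); }
    static boolean isExtender(int c)    { return inRanges(EXTENDER, c); }

    // Letter ::= BaseChar | Ideographic
    static boolean isLetter(int c)      { return isBaseChar(c) || isIdeographic(c); }
}
```

The methods take an int rather than a char so that a lookahead value of -1 (end of input) can be passed in directly and is simply rejected.
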

But this makes ANTLR complain about ambiguities:
warning(209): XQuery.g:319:1: Multiple token rules can match input  
such as "'u'": UNION, UNORDERED, QName, Wildcard
As a result, tokens(s) UNORDERED,QName,Wildcard were disabled for that  
input

So apparently the lexical analysis is now behaving quite differently.  
Before all this, I just specified NCName to be ('a'..'z' | 'A'..'Z')+  
and it worked like a charm. I somehow fail to see how (effectively)  
changing that to '\u0000'..'\uFFFE' with a gating predicate changes  
this to be ambiguous.

Any ideas?

BTW: I've implemented a nice technique of lexer switching based on  
parser context. XQuery is a language whose lexical structure changes  
quite radically between normal expressions and the embedded XML  
literals.

This is somewhat similar to the example at
<http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control>.
However, I don't need to switch the full parser, as my grammatical  
rules for the whole language fit into one grammar - I just needed to  
change the way tokens are generated by the lexer. I'm going to write  
up my technique as an addendum to that page once it's done.
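
As a rough illustration of the general idea only (the actual technique is not shown in this message), a lexer can expose a mode that the parser flips when it enters or leaves an XML literal. The sketch below is a toy, hand-written Java lexer, not ANTLR-generated code; the names ModalLexer, Mode, setMode and nextToken are all hypothetical. Tokens have to be pulled on demand rather than buffered ahead, otherwise the mode change would arrive too late.

```java
/** Toy lexer whose tokenization depends on a mode set by the parser (hypothetical sketch). */
final class ModalLexer {
    enum Mode { EXPRESSION, ELEMENT_CONTENT }

    private final String input;
    private int pos = 0;
    private Mode mode = Mode.EXPRESSION;

    ModalLexer(String input) {
        this.input = input;
    }

    /** Called by the parser when it enters or leaves an XML literal. */
    void setMode(Mode mode) {
        this.mode = mode;
    }

    /** Returns the text of the next token, or null at end of input. */
    String nextToken() {
        // Whitespace is insignificant in expressions but significant in element content.
        if (mode == Mode.EXPRESSION) {
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
        }
        if (pos >= input.length()) {
            return null;
        }
        int start = pos;
        char c = input.charAt(pos);
        if (mode == Mode.ELEMENT_CONTENT && c != '<' && c != '{') {
            // Character data: everything up to the next tag or enclosed expression.
            while (pos < input.length()
                    && input.charAt(pos) != '<' && input.charAt(pos) != '{') {
                pos++;
            }
        } else if (mode == Mode.EXPRESSION && Character.isLetter(c)) {
            // A simplified name token.
            while (pos < input.length() && Character.isLetterOrDigit(input.charAt(pos))) {
                pos++;
            }
        } else {
            pos++; // any other character is a single-character token
        }
        return input.substring(start, pos);
    }
}
```

With the input "a <b> c d", the tokens in expression mode are "a", "<", "b" and ">"; if the parser then switches to ELEMENT_CONTENT, the rest of the input comes back as the single token " c d".
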

I'd also like to make my grammar freely available on antlr.org once  
it's done, if there is interest. Do I just send it to Ter, or how does  
that work?

Thanks,
Martin