[antlr-interest] token precedence by decl order - or tutorial ambiguous

Sun Mar 4 21:21:23 PST 2012

Hello,

To briefly introduce myself, I am a new user of ANTLR and intend to use it to
help me develop grammars and parsers for my new database-savvy programming
language, Muldis D; everything related to it is at https://github.com/muldis/ .

For the main point of this post ...

The tutorial at 
http://www.antlr.org/wiki/display/ANTLR3/Quick+Starter+on+Parser+Grammars+-+No+Past+Experience+Required 
is useful, but I found at least one part of it ambiguous. I'm hoping you can
help me clear up my understanding, and maybe the tutorial itself can be
clarified too.

The relevant portion of the tutorial is here between the pair of === lines:

==========

Another point of interest is the order of the token declaration. The earlier a 
token is defined, the higher is the precedence if a certain input can be matched 
by two or more tokens. This means that using the tokens command to define 
keywords will match those keywords instead a more general ID rule. The following 
code snippet provides an example:

   start
    :  (WS |  FOO)* EOF
    ;
   WS : (options {greedy=false;} : ' '+) ;
   FOO : ~('x' | 'y' | 'z')+ ;

If you give an input containing only spaces then WS will be chosen. Should one 
change the rules order that FOO comes before WS then FOO will be chosen. Any 
input containing other characters than spaces will match FOO, even if two or 
more WS and FOO tokens could be produced. The lexer rules will match greedily 
the maximum of applicable characters.

==========

The ambiguity I see concerns the "order of the token declaration" part.  I
don't know whether it refers to the order of the alternatives in the parser
rule line containing "(WS |  FOO)" or to the order in which the 2 lexer rule
lines "WS : ..." and "FOO : ..." appear.  The example cannot settle this,
because the order is [WS,FOO] in both places.

If the tutorial's example were changed to, say, either of these, so that the
two orders differ, it would be much clearer:

1:

   start
    :  (FOO |  WS)* EOF
    ;
   WS : (options {greedy=false;} : ' '+) ;
   FOO : ~('x' | 'y' | 'z')+ ;

2:

   start
    :  (WS |  FOO)* EOF
    ;
   FOO : ~('x' | 'y' | 'z')+ ;
   WS : (options {greedy=false;} : ' '+) ;

So, bottom line, between #1 and #2 here, which example would have WS taking 
higher precedence, and which example would have FOO taking higher precedence?
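
To make the question concrete, here is a toy Python model of one possible
reading: that precedence follows the order of the lexer rule definitions, with
the longest match winning and ties going to the rule listed first. This is only
my assumption about what the tutorial means, not ANTLR's actual algorithm. The
tokenize function and the regex translations of WS and FOO are my own
inventions for illustration, and the sketch ignores the greedy=false option.

```python
import re


def tokenize(text, rules):
    """Toy lexer: longest match wins; ties go to the rule listed first.

    This models one reading of the tutorial, NOT ANTLR's real algorithm.
    `rules` is a list of (name, regex) pairs in declaration order.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Hypothetical regex stand-ins for the two lexer rules in the tutorial.
WS = ("WS", r" +")
FOO = ("FOO", r"[^xyz]+")

print(tokenize("   ", [WS, FOO]))  # rule order as in example #1
print(tokenize("   ", [FOO, WS]))  # rule order as in example #2
print(tokenize("a b", [WS, FOO]))  # input with non-space characters
```

Under this reading, all-space input would come out as WS with the rule order of
#1 and as FOO with the rule order of #2, and the order of the alternatives in
the start rule would not matter at all. I don't know whether that is what ANTLR
really does, hence my question.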

Thank you in advance for your help.

-- Darren Duncan