[antlr-interest] Understanding priorities in lexing (newbie)

Fri Jul 13 01:16:43 PDT 2007

At 11:18 13/07/2007, Terence Parr wrote:
 >> Case 1:
 >> grammar lex;
 >> KEYWORD : 'a' 'b' 'c';
 >> OTHER : 'a' | 'b' | 'c';
 >> program : (  KEYWORD  |  OTHER  )*
 >>
 >> Input: "aba" chokes on second a
 >>
 >> Case 2:
 >> grammar lex;
 >> kEYWORD : 'a' 'b' 'c';
 >> oTHER : 'a' | 'b' | 'c';
 >> program : (  kEYWORD |  oTHER  )*
 >>
 >> Input: "aba" outputs oTHER oTHER oTHER
 >>
 >> Same grammar, two different state machines.
 >
 >Well, more that the first grammar has 2 lexer rules and the
 >second has three: one for each char.

Yes, which simply means that the lexer is different.  I don't 
think that was the point :)  For the first example, you should 
consider the lexer *only*, and in the second example, the parser 
*only*.  The other halves are irrelevant.

In the first case, the input to the lexer is the character 
sequence 'a','b','a'.  In the second case, the input to the 
*parser* is the token sequence A,B,A.  There is no fundamental 
difference here -- the rules for dealing with that input appear 
the same and therefore should result in the same output.  But they 
don't.

As you said, the fundamental DFA engine behind the lexer and 
parser is the same, so the difference must lie elsewhere.  And 
there is one difference: the lexer has one additional implicit 
rule, the Tokens rule.  The parser has no equivalent; or rather 
the closest equivalent it does have (the 'program' rule) contains 
a cycle (*), so generates a very different DFA.

Restating the examples a bit (and including the implicit rule) 
makes this more obvious:

lexer grammar Lex;
KEYWORD : 'a' 'b' 'c';
OTHER : 'a' | 'b' | 'c';
Tokens : (  KEYWORD  |  OTHER  );

parser grammar Parse;
kEYWORD : A B C;
oTHER : A | B | C;
program : (  kEYWORD | oTHER  )*;

The parser "works"; the lexer doesn't.  It has to be the Tokens 
rule.

(Since the lexer example is explicitly showing an implicit rule, 
it's not valid ANTLR -- if you actually want to run it in ANTLR, 
you'd need to do one of the following:
1. remove the Tokens rule
2. make the KEYWORD and OTHER rules fragments, and possibly rename 
the Tokens rule
Similarly the parser would need to import its token vocab from 
somewhere.)