[antlr-interest] Understanding priorities in lexing (newbie)

Fri Jul 13 02:26:32 PDT 2007

El 13/7/2007, a las 10:16, Gavin Lambert escribió:

> As you said, the fundamental DFA engine behind the lexer and parser  
> is the same, so the difference must lie elsewhere.  And there is  
> one difference: the lexer has one additional implicit rule, the  
> Tokens rule.  The parser has no equivalent; or rather the closest  
> equivalent it does have (the 'program' rule) contains a cycle (*),  
> so generates a very different DFA.
>
> Restating the examples a bit (and including the implicit rule)  
> makes this more obvious:
>
> lexer grammar Lex;
> KEYWORD : 'a' 'b' 'c';
> OTHER : 'a' | 'b' | 'c';
> Tokens : (  KEYWORD  |  OTHER  );
>
> parser grammar Parse;
> kEYWORD : A B C;
> oTHER : A | B | C;
> program : (  kEYWORD | oTHER  )*;
>
> The parser "works"; the lexer doesn't.  It has to be the Tokens rule.
>
> (Since the lexer example is explicitly showing an implicit rule,  
> it's not valid ANTLR -- if you actually want to run it in ANTLR,  
> you'd need to do one of the following:
> 1. remove the Tokens rule
> 2. make the KEYWORD and OTHER rules fragments, and possibly rename  
> the Tokens rule
> Similarly the parser would need to import its token vocab from  
> somewhere.)

Very good points, Gavin. It's clear if you look at the prediction  
code from the generated mTokens() method in the lexer grammar; note  
how seeing 'ab' is considered enough to predict 'abc':

         if ( (LA1_0=='a') ) {
             int LA1_1 = input.LA(2);

             if ( (LA1_1=='b') ) {
                 alt1=2;
             }
             else {
                 alt1=1;}
         }
         else if ( ((LA1_0>='b' && LA1_0<='c')) ) {
             alt1=1;
         }
         else {
             NoViableAltException nvae =
                 new NoViableAltException("1:1: Tokens : ( OTHER |  
KEYWORD );", 1, 0, input);

             throw nvae;
         }

And compare it with the prediction code in the "equivalent" parser  
grammar; here ABC will only be predicted if the parser sees ABC:

                 int alt1=3;
                 int LA1_0 = input.LA(1);

                 if ( (LA1_0==A) ) {
                     int LA1_2 = input.LA(2);

                     if ( (LA1_2==B) ) {
                         int LA1_4 = input.LA(3);

                         if ( (LA1_4==C) ) {
                             alt1=1;
                         }
                         else if ( (LA1_4==EOF||(LA1_4>=A &&  
LA1_4<=B)) ) {
                             alt1=2;
                         }

                     }
                     else if ( (LA1_2==EOF||LA1_2==A||LA1_2==C) ) {
                         alt1=2;
                     }

                 }
                 else if ( ((LA1_0>=B && LA1_0<=C)) ) {
                     alt1=2;
                 }

And I guess you're spot on in identifying that the cause for the  
difference in the prediction DFA is that the "program" rule has a (*)  
cycle and the synthesized rule corresponding to mTokens() does not.

The other obvious difference is that the lexer throws a  
NoViableAltException if it can't tokenize, whereas the parser just  
falls off the end of the program() method. You could add an explicit  
EOF to the program rule but that doesn't change the prediction logic,  
it just falls through the prediction DFA, hits the default case of  
the switch statement, exits the loop and then tries to match() an EOF  
(failing and throwing a RecognitionException).

Cheers,
Wincent