[antlr-interest] Understanding priorities in lexing (newbie)

Fri Jul 13 07:02:39 PDT 2007

Hi,

Thank you all for the clear responses. I think I'm getting it.

>From the clarified grammar below, seperating the Lexer from the Parser, I 
can see that 'ab' is an error. From what I understand the DFA for the Lexer 
looks something like:

1 -- a --> 2 -- b --> 3 -- c --> 4
   -- b --> 5
   -- c --> 6

with 2, 4, 5 and 6 being accepting states. So after consumming characters 
and advancing through the state machine, ANTLR typically comes to a point 
where it can no longer advance and  it stops. If it stops on an accepting 
state it returns the appropriate token, if not it errors out. So on 'aba' it 
chokes on the second 'a'. But in a way, the character it stops on is always 
an error, a character that does not allow it to continue. ANTLR accepts this 
as ok if it is already on an accepting state and returns the token.

With LEX, the DFA is the same, but LEX makes an additional consession. It 
seems to accept that the character it stops on is part of a larger stream of 
characters that it will have to deal with later. So when it stops it looks 
to see if it is on an accepting state and if so returns that token but if it 
is not on an accepting state it asks itself the further question: What was 
the last accepting state I saw because the extra characters I consumed might 
just be part of the next token and not an error on the current token. So it 
looks back to find the last accepting state and leaves the extra characters 
for next time. This behaviour of the Lexer is realistic and useful.

Could this be done in ANTLR by just keeping track of the last longest 
accepting state in some variable. Evidently, if there are intervening 
actions, then you are really talking of a different grammar but if there are 
no interspersed  actions could this not be done?

Regards,

Daniel

----- Original Message ----- 
From: "Gavin Lambert" <antlr at mirality.co.nz>
To: "Terence Parr" <parrt at cs.usfca.edu>; "ANTLR-Interest" 
<antlr-interest at antlr.org>
Sent: Friday, July 13, 2007 4:16 AM
Subject: Re: [antlr-interest] Understanding priorities in lexing (newbie)

> At 11:18 13/07/2007, Terence Parr wrote:
> >> Case 1:
> >> grammar lex;
> >> KEYWORD : 'a' 'b' 'c';
> >> OTHER : 'a' | 'b' | 'c';
> >> program : (  KEYWORD  |  OTHER  )*
> >>
> >> Input: "aba" chokes on second a
> >>
> >> Case 2:
> >> grammar lex;
> >> kEYWORD : 'a' 'b' 'c';
> >> oTHER : 'a' | 'b' | 'c';
> >> program : (  kEYWORD |  oTHER  )*
> >>
> >> Input: "aba" outputs oTHER oTHER oTHER
> >>
> >> Same grammar, two different state machines.
> >
> >Well, more that the first grammar has 2 lexer rules and the
> >second has three: one for each char.
>
> Yes, which simply means that the lexer is different.  I don't think that 
> was the point :)  For the first example, you should consider the lexer 
> *only*, and in the second example, the parser *only*.  The other halves 
> are irrelevant.
>
> In the first case, the input to the lexer is the character sequence 
> 'a','b','a'.  In the second case, the input to the *parser* is the token 
> sequence A,B,A.  There is no fundamental difference here -- the rules for 
> dealing with that input appear the same and therefore should result in the 
> same output.  But they don't.
>
> As you said, the fundamental DFA engine behind the lexer and parser is the 
> same, so the difference must lie elsewhere.  And there is one difference: 
> the lexer has one additional implicit rule, the Tokens rule.  The parser 
> has no equivalent; or rather the closest equivalent it does have (the 
> 'program' rule) contains a cycle (*), so generates a very different DFA.
>
> Restating the examples a bit (and including the implicit rule) makes this 
> more obvious:
>
> lexer grammar Lex;
> KEYWORD : 'a' 'b' 'c';
> OTHER : 'a' | 'b' | 'c';
> Tokens : (  KEYWORD  |  OTHER  );
>
> parser grammar Parse;
> kEYWORD : A B C;
> oTHER : A | B | C;
> program : (  kEYWORD | oTHER  )*;
>
> The parser "works"; the lexer doesn't.  It has to be the Tokens rule.
>
> (Since the lexer example is explicitly showing an implicit rule, it's not 
> valid ANTLR -- if you actually want to run it in ANTLR, you'd need to do 
> one of the following:
> 1. remove the Tokens rule
> 2. make the KEYWORD and OTHER rules fragments, and possibly rename the 
> Tokens rule
> Similarly the parser would need to import its token vocab from somewhere.)
>