[antlr-interest] Understanding priorities in lexing (newbie)

Thu Jul 12 16:18:36 PDT 2007

On Jul 12, 2007, at 3:18 PM, Daniel Brosseau wrote:
> Case 1:
> grammar lex;
> KEYWORD : 'a' 'b' 'c';
> OTHER : 'a' | 'b' | 'c';
> program : (  KEYWORD  |  OTHER  )*
>
> Input: "aba" chokes on second a
>
> Case 2:
> grammar lex;
> kEYWORD : 'a' 'b' 'c';
> oTHER : 'a' | 'b' | 'c';
> program : (  kEYWORD |  oTHER  )*
>
> Input: "aba" outputs oTHER oTHER oTHER
>
> Same grammar, two different state machines.

Well, more that the first grammar has 2 lexer rules and the second  
has three: one for each char.

> As I tried to say earlier, although the rules language used for the  
> lexer and parser seems to be describing things in the same manner,  
> they in fact describe very different state machines. So at the  
> least this is an inconsistency which leads to confusion.

It is in the interaction between the parser and lexer.  When you move  
rules wholesale into the parser from the lexer, shouldn't things be  
"different"?  This is NOT a scannerless parser.

>> From every book I have ever read on language parsing, using the above
> grammar, 'ab' does not match KEYWORD but unambigously matches only  
> OTHER OTHER.

Well, the (...)* decision in program will match ab as oTHER only if  
not followed by c.

> This is dealt with by the parser state machine (using kEYWORD and  
> oTHER instead of KEYWORD and OTHER) as oTHER oTHER and not as a  
> missed kEYWORD. So the lexer is producing state machines that  
> interpret the same gammar rules in totaly different ways from  
> common usage and the parser's usage.

Nope.  Move it to the lexer and DFA predicting (...)* is identical to  
that in the parser.

grammar lex;
KEYWORD : 'a' 'b' 'c';
OTHER : 'a' | 'b' | 'c';
Program : (  KEYWORD |  OTHER  )* ;

Trust me.  The same algorithm and code generator are used for both.   
One parsers tokens, the other char.

> As you say, this is correct because you say it is correct but it is  
> confusing, unconventional and hard. I can decide to speak my own  
> language and use the same words others do but give them different  
> meanings. I can even change their meaning from sentence to sentence  
> and consider all of this as correct if I am the corrector. But  
> that's streaching it.

Why are you picking on such a little example?  Try a real grammar or  
is this a condensation of a real grammar you're trying?

> Its just that I need to figure out what other ramifications this  
> has and if it was at all possible to have the lexer and the parser  
> give the same meaning to their grammar rules, it would make it  
> easier to think consistently and use.

They are identical...you are shuffling things between parser and  
lexer and saying it's not the same.  Of course not.

As a CA shrink would say, "just let go of LEX" ;)

Ter