[antlr-interest] Understanding priorities in lexing (newbie)

Thu Jul 12 15:18:05 PDT 2007

Hi,

Love this:

> Well, it does what I expected so it's "correct", just not what you  want 
> ;)
>

Case 1:
grammar lex;
KEYWORD : 'a' 'b' 'c';
OTHER : 'a' | 'b' | 'c';
program : (  KEYWORD  |  OTHER  )*

Input: "aba" chokes on second a

Case 2:
grammar lex;
kEYWORD : 'a' 'b' 'c';
oTHER : 'a' | 'b' | 'c';
program : (  kEYWORD |  oTHER  )*

Input: "aba" outputs oTHER oTHER oTHER

Same grammar, two different state machines.

As I tried to say earlier, although the rules language used for the lexer 
and parser seems to be describing things in the same manner, they in fact 
describe very different state machines. So at the least this is an 
inconsistency which leads to confusion.

>From every book I have ever read on language parsing, using the above 
grammar, 'ab' does not match KEYWORD but unambigously matches only OTHER 
OTHER. This is dealt with by the parser state machine (using kEYWORD and 
oTHER instead of KEYWORD and OTHER) as oTHER oTHER and not as a missed 
kEYWORD. So the lexer is producing state machines that interpret the same 
gammar rules in totaly different ways from common usage and the parser's 
usage.

As you say, this is correct because you say it is correct but it is 
confusing, unconventional and hard. I can decide to speak my own language 
and use the same words others do but give them different meanings. I can 
even change their meaning from sentence to sentence and consider all of this 
as correct if I am the corrector. But that's streaching it.

Its just that I need to figure out what other ramifications this has and if 
it was at all possible to have the lexer and the parser give the same 
meaning to their grammar rules, it would make it easier to think 
consistently and use.

Regards,

Daniel

----- Original Message ----- 
From: "Terence Parr" <parrt at cs.usfca.edu>
To: "antlr-interest Interest" <antlr-interest at antlr.org>
Sent: Thursday, July 12, 2007 4:20 PM
Subject: Re: [antlr-interest] Understanding priorities in lexing (newbie)

>
> On Jul 12, 2007, at 1:11 PM, Gavin Lambert wrote:
>
>> At 07:46 13/07/2007, Terence Parr wrote:
>> >Hi Tom.  Actually even if I did, OTHER OTHER matches 'ab' as
>> >does KEYWORD and so it has to resolve the ambiguity, which it does in
>> >favor of first rule specified.
>>
>> The point is that 'ab' *doesn't* match KEYWORD -- except in the  mind of 
>> the predictor, since it isn't checking the whole rule.  So  an input of 
>> 'ab' should unambigously result in OTHER OTHER; an  input of 'abc' 
>> *could* result in either OTHER OTHER OTHER or  KEYWORD, but the normal 
>> "pick the longest match and/or the first  listed" rules sort out that 
>> ambiguity.
>
> Yes.  ANTLR doesn't do the natural thing here.  For normal cases,  it's 
> not an issue.  Few tokens are prefixes like that.  Normally it's  keyword 
> against 'a'..'z'+ not 'a'..'z'.
>
>> In the current implementation, though, the predictor sees 'ab' and 
>> immediately declares "That must be a KEYWORD!" -- even when the  input is 
>> actually 'aba', whose only "correct" output would be OTHER  OTHER OTHER. 
>> So this results in an exception rather than producing  the right output.
>
> Well, it does what I expected so it's "correct", just not what you  want 
> ;)
>
>> >It uses PROGRAM rule w/o the + because what if you had an error
>> >char?
>>
>> I'm not sure what you meant by this.
>
> I create Tokens : T1 | T2 | T3 ... ;
>
> for tokens to do matching.
>
>> >There is an implied loop to PROGRAM in nextToken() method.
>>
>> But the predictor doesn't know anything about it -- hence the problem.
>
> It assume any char because that is correct.  You could put any char  after 
> a token, yes?
>
>> This whole thing makes it really hard to write correct lexers --  
>> especially since ANTLR also seems to ignore predicates if it thinks  it 
>> knows better.  If this one thing was fixed then it'd make ANTLR 
>> significantly easier to use.
>
> (...)=> and {...}?=> will always be executed.
>
>> And I've been saying that for ages now :)
>
> And not reading about {...}?=> ? ;)  They should always be executed.
>
> Ter
>