[antlr-interest] Understanding priorities in lexing (newbie)
Daniel Brosseau
daniel at lba.ca
Thu Jul 12 15:18:05 PDT 2007
Hi,
Love this:
> Well, it does what I expected so it's "correct", just not what you want
> ;)
>
Case 1:
grammar lex;
KEYWORD : 'a' 'b' 'c';
OTHER : 'a' | 'b' | 'c';
program : ( KEYWORD | OTHER )*
Input: "aba" chokes on second a
Case 2:
grammar lex;
kEYWORD : 'a' 'b' 'c';
oTHER : 'a' | 'b' | 'c';
program : ( kEYWORD | oTHER )*
Input: "aba" outputs oTHER oTHER oTHER
Same grammar, two different state machines.
As I tried to say earlier, although the rules language used for the lexer
and parser seems to be describing things in the same manner, they in fact
describe very different state machines. So at the least this is an
inconsistency which leads to confusion.
>From every book I have ever read on language parsing, using the above
grammar, 'ab' does not match KEYWORD but unambigously matches only OTHER
OTHER. This is dealt with by the parser state machine (using kEYWORD and
oTHER instead of KEYWORD and OTHER) as oTHER oTHER and not as a missed
kEYWORD. So the lexer is producing state machines that interpret the same
gammar rules in totaly different ways from common usage and the parser's
usage.
As you say, this is correct because you say it is correct but it is
confusing, unconventional and hard. I can decide to speak my own language
and use the same words others do but give them different meanings. I can
even change their meaning from sentence to sentence and consider all of this
as correct if I am the corrector. But that's streaching it.
Its just that I need to figure out what other ramifications this has and if
it was at all possible to have the lexer and the parser give the same
meaning to their grammar rules, it would make it easier to think
consistently and use.
Regards,
Daniel
----- Original Message -----
From: "Terence Parr" <parrt at cs.usfca.edu>
To: "antlr-interest Interest" <antlr-interest at antlr.org>
Sent: Thursday, July 12, 2007 4:20 PM
Subject: Re: [antlr-interest] Understanding priorities in lexing (newbie)
>
> On Jul 12, 2007, at 1:11 PM, Gavin Lambert wrote:
>
>> At 07:46 13/07/2007, Terence Parr wrote:
>> >Hi Tom. Actually even if I did, OTHER OTHER matches 'ab' as
>> >does KEYWORD and so it has to resolve the ambiguity, which it does in
>> >favor of first rule specified.
>>
>> The point is that 'ab' *doesn't* match KEYWORD -- except in the mind of
>> the predictor, since it isn't checking the whole rule. So an input of
>> 'ab' should unambigously result in OTHER OTHER; an input of 'abc'
>> *could* result in either OTHER OTHER OTHER or KEYWORD, but the normal
>> "pick the longest match and/or the first listed" rules sort out that
>> ambiguity.
>
> Yes. ANTLR doesn't do the natural thing here. For normal cases, it's
> not an issue. Few tokens are prefixes like that. Normally it's keyword
> against 'a'..'z'+ not 'a'..'z'.
>
>> In the current implementation, though, the predictor sees 'ab' and
>> immediately declares "That must be a KEYWORD!" -- even when the input is
>> actually 'aba', whose only "correct" output would be OTHER OTHER OTHER.
>> So this results in an exception rather than producing the right output.
>
> Well, it does what I expected so it's "correct", just not what you want
> ;)
>
>> >It uses PROGRAM rule w/o the + because what if you had an error
>> >char?
>>
>> I'm not sure what you meant by this.
>
> I create Tokens : T1 | T2 | T3 ... ;
>
> for tokens to do matching.
>
>> >There is an implied loop to PROGRAM in nextToken() method.
>>
>> But the predictor doesn't know anything about it -- hence the problem.
>
> It assume any char because that is correct. You could put any char after
> a token, yes?
>
>> This whole thing makes it really hard to write correct lexers --
>> especially since ANTLR also seems to ignore predicates if it thinks it
>> knows better. If this one thing was fixed then it'd make ANTLR
>> significantly easier to use.
>
> (...)=> and {...}?=> will always be executed.
>
>> And I've been saying that for ages now :)
>
> And not reading about {...}?=> ? ;) They should always be executed.
>
> Ter
>
More information about the antlr-interest
mailing list