[antlr-interest] Understanding priorities in lexing (newbie)
Daniel Brosseau
daniel at lba.ca
Thu Jul 12 11:41:09 PDT 2007
Thanks so much for your reply.
I'm still trying to get a grasp of this. If I take the two first grammars
below, the only difference in the expression of the grammars is that the
first uses KEYWORD and OTHER while the second uses kEYWORD and oTHER. As
would be commonly understood, both express exactly the same grammar but get
translated into very different state machines, one by the lexer and one by
the parser. This I guess is my first source of confusion. The lexer rules
are clearly not a normal gramatical expression, but now I am trying to
understand what it is they are expressing: my second area of confusion. Are
they just a statement of a raw DFA? This mean that anytime a token has as a
prefix another token the lexer will choke. Such shared prefixes are not
always evident (as was the case of the grammar that started this thread:
'int' vs '.') and can be quite common (between keywords and identifiers for
example).
But now I have to figure out what other cases may not turn
out as I might have expected. That's ok, but a little tricky. In any event,
although ANTLR presents lexer rules and parser rules in a way that makes
them look very similar they are evidently very different beasts indeed.
Regards,
Daniel
----- Original Message -----
From: "Terence Parr" <parrt at cs.usfca.edu>
To: "antlr-interest Interest" <antlr-interest at antlr.org>
Sent: Thursday, July 12, 2007 1:41 PM
Subject: Re: [antlr-interest] Understanding priorities in lexing (newbie)
> From: parrt at cs.usfca.edu
> Subject: Re: [antlr-interest] Understanding priorities in lexing (newbie)
> Date: July 12, 2007 10:24:19 AM PDT
> To: antlr-interest at antlr.org
>
>
> On Jul 12, 2007, at 9:01 AM, Daniel Brosseau wrote:
>
>
>> Hi,
>>
>> I am also new to ANTLR, I have read the book, want to thank you and am
>> quite excited. But this thread has me a little perplexed.
>>
>> If I have a simple grammar:
>>
>> grammar lex;
>>
>> KEYWORD : 'a' 'b' 'c';
>> OTHER : 'a' | 'b' | 'c';
>> program : ( KEYWORD | OTHER )*
>>
>> and feed it "abab" it chokes at the second 'a'.
>>
>
> Well, ANTLR uses priorities only whereas LEX did a "look for longest
> match" using a backtracking mechanism. It's easy to do in a state
> machine, but less efficient using a recursive desc. lexer. ANTLR says
> that 'ab' predicts KEYWORD but of course it gets tripped up as 'c'
> doens't follow in your input. ANTLR can't see past OTHER to what follows
> so 'a' could be followed by anything (remember error input etc...).
> Hence, it's confusion. It does not autobacktrack at runtime ala LEX
> unless you give it a predicate. Check what ANTLRWorks says the DFA is
> for predicting the overall list of tokens.
>
>> Now I think I understand what was said earlier and I have gone through
>> the code and I can see why it chokes but I do not yet understand why
>> this is proper behaviour. Coming from a LEX background, there should be
>> no problem converting this into a DFA that works, its done all the time.
>>
>> To illustrate, if I change my grammar to the following:
>>
>> grammar lex;
>>
>> kEYWORD : 'a' 'b' 'c';
>> oTHER : 'a' | 'b' | 'c';
>> program : ( kEYWORD | oTHER )*
>>
>> and feed it "abab" it parses the input as I would expect, no problem,
>> properly identifying a sequence of four oTHER tokens.
>>
>
> Well, if you look at lex__.g it put 'a', 'b', 'c' tokens first in the
> output (as it pulls out tokens from parser and gives to lexer before
> lexer rules).
>
>
>> Isn't that what I should get by default, it looks much more natural and
>> expected. Using filter=true cannot be the right answer for general cases
>> like this.
>>
>> The lexer should e keeping track of the longest token it has matched
>> todate and return that token if it fails to match another longer token.
>>
>
> Easy in a state machine and hard withANTLR; it's building a recursive
> descent lexer. Hard to backtrack across rules like that at the end of
> rule ala LEX.
>
>
>> Here, it does not do that. If I further change my grammar to:
>>
>> grammar lex;
>>
>> KEYWORD : 'a' 'b';
>> OTHER : 'a' | 'b';
>> program : ( KEYWORD | OTHER )*
>>
>> and feed it "aa" it correctly splits it into two OTHER tokens. The only
>> difference between the first grammar and this grammar is that the
>> distance between the length of the last acceptable matched token and
>> where further matches fail goes from 1 ("a" vs "ab") to 2 ("a" vs
>> "abc" ) characters. But that should not make a difference, although I
>> know why it does in your case.
>>
>
> Remember that ANTLR pretty much assumes you are expressing precedence in
> lists of alternatives and list of tokens.
>
> Ter
>
More information about the antlr-interest
mailing list