[antlr-interest] Understanding priorities in lexing (newbie)

Thu Jul 12 11:41:09 PDT 2007

Thanks so much for your reply.

I'm still trying to get a grasp of this. If I take the two first grammars
below, the only difference in the expression of the grammars is that the
first uses KEYWORD and OTHER while the second uses kEYWORD and oTHER. As
would be commonly understood, both express exactly the same grammar but get
translated into very different state machines, one by the lexer and one by
the parser. This I guess is my first source of confusion. The lexer rules
are clearly not a normal gramatical expression, but now I am trying to
understand what it is they are expressing: my second area of confusion. Are
they just a statement of a raw DFA? This mean that anytime a token has as a
prefix another token the lexer will choke. Such shared prefixes are not
always evident (as was the case of the grammar that started this thread:
'int' vs '.') and can be quite common (between keywords and identifiers for 
example).
But now I have to figure out what other cases may not turn
out as I might have expected. That's ok, but a little tricky. In any event,
although ANTLR presents lexer rules and parser rules in a way that makes
them look very similar they are evidently very different beasts indeed.

Regards,

Daniel

----- Original Message ----- 
From: "Terence Parr" <parrt at cs.usfca.edu>
To: "antlr-interest Interest" <antlr-interest at antlr.org>
Sent: Thursday, July 12, 2007 1:41 PM
Subject: Re: [antlr-interest] Understanding priorities in lexing (newbie)

> From:   parrt at cs.usfca.edu
> Subject: Re: [antlr-interest] Understanding priorities in lexing  (newbie)
> Date: July 12, 2007 10:24:19 AM PDT
> To:   antlr-interest at antlr.org
>
>
> On Jul 12, 2007, at 9:01 AM, Daniel Brosseau wrote:
>
>
>> Hi,
>>
>> I am also new to ANTLR, I have read the book, want to thank you and  am 
>> quite excited. But this thread has me a little perplexed.
>>
>> If I have a simple grammar:
>>
>> grammar lex;
>>
>> KEYWORD : 'a' 'b' 'c';
>> OTHER : 'a' | 'b' | 'c';
>> program : (  KEYWORD  |  OTHER  )*
>>
>> and feed it  "abab" it chokes at the second 'a'.
>>
>
> Well, ANTLR uses priorities only whereas LEX did a "look for longest 
> match" using a backtracking mechanism.  It's easy to do in a state 
> machine, but less efficient using a recursive desc. lexer.  ANTLR  says 
> that 'ab' predicts KEYWORD but of course it gets tripped up as  'c' 
> doens't follow in your input.  ANTLR can't see past OTHER to what  follows 
> so 'a' could be followed by anything (remember error input  etc...). 
> Hence, it's confusion.  It does not autobacktrack at  runtime ala LEX 
> unless you give it a predicate.  Check what  ANTLRWorks says the DFA is 
> for predicting the overall list of tokens.
>
>> Now I think I understand what was said earlier and I have gone  through 
>> the code and I can see why it chokes but I do not yet  understand why 
>> this is proper behaviour. Coming from a LEX  background, there should be 
>> no problem converting this into a DFA  that works, its done all the time.
>>
>> To illustrate, if I change my grammar to the following:
>>
>> grammar lex;
>>
>> kEYWORD : 'a' 'b' 'c';
>> oTHER : 'a' | 'b' | 'c';
>> program : (  kEYWORD |  oTHER  )*
>>
>> and feed it "abab" it parses the input as I would expect, no  problem, 
>> properly identifying a sequence of four oTHER tokens.
>>
>
> Well, if you look at lex__.g it put 'a', 'b', 'c' tokens first in the 
> output (as it pulls out tokens from parser and gives to lexer before 
> lexer rules).
>
>
>> Isn't that what I should get by default, it looks much more natural  and 
>> expected. Using filter=true cannot be the right answer for  general cases 
>> like this.
>>
>> The lexer should e keeping track of the longest token it has  matched 
>> todate and return that token if it fails to match another  longer token.
>>
>
> Easy in a state machine and hard withANTLR; it's building a recursive 
> descent lexer.  Hard to backtrack across rules like that at the end  of 
> rule ala LEX.
>
>
>> Here, it does not do that. If I further change my grammar to:
>>
>> grammar lex;
>>
>> KEYWORD : 'a' 'b';
>> OTHER : 'a' | 'b';
>> program : (  KEYWORD  |  OTHER  )*
>>
>> and feed it "aa" it correctly splits it into two OTHER tokens. The  only 
>> difference between the first grammar and this grammar is that  the 
>> distance between the length of the last acceptable matched  token and 
>> where further matches fail goes from 1 ("a" vs "ab") to 2  ("a" vs 
>> "abc" ) characters. But that should not make a difference,  although I 
>> know why it does in your case.
>>
>
> Remember that ANTLR pretty much assumes you are expressing precedence  in 
> lists of alternatives and list of tokens.
>
> Ter
>