[antlr-interest] Understanding priorities in lexing (newbie)

Thu Jul 12 10:41:17 PDT 2007

	From: 	  parrt at cs.usfca.edu
	Subject: 	Re: [antlr-interest] Understanding priorities in lexing  
(newbie)
	Date: 	July 12, 2007 10:24:19 AM PDT
	To: 	  antlr-interest at antlr.org

On Jul 12, 2007, at 9:01 AM, Daniel Brosseau wrote:

> Hi,
>
> I am also new to ANTLR, I have read the book, want to thank you and  
> am quite excited. But this thread has me a little perplexed.
>
> If I have a simple grammar:
>
> grammar lex;
>
> KEYWORD : 'a' 'b' 'c';
> OTHER : 'a' | 'b' | 'c';
> program : (  KEYWORD  |  OTHER  )*
>
> and feed it  "abab" it chokes at the second 'a'.
>

Well, ANTLR uses priorities only whereas LEX did a "look for longest  
match" using a backtracking mechanism.  It's easy to do in a state  
machine, but less efficient using a recursive desc. lexer.  ANTLR  
says that 'ab' predicts KEYWORD but of course it gets tripped up as  
'c' doens't follow in your input.  ANTLR can't see past OTHER to what  
follows so 'a' could be followed by anything (remember error input  
etc...).  Hence, it's confusion.  It does not autobacktrack at  
runtime ala LEX unless you give it a predicate.  Check what  
ANTLRWorks says the DFA is for predicting the overall list of tokens.

> Now I think I understand what was said earlier and I have gone  
> through the code and I can see why it chokes but I do not yet  
> understand why this is proper behaviour. Coming from a LEX  
> background, there should be no problem converting this into a DFA  
> that works, its done all the time.
>
> To illustrate, if I change my grammar to the following:
>
> grammar lex;
>
> kEYWORD : 'a' 'b' 'c';
> oTHER : 'a' | 'b' | 'c';
> program : (  kEYWORD |  oTHER  )*
>
> and feed it "abab" it parses the input as I would expect, no  
> problem, properly identifying a sequence of four oTHER tokens.
>

Well, if you look at lex__.g it put 'a', 'b', 'c' tokens first in the  
output (as it pulls out tokens from parser and gives to lexer before  
lexer rules).

> Isn't that what I should get by default, it looks much more natural  
> and expected. Using filter=true cannot be the right answer for  
> general cases like this.
>
> The lexer should e keeping track of the longest token it has  
> matched todate and return that token if it fails to match another  
> longer token.
>

Easy in a state machine and hard withANTLR; it's building a recursive  
descent lexer.  Hard to backtrack across rules like that at the end  
of rule ala LEX.

> Here, it does not do that. If I further change my grammar to:
>
> grammar lex;
>
> KEYWORD : 'a' 'b';
> OTHER : 'a' | 'b';
> program : (  KEYWORD  |  OTHER  )*
>
> and feed it "aa" it correctly splits it into two OTHER tokens. The  
> only difference between the first grammar and this grammar is that  
> the distance between the length of the last acceptable matched  
> token and where further matches fail goes from 1 ("a" vs "ab") to 2  
> ("a" vs "abc" ) characters. But that should not make a difference,  
> although I know why it does in your case.
>

Remember that ANTLR pretty much assumes you are expressing precedence  
in lists of alternatives and list of tokens.

Ter