[antlr-interest] Understanding priorities in lexing (newbie)

Thu Jul 12 11:14:09 PDT 2007

On 7/13/07, Terence Parr <parrt at cs.usfca.edu> wrote:
>         From:     parrt at cs.usfca.edu
>         Subject:        Re: [antlr-interest] Understanding priorities in lexing
> (newbie)
>         Date:   July 12, 2007 10:24:19 AM PDT
>         To:       antlr-interest at antlr.org
>
>
> On Jul 12, 2007, at 9:01 AM, Daniel Brosseau wrote:
>
>
> > Hi,
> >
> > I am also new to ANTLR, I have read the book, want to thank you and
> > am quite excited. But this thread has me a little perplexed.
> >
> > If I have a simple grammar:
> >
> > grammar lex;
> >
> > KEYWORD : 'a' 'b' 'c';
> > OTHER : 'a' | 'b' | 'c';
> > program : (  KEYWORD  |  OTHER  )*
> >
> > and feed it  "abab" it chokes at the second 'a'.
> >
>
> Well, ANTLR uses priorities only whereas LEX did a "look for longest
> match" using a backtracking mechanism.  It's easy to do in a state
> machine, but less efficient using a recursive desc. lexer.  ANTLR
> says that 'ab' predicts KEYWORD but of course it gets tripped up as
> 'c' doens't follow in your input.  ANTLR can't see past OTHER to what
> follows so 'a' could be followed by anything (remember error input
> etc...).  Hence, it's confusion.  It does not autobacktrack at
> runtime ala LEX unless you give it a predicate.  Check what
> ANTLRWorks says the DFA is for predicting the overall list of tokens.
<SNIP>
>
> Easy in a state machine and hard withANTLR; it's building a recursive
> descent lexer.  Hard to backtrack across rules like that at the end
> of rule ala LEX.
I can see that ANTLR can't backtrack across rules, but could the
prediction calculation be changed to handle it?
With the grammar above, you get a mTokens method which basically
expresses the rule ( KEYWORD | OTHER ) as in the comment for this
method (i.e. all accessible tokens as alternates). For this it
generates a predictor based on the first two characters of input only.
Given the rule being coded this is correct, it's only alternatives are
to match a KEYWORD or a single OTHER token, so matching input 'ab' as
two OTHERs is not an option. But this seems counter-intuitive as of
course mTokens will be called multiple times. The natural intuition
seems to be that mTokens should be predicting based on
(KEYWORD|OTHER)+.
If you change the grammar to:
grammar lex2;
start
	:	PROGRAM;

PROGRAM : (  KEYWORD  |  OTHER  )+;

fragment KEYWORD : 'a' 'b' 'c';

fragment OTHER : 'a' | 'b' | 'c';

Then in PROGRAM you get a predictor that checks for 'abc' before
deciding that it is a KEYWORD.
Is it not possible to have the next token predictor function like this?
>
> Ter
>
Tom.