[antlr-interest] Understanding priorities in lexing (newbie)
Thomas Brandon
tbrandonau at gmail.com
Thu Jul 12 11:14:09 PDT 2007
On 7/13/07, Terence Parr <parrt at cs.usfca.edu> wrote:
> From: parrt at cs.usfca.edu
> Subject: Re: [antlr-interest] Understanding priorities in lexing
> (newbie)
> Date: July 12, 2007 10:24:19 AM PDT
> To: antlr-interest at antlr.org
>
>
> On Jul 12, 2007, at 9:01 AM, Daniel Brosseau wrote:
>
>
> > Hi,
> >
> > I am also new to ANTLR, I have read the book, want to thank you and
> > am quite excited. But this thread has me a little perplexed.
> >
> > If I have a simple grammar:
> >
> > grammar lex;
> >
> > KEYWORD : 'a' 'b' 'c';
> > OTHER : 'a' | 'b' | 'c';
> > program : ( KEYWORD | OTHER )*
> >
> > and feed it "abab" it chokes at the second 'a'.
> >
>
> Well, ANTLR uses priorities only whereas LEX did a "look for longest
> match" using a backtracking mechanism. It's easy to do in a state
> machine, but less efficient using a recursive desc. lexer. ANTLR
> says that 'ab' predicts KEYWORD but of course it gets tripped up as
> 'c' doens't follow in your input. ANTLR can't see past OTHER to what
> follows so 'a' could be followed by anything (remember error input
> etc...). Hence, it's confusion. It does not autobacktrack at
> runtime ala LEX unless you give it a predicate. Check what
> ANTLRWorks says the DFA is for predicting the overall list of tokens.
<SNIP>
>
> Easy in a state machine and hard withANTLR; it's building a recursive
> descent lexer. Hard to backtrack across rules like that at the end
> of rule ala LEX.
I can see that ANTLR can't backtrack across rules, but could the
prediction calculation be changed to handle it?
With the grammar above, you get a mTokens method which basically
expresses the rule ( KEYWORD | OTHER ) as in the comment for this
method (i.e. all accessible tokens as alternates). For this it
generates a predictor based on the first two characters of input only.
Given the rule being coded this is correct, it's only alternatives are
to match a KEYWORD or a single OTHER token, so matching input 'ab' as
two OTHERs is not an option. But this seems counter-intuitive as of
course mTokens will be called multiple times. The natural intuition
seems to be that mTokens should be predicting based on
(KEYWORD|OTHER)+.
If you change the grammar to:
grammar lex2;
start
: PROGRAM;
PROGRAM : ( KEYWORD | OTHER )+;
fragment KEYWORD : 'a' 'b' 'c';
fragment OTHER : 'a' | 'b' | 'c';
Then in PROGRAM you get a predictor that checks for 'abc' before
deciding that it is a KEYWORD.
Is it not possible to have the next token predictor function like this?
>
> Ter
>
Tom.
More information about the antlr-interest
mailing list