[antlr-interest] Look-ahead problem parsing phrase?

Sun Jun 28 18:35:46 PDT 2009

Greetings!

On Sun, 2009-06-28 at 16:04 -0700, Sean O'Dell wrote:
> Thanks for the help Gavin.
> 
> Let me see if I understand this correctly; I'm trying to get my head
> down into this.
> 
> On "top-level lexer rules should not refer to other top-level lexer
> rules":  What constitutes a "top-level lexer rule?" Is that fragment
> vs. non-fragment (non-fragments are top-level)? 

Yes.

>  Also, if one top-level lexer rule refers to another, but adds
> additional rules, doesn't that remove the ambiguity?

No. Because Antlr lexers do not backtrack.

Harkening back your original question --- and this is from memory
because i deleted your original mail already, sorry.

you had some rules that did not correctly recognize trailing whitespace
(e.g. " foo bar baz \n" is supposed to be okay but isn't):

line : PHRASE WS? EOL? ;

PHRASE : WORD (WS WORD)* ; 

maybe the above is not your situation, sorry....

anyway hopefully the above is close enuf so that the following applies:

so we see the "foo" as a WORD and realize it is the first part of a
PHRASE, all is good. Next we see a WS so we start the loop for
recognizing the remaining, optional, WORDs in the PHRASE, and so all is
good for recognizing "bar" as the second WORD of the PHRASE. Next we see
a WS and we are in a loop recognizing the remaining WORDs of the PHRASE
and so all is good for recognizing "baz" as the third word of the
phrase. Next we see a WS and we are in a loop for recognizing the
remaining WORDs of the PHRASE but there is no WORD after that WS but our
rule for PHRASE demands that there be a WORD after the WS in the loop;
so we have a lexing error.

the lexer will not see that maybe the trailing WS in this PHRASE isn't
really part of the PHRASE but should part of whatever token occurs after
this PHRASE.

e.g. Antlr lexers do not backtrack.

this differs from my experience with regular expressions, which i think
is where you said you were coming from... anyway...

Antlr parsers are much better at dealing with this situation. Antlr
parsers support an arbitrary, say k, symbol lookahead for determining
which path through the parse is required (and the Tool will figure out
for itself how far to look ahead e.g. what k should be for each decision
point).

Which is why people will tell you to make your complicated lexer rules
into parser rules.  I think that if you had given the above rules for
your line and phrase as:

line : phrase WS? EOL? ;
phrase : WORD (WS WORD)* ;

then Antlr would have just go ahead and create a parser without any
problem (UNTESTED).

Please allow me to re-emphasize some excellent advise that Gavin Lambert
gave:

>         >On Sun, Jun 28, 2009 at 2:41 PM, Gavin Lambert wrote:
>                 >It's important that given any particular input in
>                 isolation, >there should be one and only one possible
>                 token that can be >produced for it.  Doing anything
>                 else is just letting >yourself in for a world of pain.

Regarding your query re: empty lexer rules:

> On ""EOL should be a parser rule":  Can you help me understand how a
> rule which can match zero characters creates an infinite loop?  

Empty top-level lexer rules are evil.

The lexer selects a rule for the next Token based upon the very first
next character in the input stream.

If there is no lexer rule that can recognize that next character as its
first, then a lexing error is announced.

But if there is a lexer rule that can match zero characters. Then, when
seeing an invalid initial character, the lexer will (correctly) emit an
instance of that empty token, since clearly an empty token is valid in
front of any error. And now we are back to the top level loop which sees
no valid token for our first character yet sees that an empty token is
valid, emits that empty token and loops to see an invalid character but
we can emit an empty token so we emit that empty token and then we are
back to the top level loop that sees an invalid character but we can
emit and empty token we emit that empty token and then we are back to
the top level loop.....

> Does making it a parser rule avoid that? 

maybe.

>  If so, how?

only because the parser is generally smarter about lookahead (and by
user option, backtracking under infinite lookahead).

Hope this helps....
   -jbb

> On Sun, Jun 28, 2009 at 2:41 PM, Gavin Lambert <antlr at mirality.co.nz>
> wrote:
>         At 09:21 29/06/2009, Sean O'Dell wrote:
>                 Why should lexer rules not refer to other lexer rules
>                 without being fragments?  I've read that doing so only
>                 prevented token creation.  It affects logic, as well?
>         
>         
>         The moment you have one top-level lexer rule referring to
>         another top-level rule, you introduce ambiguity -- you're
>         basically telling the lexer "given this input, produce one of
>         these two tokens but I don't care which", and then in the
>         parser you're expecting exactly one of those tokens.
>          Sometimes you'll happen to pick the right one and it'll
>         parse.  Sometimes you won't.  Sometimes the rules are
>         sufficiently different that given certain input it produces
>         one token and given other input it produces the other.  Then
>         you're basically screwed.
>         
>         It's important that given any particular input in isolation,
>         there should be one and only one possible token that can be
>         produced for it.  Doing anything else is just letting yourself
>         in for a world of pain.
>         
>         
>         Also, your EOL rule was a top-level lexer rule that can
>         successfully match zero characters.  Doing that creates
>         infinite loops, and is something else that must be avoided.
>          (Which is another reason why it should be a parser rule.)
>