[antlr-interest] Look-ahead problem parsing phrase?

Sean O'Dell sean at celsoft.com
Sun Jun 28 21:11:18 PDT 2009


(sorry if this is a duplicate; gmail defaulted to the wrong sender address,
and the list manager complained; not sure if my response got to the list)

John,

Thanks ... that really did help. I think I didn't realize how much better
the parser is than the lexer at looking ahead. It makes much more sense to
me now, though I'm not yet sure how I will deal with tokenizing optional
trailing whitespace.

I think, though, if I understand correctly: the lexer rule I build to
consume that whitespace should not be allowed to match nothing. If the
whitespace is optional, I should indicate that in a parser rule, not the
lexer rule.

Maybe another way to say this is (and maybe it's been said, but I didn't
"hear" it correctly): lexer rules should strive to be completely unambiguous
and should match something, preferably immediately from the left. Parser
rules can have more complex look-ahead patterns.

I really appreciate everyone's help. I know how frustrating it can be
answering "getting started" questions, especially from someone who doesn't
quite yet "get it."

Sean


On Jun 28, 2009 6:35pm, "John B. Brodie" <jbb at acm.org> wrote:
> Greetings!
>
> On Sun, 2009-06-28 at 16:04 -0700, Sean O'Dell wrote:
> > Thanks for the help Gavin.
> >
> > Let me see if I understand this correctly; I'm trying to get my head
> > down into this.
> >
> > On "top-level lexer rules should not refer to other top-level lexer
> > rules": What constitutes a "top-level lexer rule?" Is that fragment
> > vs. non-fragment (non-fragments are top-level)?
>
> Yes.
>
>
>
> > Also, if one top-level lexer rule refers to another, but adds
> > additional rules, doesn't that remove the ambiguity?
>
> No, because Antlr lexers do not backtrack.
>
> Harkening back to your original question --- and this is from memory
> because i deleted your original mail already, sorry.
>
> You had some rules that did not correctly recognize trailing whitespace
> (e.g. " foo bar baz \n" is supposed to be okay but isn't):
>
> line : PHRASE WS? EOL? ;
>
> PHRASE : WORD (WS WORD)* ;
>
> Maybe the above is not your situation, sorry.... anyway, hopefully it
> is close enough that the following applies:
>
>
>
> We see "foo" as a WORD and realize it is the first part of a PHRASE;
> all is good. Next we see a WS, so we start the loop for recognizing the
> remaining, optional, WORDs in the PHRASE, and "bar" is recognized as
> the second WORD. Likewise for the next WS and "baz", the third WORD.
> Then we see one more WS, and since we are still in the loop, our rule
> for PHRASE demands a WORD after that WS -- but there is no WORD after
> it, so we have a lexing error.
>
> The lexer will not see that maybe the trailing WS isn't really part of
> the PHRASE but should be part of whatever token occurs after it.
> That is, Antlr lexers do not backtrack.
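The failure mode John describes can be sketched in a few lines of Python. This is my own hand-rolled imitation of a greedy, non-backtracking lexer (not ANTLR-generated code); the regexes are stand-ins for the WORD and WS rules above:

```python
import re

# Sketch of PHRASE : WORD (WS WORD)* ; lexed greedily with no backtracking.
# Once the loop commits to a WS, it cannot "give it back", so trailing
# whitespace causes a lexing error instead of ending the PHRASE early.

WORD = re.compile(r'[a-z]+')
WS = re.compile(r'[ \t]+')

def lex_phrase(text, pos=0):
    m = WORD.match(text, pos)
    if not m:
        raise ValueError(f"lexing error at {pos}: expected WORD")
    pos = m.end()
    while True:
        ws = WS.match(text, pos)
        if not ws:
            break                  # no WS here: the PHRASE ends cleanly
        pos = ws.end()             # commit to the WS -- no backtracking
        m = WORD.match(text, pos)
        if not m:                  # WS consumed, but no WORD follows it
            raise ValueError(f"lexing error at {pos}: WS not followed by WORD")
        pos = m.end()
    return text[:pos]

print(lex_phrase("foo bar baz"))   # matches the whole phrase
try:
    lex_phrase("foo bar baz \n")   # trailing WS: fails inside the loop
except ValueError as e:
    print("error:", e)
```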
>
>
>
> This differs from my experience with regular expressions, which i think
> is where you said you were coming from... anyway...
>
>
>
> Antlr parsers are much better at dealing with this situation. Antlr
> parsers support an arbitrary, say k, symbol lookahead for determining
> which path through the parse is required (and the Tool will figure out
> for itself how far to look ahead, i.e. what k should be for each
> decision point).
>
> Which is why people will tell you to make your complicated lexer rules
> into parser rules. I think that if you had given the above rules for
> your line and phrase as:
>
> line : phrase WS? EOL? ;
> phrase : WORD (WS WORD)* ;
>
> then Antlr would have just gone ahead and created a parser without any
> problem (UNTESTED).
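To see why the parser-level version works, here is a small Python sketch of the same grammar done the recommended way (again my own illustration, not ANTLR output): the lexer emits simple, unambiguous WORD/WS/EOL tokens, and the parser uses two tokens of lookahead to decide whether a WS continues the phrase or is the optional trailing WS:

```python
import re

# Tokens are kept trivial and unambiguous; the grammar decisions move
# up into the parser, mirroring:
#   line   : phrase WS? EOL? ;
#   phrase : WORD (WS WORD)* ;

TOKEN = re.compile(r'(?P<WORD>[a-z]+)|(?P<WS>[ \t]+)|(?P<EOL>\n)')

def tokenize(text):
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

def parse_line(tokens):
    words, i = [], 0
    assert tokens and tokens[i][0] == 'WORD', "line must start with a WORD"
    words.append(tokens[i][1]); i += 1
    # (WS WORD)* -- consume the WS only if a WORD follows it (k = 2)
    while (i + 1 < len(tokens)
           and tokens[i][0] == 'WS' and tokens[i + 1][0] == 'WORD'):
        words.append(tokens[i + 1][1]); i += 2
    if i < len(tokens) and tokens[i][0] == 'WS':   # optional trailing WS
        i += 1
    if i < len(tokens) and tokens[i][0] == 'EOL':  # optional EOL
        i += 1
    assert i == len(tokens), "unexpected trailing input"
    return words

print(parse_line(tokenize("foo bar baz \n")))   # → ['foo', 'bar', 'baz']
```

The trailing " \n" that broke the lexer-level PHRASE rule now parses fine, because the decision point can peek past the WS before committing.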
>
>
>
>
>
>
>
> Please allow me to re-emphasize some excellent advice that Gavin
> Lambert gave:
>
> > On Sun, Jun 28, 2009 at 2:41 PM, Gavin Lambert wrote:
> > > It's important that given any particular input in isolation,
> > > there should be one and only one possible token that can be
> > > produced for it. Doing anything else is just letting yourself
> > > in for a world of pain.
>
> Regarding your query re: empty lexer rules:
>
> > On "EOL should be a parser rule": Can you help me understand how a
> > rule which can match zero characters creates an infinite loop?
>
> Empty top-level lexer rules are evil.
>
> The lexer selects a rule for the next Token based upon the very next
> character in the input stream.
>
> If there is no lexer rule that can recognize that next character as
> its first, then a lexing error is announced.
>
> But if there is a lexer rule that can match zero characters, then,
> when seeing an invalid initial character, the lexer will (correctly)
> emit an instance of that empty token, since clearly an empty token is
> valid in front of any error. And now we are back at the top-level
> loop, which again sees no valid token for our first character, yet
> sees that an empty token is valid, so it emits another empty token,
> loops, sees the same invalid character, emits another empty token...
> and so on, forever: the input position never advances.
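That non-terminating top-level loop is easy to reproduce in miniature. A sketch (mine, with the loop capped so the demo terminates) of a lexer whose EOL rule can match zero characters:

```python
# A token rule like EOL : '\r'? '\n'? ; can succeed while consuming
# nothing. The lexer's top-level loop then never makes progress: it
# emits empty EOL tokens forever on any character EOL cannot match.

def next_token(text, pos):
    n = 0
    if text[pos:pos + 1] == '\r':
        n += 1
    if text[pos + n:pos + n + 1] == '\n':
        n += 1
    return ('EOL', text[pos:pos + n], pos + n)   # may consume nothing

def lex(text, max_tokens=5):
    pos, out = 0, []
    while pos < len(text) and len(out) < max_tokens:
        kind, value, pos = next_token(text, pos)  # pos never advances on "x"
        out.append((kind, value))
    return out

# Without the max_tokens cap, this would loop forever:
print(lex("x"))   # → [('EOL', ''), ('EOL', ''), ('EOL', ''), ('EOL', ''), ('EOL', '')]
```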
>
>
>
> > Does making it a parser rule avoid that?
>
> Maybe.
>
> > If so, how?
>
> Only because the parser is generally smarter about lookahead (and, by
> user option, backtracking under infinite lookahead).
>
> Hope this helps....
>    -jbb
>
>
>
>
>
> > On Sun, Jun 28, 2009 at 2:41 PM, Gavin Lambert <antlr at mirality.co.nz>
> > wrote:
> > At 09:21 29/06/2009, Sean O'Dell wrote:
> > > Why should lexer rules not refer to other lexer rules
> > > without being fragments? I've read that doing so only
> > > prevented token creation. It affects logic, as well?
> >
> > The moment you have one top-level lexer rule referring to
> > another top-level rule, you introduce ambiguity -- you're
> > basically telling the lexer "given this input, produce one of
> > these two tokens but I don't care which", and then in the
> > parser you're expecting exactly one of those tokens.
> > Sometimes you'll happen to pick the right one and it'll
> > parse. Sometimes you won't. Sometimes the rules are
> > sufficiently different that given certain input it produces
> > one token and given other input it produces the other. Then
> > you're basically screwed.
> >
> > It's important that given any particular input in isolation,
> > there should be one and only one possible token that can be
> > produced for it. Doing anything else is just letting yourself
> > in for a world of pain.
> >
> > Also, your EOL rule was a top-level lexer rule that can
> > successfully match zero characters. Doing that creates
> > infinite loops, and is something else that must be avoided.
> > (Which is another reason why it should be a parser rule.)
>