[antlr-interest] Fwd: Does lexer EVER use longest match?

Kirby Bohling kirby.bohling at gmail.com
Tue Oct 13 02:49:06 PDT 2009


Oops!  I meant to reply to the list, thanks for pointing out I failed
to do that.

On Mon, Oct 12, 2009 at 9:45 PM, Graham Wideman
<gwlist at grahamwideman.com> wrote:
> Hi folks:
>
> I've been scouring the list archives trying to scrape together a better understanding of what variants of lexer code ANTLR might generate.
>
> I keep seeing assertions that ANTLR lexers use longest match to select between otherwise ambiguous alternatives, and other assertions that it doesn't.
>
> In the code I see generated, I don't see any longest match logic, but maybe it only appears under particular grammar arrangements or options?
>
> Or are people using "longest match" when they really mean "once mToken has chosen a rule, then that rule will act greedy" (unless greedy=false)?
>
> I'd appreciate if someone could clarify with an example that shows where longest match comes into play.  Or confirm that ANTLR actually doesn't use that strategy?
>

I believe it does use longest match; it can easily handle:

OPEN_TABLE: '{|';
OPEN_TEMPLATE: '{{';
OPEN_FUNC: '{{{';
OPEN_CURLY: '{';

It'll handle all of those without issue: '{{{' will generate OPEN_FUNC,
not OPEN_TEMPLATE followed by OPEN_CURLY, and not three OPEN_CURLYs.
In fact, I don't think you even have to list them in the "proper"
order.  The key property here is that there is no input where the
lexer gets partway into a rule and then hits a character it can't
match.  That is the root of most of these problems.  Folks expect the
lexer to back out to the last point where it could have exited
cleanly, but that just isn't the case.  The lexer finishes what it
starts, or it errors out.  So don't let it start a rule unless you
know it can successfully finish it.
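
For reference, here is that example as a complete, untested lexer
grammar.  The grammar name and the catch-all OTHER rule are my
additions (OTHER is only there so that input other than curly braces
doesn't error out):

lexer grammar CurlySketch;

// Longer and shorter alternatives can coexist; the lexer picks the longest
// match, so '{{{' comes out as OPEN_FUNC even though OPEN_TEMPLATE and
// OPEN_CURLY could also start here.
OPEN_TABLE    : '{|'  ;
OPEN_TEMPLATE : '{{'  ;
OPEN_FUNC     : '{{{' ;
OPEN_CURLY    : '{'   ;

// Catch-all: any other single character becomes its own token instead of
// causing an error.
OTHER         : ~'{'  ;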

As to another thread's assertion that the lexer is LL(1), that is not
the case.  If two lexer rules match the same first 20 characters, the
lexer will look ahead 21 characters to identify which rule applies.
As long as the rules are not genuinely ambiguous, I believe it's no
big deal.  To see this, try:

FA: . . 'a';
FB: . . 'b';
F_OTHER: . . ~('a'|'b');
start: (FA|FB|F_OTHER)+ EOF;

Feed it 'ffa', 'ffb', and 'ffc' and it should return FA, FB, and
F_OTHER respectively; in each case the lexer had to look ahead 3
characters to determine which rule applies.  It'll blow up if the
input length isn't a multiple of 3, but that is a different problem
(starting something it can't finish).
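
The same fragment as a complete, untested combined grammar (the
grammar name is mine; note that the wildcard '.' is written twice with
a space between, not as the '..' range operator):

grammar LookaheadSketch;

start : (FA | FB | F_OTHER)+ EOF ;

// Each rule consumes any two characters and then one distinguishing
// character, so the decision can't be made until the third character
// of lookahead.
FA      : . . 'a' ;
FB      : . . 'b' ;
F_OTHER : . . ~('a'|'b') ;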

So if you have rules like:

TILDA: '~';
T3: '~~~';
T4: '~~~~';
T5: '~~~~~';

If you give it input of '~', '~~~', '~~~~', '~~~~~', and '~~~~~~', it
will do what you expect (TILDA, T3, T4, T5, and T5 followed by TILDA
respectively).  However, the input of '~~' will screw up the lexer and
cause an exception.

I believe this is because each rule is thrown out once it can no
longer match.  On the first '~', the DFA knows that all four rules are
potential outcomes.  Once it sees the second '~', it knows that only
T3, T4, or T5 are still viable (TILDA has been discarded).  The next
input is EOF, which isn't valid for any of the remaining live rules,
so the lexer throws an error/exception.

You can use lookahead to rule T3, T4, and T5 out up front when there
aren't enough tildes ahead: before the first character is matched, the
lookahead eliminates T3, T4, and T5, leaving TILDA as the only valid
option.  On the second '~', TILDA no longer matches, and since TILDA
was the last rule still valid, the lexer emits TILDA (had there been
multiple valid rules, it would have emitted the first one listed) and
starts lexing again at the second '~' character.
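
Here is one way I'd express that, as an untested sketch.  It
restructures things a bit (one lexer rule with syntactic predicates
and a $type action instead of four separate rules), so take it as an
illustration of the idea rather than the only way to do it:

grammar TildeSketch;

// T3, T4, T5 no longer have rules of their own, so declare their token
// types here.
tokens { T3; T4; T5; }

start : (TILDA | T3 | T4 | T5)+ EOF ;

// One rule covers all the tilde tokens.  A long alternative is only taken
// when the whole run of tildes is really there, and the action rewrites the
// token type, so '~~' comes out as two TILDA tokens instead of an error.
TILDA
  : ('~~~~~') => '~~~~~'  { $type = T5; }
  | ('~~~~')  => '~~~~'   { $type = T4; }
  | ('~~~')   => '~~~'    { $type = T3; }
  | '~'
  ;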

The problem arises when the lexer is mid-rule and then hits a
mismatch.  It does not go back to the last valid match at a token
boundary (in the case above, without lookahead predicates, it will not
fall back to a single '~' and emit TILDA).  When EOF is the next
character, the lexer is not at the end of *ANY* rule; it isn't at a
valid boundary for any lexer rule, the next character makes all of the
currently viable alternatives fail, and the lexer errors out.  This is
also why Joan Pujol's lexer is having issues in the other thread: the
'$' character matches, but the following '{' can't be matched, and the
lexer is not at a valid token boundary.  Hence the exception (the
OTHER rule can only match one character, so by the second character it
is no longer a valid option; TEXT was the only option left, it started
its second alternative, and couldn't finish it).

I think if you generate the NFA/DFAs for the lexer, a lot of this
becomes more apparent, but I've always had trouble doing that.  Until
I sat down and thought carefully about this and about the problems
I've run into, the lexer was mostly black magic.

Now if you enable backtracking on the lexer, it'll help with some of
these problems.  I can't say for sure what backtracking will do in the
lexer, but I know it does change behavior.

Hopefully somebody will correct me if I'm wrong, but with the problems
I've seen, and the solutions I've cooked up to work around them, I
believe this to be a mostly accurate description.

None of this comes from understanding LL(*) parsing; it is mostly
empirical knowledge gained by wrestling with a lexer that has lots of
these types of problems.  In the end, when the input is ambiguous and
you need context to figure it out, it's almost always easier to lex
single characters and deal with it in the parser, where backtracking
works the way you expect.  I ended up throwing away all of the "{{{",
"{{", "{", and "{|" rules in the grammar I was working on, lexing each
"{" and "|" separately, and then parsing them and creating imaginary
tokens.

Cheers,
  Kirby

> Thanks,
>
> Graham
>

