[antlr-interest] Does lexer EVER use longest match?

Tue Oct 13 02:19:08 PDT 2009

(Kirby sent me a lengthy email on observations about ANTLR's lexer behavior, prompting me to nose around some more...)

Thanks Kirby for your lengthy comments.

Actually, I guess I was aware that mToken would indeed choose the longest *fixed-length* alternative -- what I was really puzzling over was patterns where the length of the match isn't fixed -- so some pattern with repeats or alternate combinations in the middle.

I now see that doing this can prompt ANTLR to produce a DFA for mToken to use (instead of ifs and switches), and this indeed seems to be able to find the longest match for alternative variable-length patterns.  Example:

-------------------------------
grammar Test02;

file
  : (X1 | X2 | X3 | X4 | X5 | X6)+ EOF
  ;

X1: 'ab';
X2: 'cd';
X3: 'ef';
X4: 'xy'; 
X5: 'abc' .* 'pq';
X6: X1 .* X4;
-------------------------------
(And noting the special non-greedy behavior of .*)

This correctly eats: 'abcdefpqxy' as X6, despite the possibility of getting sidetracked onto X5.

OK, I think I'm satisfied on this particular point. Within the scope of the lexer, it does at least make an effort to choose the longest match. I'm not sure what limitations there may be on this. (And obviously since lexing is independent of parsing, ANTLR's code does not make alternative lexer decisions based on ability to *parse* a larger chunk.)

-- Graham