[antlr-interest] [newbie] Lexer Confusion

Fri Jul 4 18:56:08 PDT 2008

At 09:46 5/07/2008, UW Student wrote:
 >1) In my original grammar, how did the lexer decide which rule to
 >attempt first?  Did it just pick the one that would result in the
 >longer match?

My understanding is that it prefers the longest match; if two rules 
tie on length, it chooses the one listed first in the grammar.
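
To make that policy concrete, here is a rough Python sketch of "longest 
match wins, first listed breaks ties".  It only illustrates the selection 
rule as I understand it; it is not how ANTLR's prediction is actually 
implemented.  The rule definitions are my guess at the shape of your 
original grammar (which isn't quoted in this message), and the names 
RULES and tokenize are made up for the example.

```python
import re

# Hypothetical rules, in grammar order (assumed, not taken from the thread).
RULES = [
    ("TERM1", re.compile(r"\.")),      # a single dot
    ("TERM2", re.compile(r"\.\.+")),   # two or more dots
]

def tokenize(text, rules=RULES):
    """Split text into (rule_name, lexeme) pairs.

    At each position every rule is tried; the longest match wins, and
    the strict '>' comparison keeps the earlier rule on a tie.
    """
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

With these assumed rules, a lone '.' comes out as TERM1 and any longer 
run of dots comes out as one TERM2.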

 >2) Can you please confirm my understanding of your use of a
 >syntactic predicate?  On a single DOT, the lexer will return
 >a TERM1 token.  On a double DOT, the lexer will return a TERM2
 >token.  If this is the case, won't a triple DOT be lexed as
 >TERM2 TERM1 (rather than the reverse)?

That's correct.  But isn't that what you want anyway?  (One thing 
that is different between the two: your original rule would turn 
'....' into a single TERM2 token, while Johannes' version would 
turn it into two TERM2 tokens.)
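
Here is a small Python sketch of the behaviour described above for the 
merged rule: look two characters ahead, emit TERM2 if both are dots, 
otherwise emit TERM1 for a single dot.  This mimics the outcome only; 
Johannes' actual grammar isn't reproduced in this message, and the 
function name lex_dots is invented for the example.

```python
def lex_dots(text):
    """Return the token names produced for a string made of dots.

    Imitates a merged rule with a two-character lookahead check:
    '..' becomes TERM2, otherwise a single '.' becomes TERM1.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        if text.startswith("..", pos):
            tokens.append("TERM2")
            pos += 2
        elif text[pos] == ".":
            tokens.append("TERM1")
            pos += 1
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens
```

Under this scheme '...' gives TERM2 then TERM1, and '....' gives two 
TERM2 tokens rather than one.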

An unfortunate quirk of the recognition engine at the moment 
(which Johannes alluded to) is that once it is already "inside" a 
token, ANTLR tends to use only single character lookahead, 
especially when loop constructs are involved as well.  This is 
generally undesirable, but it can be worked around by merging 
rules like Johannes did (since using synpreds lets you specify 
arbitrary lookahead).  Hopefully it'll get resolved at some point :)