[antlr-interest] More on Lexer 2-char seq handling

Mon Oct 12 19:17:05 PDT 2009

Graham Wideman wrote:
> Hi folks:
> 
> Further to the discussion on a lexer matching a sequence that should stop before some multi-character pattern:
> 
> I read Kirby's post with interest, including the list discussions pointed to.  I'm not sure what to make of it.  The oddity to me is that ANTLR *almost* generates the right things:
> 
> 1. mTokens does the right thing.
> 
> 2. The lexer rule code that matches/consumes the string in question does look ahead and see the error it would make if it consumed the end-before-this pattern.
> 
> 3. ANTLR just doesn't generate the code to look ahead and *predict* that it should *stop*, it only looks ahead enough to predict which alternative *might* succeed based on the first character.

Yes, I get the impression that ANTLR lexers use a weaker recognition
strategy than ANTLR parsers. The problems seem to occur when you try to do
something in a lexer that would work in a parser (with tokens in place of
characters) only because of the stronger recognition strategy.
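
To make the difference concrete, here is a minimal hand-written Python
sketch. It is not ANTLR's generated code; the function names, the
MismatchError class and the "]]" terminator are all assumptions chosen for
illustration. The first scanner commits to an alternative after seeing one
character and only discovers the mistake while matching; the second looks
two characters ahead and so can predict that it should stop.

```python
class MismatchError(Exception):
    """Raised when a scanner has committed to an alternative that cannot match."""


def scan_one_char_prediction(text, stop="]]"):
    """Scan a body, choosing each alternative from ONE character of lookahead.

    Hypothetical model of the weaker strategy: on seeing stop[0] the scanner
    commits to "stop[0] followed by some other character" and only notices
    the full two-character terminator after it has already consumed stop[0].
    """
    pos = 0
    while pos < len(text):
        if text[pos] == stop[0]:
            pos += 1  # committed: stop[0] has been consumed
            if pos < len(text):
                if text[pos] == stop[1]:
                    # Too late to back out; the terminator is half-consumed.
                    raise MismatchError(f"unexpected {stop[1]!r} at offset {pos}")
                pos += 1
        else:
            pos += 1
    return text[:pos]


def scan_two_char_lookahead(text, stop="]]"):
    """Scan a body, stopping just before the two-character terminator."""
    pos = 0
    # Check the full terminator before consuming anything.
    while pos < len(text) and not text.startswith(stop, pos):
        pos += 1
    return text[:pos]
```

On the input "a]b]]c" the two-character version returns "a]b" and leaves
"]]" for the next rule, while the one-character version raises
MismatchError when it reaches the terminator. Both agree on input with no
terminator, such as "a]b".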

However, I haven't been able to find documentation of what exactly the
difference is -- the description of LL(*) in chapter 11 of 'The Definitive
ANTLR Reference' does not seem to distinguish between lexing and parsing.
(It does say that ANTLR does not generate ambiguity warnings for a lexer
that it would generate for a parser, instead preferring rules that are
specified first in the grammar. But that doesn't seem to be relevant to
this lookahead issue.)

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com