[antlr-interest] Forcing the lexer to never error

Fri Jun 15 15:13:41 PDT 2012

Hello all,

  This is all using ANTLR 3.4 with the C target. I'm trying to modify
my lexer grammar to never trigger a lexer error but instead emit a
special token, INVALID. So far I've done this by adding all invalid
sequences of characters to a special rule INVALID.

ASCOLCOLAS                 : '*::*';

INVALID :
 ...
 | '*:'
 | '*::';
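
As an aside, here is a rough C sketch of what these two rules amount to,
independent of ANTLR. The names TokenType, starts_with and scan_star are
hypothetical and are not part of the generated lexer or the ANTLR runtime.
Leaving every other input to other rules (TOK_OTHER, nothing consumed) is
an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token types; the real values come from the generated lexer. */
typedef enum { TOK_OTHER, TOK_ASCOLCOLAS, TOK_INVALID } TokenType;

/* Returns 1 if s begins with prefix, 0 otherwise. */
static int starts_with(const char *s, const char *prefix) {
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/*
 * Classify the text at s, longest candidate first, and store the number of
 * characters consumed in *len. The incomplete prefixes '*:' and '*::' become
 * TOK_INVALID instead of an error. Anything else is left to other rules
 * (assumption: TOK_OTHER with zero characters consumed).
 */
static TokenType scan_star(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    if (starts_with(s, "*::*")) { *len = 4; return TOK_ASCOLCOLAS; }
    if (starts_with(s, "*::"))  { *len = 3; return TOK_INVALID; }
    if (starts_with(s, "*:"))   { *len = 2; return TOK_INVALID; }
    *len = 0;
    return TOK_OTHER;
}
```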

Enumerating the invalid prefixes in INVALID works, but it gets tedious
for complex lexer rules. For instance, the rule for a line directive is
as follows:

DIR_LINE :
  'line'  SLSpace+ DecDigits SLSpace+ StrChars SLSpace+ DecDigits SLSpace* '\n';

To handle this I'd have to add a fairly complex alternative to the INVALID rule:

INVALID:
  ...
 | 'line'
  (
    ~SLSpace
  | SLSpace+
   (
     ~DecDigits
   | DecDigits ...
   )
  )
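
To make the shape of that alternative concrete, here is a small C sketch
of the same decision points, covering only the part shown above ('line',
SLSpace+, DecDigits). It is not ANTLR output. LineResult, is_slspace,
is_digit and scan_line_prefix are hypothetical names, and the sketch
assumes SLSpace is a space or tab, DecDigits is a run of ASCII digits,
and that end of input is reported as invalid without consuming anything
further.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch. */
typedef enum { LINE_NO_MATCH, LINE_PREFIX_OK, LINE_INVALID } LineResult;

/* Assumed definitions: SLSpace is a space or tab, DecDigits is [0-9]+. */
static int is_slspace(char c) { return c == ' ' || c == '\t'; }
static int is_digit(char c)   { return c >= '0' && c <= '9'; }

/*
 * Scans 'line' SLSpace+ DecDigits and stores the consumed length in *len.
 * A wrong character at either decision point is consumed and reported as
 * LINE_INVALID, mirroring the ~SLSpace and ~DecDigits alternatives.
 */
static LineResult scan_line_prefix(const char *s, size_t *len) {
    size_t i = 4;
    assert(s != NULL && len != NULL);
    *len = 0;
    if (strncmp(s, "line", 4) != 0) return LINE_NO_MATCH;
    if (!is_slspace(s[i])) {              /* 'line' ~SLSpace */
        *len = (s[i] != '\0') ? i + 1 : i;
        return LINE_INVALID;
    }
    while (is_slspace(s[i])) i++;         /* SLSpace+ */
    if (!is_digit(s[i])) {                /* SLSpace+ ~DecDigits */
        *len = (s[i] != '\0') ? i + 1 : i;
        return LINE_INVALID;
    }
    while (is_digit(s[i])) i++;           /* DecDigits */
    *len = i;
    return LINE_PREFIX_OK;
}
```

Every later element of DIR_LINE would add one more such branch, which is
why the INVALID alternative grows so quickly.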

I also tried adding alternatives to the DIR_LINE rule instead.
Unfortunately, ANTLR sometimes fails to generate the code in this case,
even when I let it run for several minutes. I also haven't found a way
to set the token type to INVALID from within the rule: ANTLR places the
token type assignment after all of the lexer rule's actions, so it
overrides my changes.

DIR_LINE :
  'line'
  (
    SLSpace+
    (
      DecDigits
      (
       ...
      |
      )
    | ~DecDigits {LEXSTATE->type = INVALID;} //This gets ignored in the C code
    )
  | ~SLSpace {ctx->lineError();}
  )
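
The override I'm seeing can be modelled with a few lines of plain C.
This is only a model of the symptom, not the generated code: LexState,
rule_writes_state and rule_writes_local are hypothetical names, and the
idea that the rule keeps its type in a local variable which is copied
into the state at the end is an assumption based on the behaviour
described above. If that is how it works, an action that writes the
local rather than the state would survive. ANTLR 3 grammars have a
$type attribute for lexer rules, but whether the C target maps
$type = INVALID; onto such a local is something I have not verified.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the token types and the lexer state. */
enum { TOK_DIR_LINE = 1, TOK_INVALID = 2 };
typedef struct { int type; } LexState;

/* Models a rule whose action writes straight to the shared state. */
static void rule_writes_state(LexState *st, int bad_input) {
    int type = TOK_DIR_LINE;                /* the rule's default type */
    assert(st != NULL);
    if (bad_input) st->type = TOK_INVALID;  /* action inside the rule body */
    st->type = type;                        /* final assignment overwrites it */
}

/* Models the same rule with the action writing the local instead. */
static void rule_writes_local(LexState *st, int bad_input) {
    int type = TOK_DIR_LINE;
    assert(st != NULL);
    if (bad_input) type = TOK_INVALID;      /* survives the final assignment */
    st->type = type;
}
```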

My first question is: are there performance issues caused by adding
the separate INVALID rule, as opposed to adding alternatives to the
existing rules? My understanding is yes, since extra lookahead is
needed to decide, for instance, whether REALNUM or INVALID should be
entered.

Secondly, is there a way to force the token type based on a rule action?

Thanks