[antlr-interest] Creating a lexer that returns a token for bad characters

Sun Apr 27 12:54:19 PDT 2008

At 06:51 28/04/2008, Bryan H. Haber wrote:
>INT : 'int';
>WHITESPACE : (' ')+;
>
>And the input is 'int   iint'.  I would want a token stream of 
>INT('int'), WHITESPACE('   ') and BAD('iint').  I just got the 
>ANTLR book, but is such a thing possible?  It looks like I would 
>have to create a new nextToken() method that tracks the start of 
>the bad characters and keeps consuming until it hits a valid 
>token.  I would then roll back that valid token and create a bad 
>token for the part recorded.  Is there a better way to do this?  Any 
>help would be appreciated, thanks.

Try adding this as the last lexer rule:

   BAD: .+;

Though I *think* this won't do exactly what you want, since it 
won't use whitespace as a delimiter: you should end up with 
INT('int'), WHITESPACE('   '), BAD('i'), INT('int').

Another option is to add an ID rule for identifiers; then 'iint' 
will match as a single identifier token and you can decide whether 
it's good or bad when it reaches the parser.  (With this approach 
the tokens will be whitespace-delimited.)
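
For what it's worth, a minimal sketch of the ID option in ANTLR 3 
syntax might look something like the following.  The INT and 
WHITESPACE rules are the ones from the original post; the grammar 
name and the exact set of characters allowed in ID are my own 
assumptions, so adjust them to suit your language.

```antlr
// Hypothetical grammar name, for illustration only.
lexer grammar BadTokenSketch;

// Keep the keyword above ID so that the exact text 'int' comes out as INT.
INT        : 'int' ;

// Assumed identifier shape: a letter or underscore, then letters, digits
// or underscores.  Anything word-like that isn't exactly 'int' lands here.
ID         : ('a'..'z' | 'A'..'Z' | '_')
             ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')* ;

WHITESPACE : (' ')+ ;
```

With the input 'int   iint' this should give INT('int'), 
WHITESPACE('   '), ID('iint'), and the parser can then reject ID 
wherever it isn't allowed.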