[antlr-interest] Creating a lexer that returns a token for bad characters

Gavin Lambert antlr at mirality.co.nz
Sun Apr 27 12:54:19 PDT 2008

At 06:51 28/04/2008, Bryan H. Haber wrote:
>INT : 'int';
>WHITESPACE : (' ')+;
>And the input is 'int   iint'.  I would want a token stream of 
>INT('int'), WHITESPACE('   ') and BAD('iint').  I just got the 
>ANTLR book, but is such a thing possible?  It looks like I would 
>have to create a new nextToken() method that tracks the start of 
>the bad characters and keeps consuming until it hits a valid 
>token.  I would then roll back that valid token and create a bad 
>token for the part recorded.  Is there a better way to do this?  Any 
>help would be appreciated, thanks.

Try adding this as the last lexer rule:

   BAD: .+;

Though I *think* this won't do exactly what you want, since it 
won't use whitespace as a delimiter; you will probably end up with 
INT('int'), WHITESPACE('   '), BAD('i'), INT('int').

Another option is just to add an ID rule for identifiers; then 
'iint' will match as an identifier and you can decide whether it's 
good or bad when it reaches the parser.  (This one will be 
whitespace delimited.)
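
To illustrate the identifier approach outside of ANTLR, here is a rough 
Python sketch. It is not ANTLR code and does not reproduce how an ANTLR 
lexer makes its predictions; it assumes a simple longest-match rule with 
the earliest rule winning ties. The names TOKEN_SPEC, tokenize and 
classify are made up for this example, and the ID pattern and the 
single-character BAD catch-all are assumptions that are not part of the 
original grammar.

```python
import re

# Hypothetical stand-ins for the lexer rules. Order is only a tie-break:
# when two rules match the same length, the earlier one wins.
TOKEN_SPEC = [
    ("INT", re.compile(r"int")),
    ("ID", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),  # assumed identifier shape
    ("WHITESPACE", re.compile(r" +")),
    ("BAD", re.compile(r".", re.DOTALL)),  # any one character nothing else accepts
]


def tokenize(text):
    """Return (type, text) pairs, taking the longest match at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        # The BAD rule matches any character, so best_len is always >= 1 here.
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


def classify(tokens, known=("INT", "WHITESPACE")):
    """Later pass (standing in for the parser): relabel unexpected tokens as BAD."""
    return [(kind if kind in known else "BAD", value) for kind, value in tokens]
```

With the input 'int   iint', tokenize() gives INT('int'), 
WHITESPACE('   '), ID('iint'), and classify() then turns the ID into 
BAD('iint'), so the bad token is whitespace delimited. The decision about 
what counts as bad lives in the later pass rather than in the lexer.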
