[antlr-interest] Creating a lexer that returns a token for bad characters

Sun Apr 27 13:24:50 PDT 2008

Ah, I hadn't thought of that.  Since I do need to recognize identifiers,
'iint' isn't actually a bad token, it's just not a keyword.  Thanks Gavin,
I'll try this out.

-----Original Message-----
From: Gavin Lambert [mailto:antlr at mirality.co.nz] 
Sent: Sunday, April 27, 2008 12:54 PM
To: Bryan H. Haber; antlr-interest at antlr.org
Subject: Re: [antlr-interest] Creating a lexer that returns a token for bad
characters

At 06:51 28/04/2008, Bryan H. Haber wrote:
>INT : 'int';
>WHITESPACE : (' ')+;
>
>And the input is 'int   iint'.  I would want a token stream of 
>INT('int'), WHITESPACE('   ') and BAD('iint').  I just got the 
>ANTLR book, but is such a thing possible?  It looks like I would 
>have to create a new nextToken() method that tracks the start of 
>the bad characters and keeps consuming until it hits a valid 
>token.  I would then roll back that valid token and create a bad 
>token for the part recorded.  Is there a better way to do this?  Any 
>help would be appreciated, thanks.

Try adding this as the last lexer rule:

   BAD: .+;

Though I *think* this won't do exactly what you want, since the 
lexer won't use whitespace as a delimiter; you should end up with 
INT('int'), WHITESPACE('   '), BAD('i'), INT('int').

Another option is just to add an ID rule for identifiers; then 
'iint' will match as an identifier and you can decide whether it's 
good or bad when it reaches the parser.  (This one will be 
whitespace delimited.)
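
A minimal sketch of this second option, assuming ANTLR 3 syntax.  The 
grammar name and the ID rule (including its character set) are 
illustrative assumptions, not part of the original grammar:

```
lexer grammar Sketch;   // hypothetical grammar name

// Keyword rule from the original post.
INT : 'int';

// Assumed identifier rule: a letter or underscore, then letters,
// digits or underscores.  'iint' is not the keyword, so it should
// match here instead of being reported as a bad character.
ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')*;

// Whitespace rule from the original post.
WHITESPACE : (' ')+;
```

With these rules the input 'int   iint' should come out as INT('int'), 
WHITESPACE('   '), ID('iint'), and the parser (or a later pass) can then 
reject any ID it does not expect.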