[antlr-interest] Creating a lexer that returns a token for bad characters

Bryan H. Haber bryan.haber at gmail.com
Sun Apr 27 13:24:50 PDT 2008


Ah, I hadn't thought of that.  Since I do need to recognize identifiers,
'iint' isn't actually a bad token, it's just not a keyword.  Thanks Gavin,
I'll try this out.

-----Original Message-----
From: Gavin Lambert [mailto:antlr at mirality.co.nz] 
Sent: Sunday, April 27, 2008 12:54 PM
To: Bryan H. Haber; antlr-interest at antlr.org
Subject: Re: [antlr-interest] Creating a lexer that returns a token for bad
characters

At 06:51 28/04/2008, Bryan H. Haber wrote:
>INT : 'int';
>WHITESPACE : (' ')+;
>
>And the input is 'int   iint'.  I would want a token stream of 
>INT('int'), WHITESPACE('   ') and BAD('iint').  I just got the 
>ANTLR book, but is such a thing possible?  It looks like I would 
>have to create a new nextToken() method that tracks the start of 
>the bad character, keeps consuming until it hits a valid 
>token.  I would then rollback that valid token and create a bad 
>token for part recorded.  Is there a better way to do this?  Any 
>help would be appreciated, thanks.

Try adding this as the last lexer rule:

   BAD: .+;

Though I *think* this won't do exactly what you want, since it 
won't use whitespace as a delimiter; you would probably end up with 
INT('int'), WHITESPACE('   '), BAD('i'), INT('int').

Another option is just to add an ID rule for identifiers; then 
'iint' will match as an identifier and you can decide whether it's 
good or bad when it reaches the parser.  (This one will be 
whitespace delimited.)
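
A minimal sketch of the ID option, assuming ANTLR 3 lexer syntax. The 
ID rule and its character set are illustrative assumptions and are not 
part of the original grammar:

```
// Keyword rule listed before ID, so the exact text 'int' is
// expected to come out as INT rather than ID.
INT        : 'int' ;

// Hypothetical identifier rule; the allowed characters are an assumption.
ID         : ('a'..'z' | 'A'..'Z' | '_')
             ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')* ;

WHITESPACE : (' ')+ ;
```

With the input 'int   iint' this should give INT('int'), 
WHITESPACE('   '), ID('iint'), and the parser can then reject an ID 
wherever one is not allowed.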


