[antlr-interest] Creating a lexer that returns a token for bad characters
Gavin Lambert
antlr at mirality.co.nz
Sun Apr 27 12:54:19 PDT 2008
At 06:51 28/04/2008, Bryan H. Haber wrote:
>INT : 'int';
>WHITESPACE : (' ')+;
>
>And the input is 'int iint'. I would want a token stream of
>INT('int'), WHITESPACE(' ') and BAD('iint'). I just got the
>ANTLR book, but is such a thing possible? It looks like I would
>have to create a new nextToken() method that tracks the start of
>the bad character, keeps consuming until it hits a valid
>token. I would then roll back that valid token and create a bad
>token for the part recorded. Is there a better way to do this? Any
>help would be appreciated, thanks.
Try adding this as the last lexer rule:

BAD: .+;

Though I *think* this won't do exactly what you want, since it
won't use whitespace as a delimiter; you should end up with
INT('int'), WHITESPACE(' '), BAD('i'), INT('int').
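
To illustrate why a catch-all rule splits 'iint' instead of treating it as one bad word, here is a small Python model. It is only a sketch, not ANTLR's actual prediction algorithm: it assumes the lexer tries INT and WHITESPACE at each position and falls back to a one-character BAD token when neither matches. The names RULES and tokenize are made up for the example.

```python
import re

# Simplified model of a lexer (an assumption for illustration, NOT how
# ANTLR itself predicts tokens): try each rule in order at the current
# position; if none matches, emit a one-character BAD token.
RULES = [
    ("INT", re.compile(r"int")),
    ("WHITESPACE", re.compile(r" +")),
]


def tokenize(text):
    """Return a list of (token_type, token_text) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # Catch-all: nothing matched here, so consume one character.
            tokens.append(("BAD", text[pos]))
            pos += 1
    return tokens


print(tokenize("int iint"))
# [('INT', 'int'), ('WHITESPACE', ' '), ('BAD', 'i'), ('INT', 'int')]
```

Nothing in this model knows that a word has to end at whitespace, so once the stray 'i' is consumed the remaining 'int' looks like a perfectly good INT.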
Another option is just to add an ID rule for identifiers; then
'iint' will match as an identifier and you can decide whether it's
good or bad when it reaches the parser. (This one will be
whitespace delimited.)
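
A rough Python model of the ID approach, again as a sketch rather than ANTLR's real algorithm: it assumes a longest-match lexer where ties go to the rule listed first, so 'int' stays INT while 'iint' becomes a single ID. The identifier pattern, the helper undeclared, and the set of declared names are all hypothetical stand-ins for whatever check the parser would do.

```python
import re

# Assumed model: longest match wins, ties go to the earlier rule.
# The ID pattern is a typical identifier shape, chosen for illustration.
RULES = [
    ("INT", re.compile(r"int")),
    ("ID", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("WHITESPACE", re.compile(r" +")),
]


def tokenize(text):
    """Return a list of (token_type, token_text) pairs for text."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule (INT before ID) is kept.
            if match and (best is None or match.end() > best[1].end()):
                best = (name, match)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


def undeclared(tokens, declared):
    """Hypothetical later-stage check: ID tokens not in declared are bad."""
    return [text for kind, text in tokens if kind == "ID" and text not in declared]


tokens = tokenize("int iint")
print(tokens)
# [('INT', 'int'), ('WHITESPACE', ' '), ('ID', 'iint')]
print(undeclared(tokens, declared={"x"}))
# ['iint']
```

The lexer stays simple this way, and the decision about what counts as "bad" moves to the stage that has enough context to make it.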