[antlr-interest] Multiple lexer tokens per rule
Ken Williams
ken.williams at thomsonreuters.com
Thu Jun 3 13:42:03 PDT 2010
Both the DAR book and the Javadoc
(http://www.antlr.org/api/ActionScript/org/antlr/runtime/Lexer.html#emitToken())
mention that if you want to emit multiple tokens for a single lexer
rule, you need to override emit() or emitToken(). Does anyone have any
examples of doing that?
I assume nextToken() would also need to be overridden.
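To make the question concrete, here is a minimal sketch of the queueing idea I have in mind, written as plain Java with no ANTLR dependency. Everything in it (Tok, QueueingLexerSketch, matchOneRule) is a hypothetical stand-in, not the ANTLR runtime: emit() buffers tokens instead of keeping only the last one, and nextToken() drains the buffer before matching another rule. I assume a real override of the lexer's emit() and nextToken() would follow the same shape, but I have not verified that.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a lexer token: a type name plus the matched text.
class Tok {
    final String type;
    final String text;
    Tok(String type, String text) { this.type = type; this.text = text; }
    @Override public String toString() { return type + "(" + text + ")"; }
}

// Plain-Java illustration of the queueing idea; this is NOT the ANTLR Lexer class.
class QueueingLexerSketch {
    private final Deque<Tok> pending = new ArrayDeque<>();
    private final String[] lexemes;   // pre-split input, one entry per "rule match"
    private int pos = 0;

    QueueingLexerSketch(String... lexemes) { this.lexemes = lexemes; }

    // Analogue of an overridden emit(): buffer the token rather than overwrite it.
    void emit(Tok t) { pending.addLast(t); }

    // Stand-in for one lexer rule firing; it may emit one or two tokens.
    private void matchOneRule() {
        String s = lexemes[pos++];
        if (s.endsWith(",")) {
            emit(new Tok("NUMERIC", s.substring(0, s.length() - 1)));
            emit(new Tok("PUNC", ","));
        } else {
            emit(new Tok("NUMERIC", s));
        }
    }

    // Analogue of an overridden nextToken(): drain the queue before matching again.
    Tok nextToken() {
        while (pending.isEmpty()) {
            if (pos >= lexemes.length) return new Tok("EOF", "");
            matchOneRule();
        }
        return pending.removeFirst();
    }
}
```

Calling nextToken() repeatedly on new QueueingLexerSketch("23,450,", "1,000") hands back the two tokens from the first match one at a time, then the token from the second match, then EOF.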
In case I have an XY Problem
(http://www.perlmonks.org/index.pl?node_id=542341), my use case is to tokenize
input as in the following examples:
23 -> DIGITS
23, -> DIGITS PUNC
23,450 -> NUMERIC
23,450, -> NUMERIC PUNC
To do that, I'm using a lexer rule that consumes all the numeric & permitted
in-numeric punctuation, then I fix it up afterwards:
-----------------------
token : ...
      | DIGITS
      | NUMERIC -> {fixNum($text)}
      | PUNC
      ;
PUNC : '-' | ',' | '.' ;
fragment DIGIT : '0'..'9' ;
NUMERIC : DIGIT (DIGIT | PUNC)*
{if ($text.matches("^[0-9]+$")) {$type=DIGITS;}} ;
-----------------------
My fixNum() method is trying to fix things up at the parser level, but I
really want to do it in the lexer.
An alternate solution might be to "push back" any trailing punctuation onto
the input stream. Not sure if that's possible?
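As an illustration of what I mean by "push back", here is a small plain-Java sketch, independent of ANTLR. The class and method names (PushBackScannerSketch, scan) are made up for this example, and the cursor rewind stands in for whatever the lexer's input stream would actually offer, which I have not checked. It matches greedily like DIGIT (DIGIT | PUNC)*, then rewinds past any trailing punctuation so that it gets tokenized separately as PUNC.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of "pushing back" trailing punctuation; not ANTLR API.
class PushBackScannerSketch {
    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    // Same character set as the PUNC rule: '-', ',' and '.'.
    private static boolean isPunc(char c) { return c == '-' || c == ',' || c == '.'; }

    // Returns tokens as "TYPE:text" strings.
    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (isDigit(c)) {
                int start = i;
                // Greedy match, like DIGIT (DIGIT | PUNC)*.
                while (i < input.length()
                        && (isDigit(input.charAt(i)) || isPunc(input.charAt(i)))) {
                    i++;
                }
                // Push back: rewind the cursor until the match ends on a digit.
                while (!isDigit(input.charAt(i - 1))) {
                    i--;
                }
                String text = input.substring(start, i);
                String type = text.matches("[0-9]+") ? "DIGITS" : "NUMERIC";
                out.add(type + ":" + text);
            } else if (isPunc(c)) {
                out.add("PUNC:" + c);
                i++;
            } else {
                i++; // this sketch simply skips anything else, e.g. whitespace
            }
        }
        return out;
    }
}
```

With this approach no token queue is needed, because the pushed-back punctuation is simply matched again on the next pass through the loop.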
--
Ken Williams
Sr. Research Scientist
Thomson Reuters
Phone: 651-848-7712
ken.williams at thomsonreuters.com