[antlr-interest] Multiple lexer tokens per rule

Ken Williams ken.williams at thomsonreuters.com
Thu Jun 3 13:42:03 PDT 2010

Both the DAR book and the Javadoc mention that if you want to emit
multiple tokens for a single lexer
rule, you need to override emit() or emitToken().  Does anyone have any
examples of doing that?

I assume nextToken() would also need to be overridden.
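
For illustration, the emit()/nextToken() idea can be modelled without the ANTLR runtime at all. The sketch below is only that: a stand-alone Java model in which emit() appends to a queue and nextToken() drains the queue before running another rule. The names Tok, QueueingLexer, and matchOneRule are hypothetical and not part of any ANTLR API; a real ANTLR 3 lexer would presumably apply the same queueing inside its own overridden emit() and nextToken().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Stand-alone model of "buffer tokens in emit(), drain them in nextToken()".
// Nothing here is ANTLR API: Tok, QueueingLexer and matchOneRule are hypothetical names.
class Tok {
    final String type;
    final String text;
    Tok(String type, String text) { this.type = type; this.text = text; }
    @Override public String toString() { return type + "(" + text + ")"; }
}

class QueueingLexer {
    private final String input;
    private int pos = 0;
    private final Deque<Tok> pending = new ArrayDeque<>();

    QueueingLexer(String input) { this.input = input; }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }
    private static boolean isPunc(char c) { return c == '-' || c == ',' || c == '.'; }

    // A single rule may call emit() several times; each token is queued.
    void emit(Tok t) { pending.addLast(t); }

    // Hand out queued tokens first; run another rule only when the queue is empty.
    Tok nextToken() {
        while (pending.isEmpty()) {
            if (pos >= input.length()) return new Tok("EOF", "");
            matchOneRule();
        }
        return pending.removeFirst();
    }

    // One "rule": a lone punctuation character becomes PUNC; a run starting with
    // a digit is consumed greedily, then trailing punctuation is split off.
    private void matchOneRule() {
        char c = input.charAt(pos);
        if (isPunc(c)) { pos++; emit(new Tok("PUNC", String.valueOf(c))); return; }
        if (!isDigit(c)) { pos++; return; }   // skip whitespace and anything else
        int start = pos;
        while (pos < input.length()
                && (isDigit(input.charAt(pos)) || isPunc(input.charAt(pos)))) pos++;
        int end = pos;
        while (isPunc(input.charAt(end - 1))) end--;   // stops at the leading digit at the latest
        String body = input.substring(start, end);
        emit(new Tok(body.matches("[0-9]+") ? "DIGITS" : "NUMERIC", body));
        for (int i = end; i < pos; i++) emit(new Tok("PUNC", String.valueOf(input.charAt(i))));
    }

    // Convenience helper: collect every token up to (not including) EOF.
    static List<String> tokenize(String s) {
        QueueingLexer lx = new QueueingLexer(s);
        List<String> out = new ArrayList<>();
        for (Tok t = lx.nextToken(); !t.type.equals("EOF"); t = lx.nextToken()) out.add(t.toString());
        return out;
    }
}
```

Calling QueueingLexer.tokenize("23,450,") walks through nextToken() until EOF; the second call is served from the queue without running a rule, which is the point of overriding both methods together.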

In case this is an XY Problem
(http://www.perlmonks.org/index.pl?node_id=542341), my use case is to
tokenize input as in the following examples:

23      -> DIGITS
23,     -> DIGITS PUNC
23,450  -> NUMERIC
23,450, -> NUMERIC PUNC

To do that, I'm using a lexer rule that consumes all the numeric & permitted
in-numeric punctuation, then I fix it up afterwards:

token    : ...
    | DIGITS 
    | NUMERIC -> {fixNum($text)}
    | PUNC

PUNC   : '-' | ',' | '.' ;
fragment DIGIT    : '0'..'9' ;
        {if ($text.matches("^[0-9]+$")) {$type=DIGITS;}} ;

My fixNum() method is trying to fix things up at the parser level, but I
really want to do it in the lexer.

An alternate solution might be to "push back" any trailing punctuation onto
the input stream.  Not sure if that's possible?
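
As a rough illustration of the "push back" idea, here is a stand-alone Java sketch, again independent of ANTLR: the lexer consumes the run greedily, then rewinds its position over any trailing punctuation so the next call sees it. PushbackLexer is a hypothetical name. ANTLR 3's IntStream interface does declare index() and seek(int), but whether rewinding from inside a lexer rule action behaves well there is an assumption that is not tested here.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone model of "match greedily, then push trailing punctuation back".
// PushbackLexer is a hypothetical name; none of this is ANTLR API.
class PushbackLexer {
    private final String input;
    private int pos = 0;

    PushbackLexer(String input) { this.input = input; }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }
    private static boolean isPunc(char c) { return c == '-' || c == ',' || c == '.'; }

    // Returns "TYPE(text)" for the next token, or null at end of input.
    String nextToken() {
        // Skip whitespace and any other character that is neither digit nor punctuation.
        while (pos < input.length()
                && !isDigit(input.charAt(pos)) && !isPunc(input.charAt(pos))) pos++;
        if (pos >= input.length()) return null;

        char c = input.charAt(pos);
        if (isPunc(c)) { pos++; return "PUNC(" + c + ")"; }

        // Greedy match: digits plus any in-number punctuation.
        int start = pos;
        while (pos < input.length()
                && (isDigit(input.charAt(pos)) || isPunc(input.charAt(pos)))) pos++;

        // "Push back": rewind over trailing punctuation so it is lexed again later.
        while (isPunc(input.charAt(pos - 1))) pos--;   // stops at the leading digit at the latest

        String text = input.substring(start, pos);
        return (text.matches("[0-9]+") ? "DIGITS" : "NUMERIC") + "(" + text + ")";
    }

    // Convenience helper: collect every token until end of input.
    static List<String> tokenize(String s) {
        PushbackLexer lx = new PushbackLexer(s);
        List<String> out = new ArrayList<>();
        for (String t = lx.nextToken(); t != null; t = lx.nextToken()) out.add(t);
        return out;
    }
}
```

Compared with queueing several tokens, this keeps one token per call and needs no buffer, at the cost of scanning the trailing punctuation twice.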

Ken Williams
Sr. Research Scientist
Thomson Reuters
Phone: 651-848-7712
ken.williams at thomsonreuters.com
