[antlr-interest] Writing (for now) just a lexer

Evan Driscoll driscoll at cs.wisc.edu
Wed Feb 11 15:42:19 PST 2009


Thanks for the earlier responses; I reply to them below. I have a new
question, though: the lexer doesn't seem to ignore hidden/whitespace
tokens.

I have this definition:
    NEWLINE : '\n'   { $channel=HIDDEN; }
            | '\r\n' { $channel=HIDDEN; }
            | '\r'   { $channel=HIDDEN; }
            ;
but NEWLINE tokens still get returned by the nextToken() function.

Does the $channel=HIDDEN only work if you start using it in the context
of a parser or something?
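
For what it's worth, in ANTLR 3 the $channel=HIDDEN action only tags the
token with a channel; nextToken() still hands it back. The filtering is
normally done by the token stream that sits between lexer and parser
(CommonTokenStream), so a driver that calls nextToken() directly has to
skip off-channel tokens itself (or the rule can call skip() instead, so
that no token is produced at all). Below is a minimal stand-alone sketch
of that filtering step. It does not use the ANTLR runtime: the Tok class,
the onChannel helper, and the channel constants are hypothetical
stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class HiddenChannelSketch {
    // Stand-in channel numbers (assumption: any two distinct values would do here).
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 99;

    // Hypothetical minimal token: the matched text plus the channel the lexer action assigned.
    static final class Tok {
        final String text;
        final int channel;

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // Keeps only the tokens on the requested channel, which is roughly
    // what a channel-aware token stream does before the parser sees them.
    static List<Tok> onChannel(List<Tok> all, int channel) {
        List<Tok> kept = new ArrayList<>();
        for (Tok t : all) {
            if (t.channel == channel) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Pretend these came from repeated nextToken() calls on the input "x\ny".
        List<Tok> raw = new ArrayList<>();
        raw.add(new Tok("x", DEFAULT_CHANNEL));
        raw.add(new Tok("\n", HIDDEN_CHANNEL));
        raw.add(new Tok("y", DEFAULT_CHANNEL));

        for (Tok t : onChannel(raw, DEFAULT_CHANNEL)) {
            System.out.println(t.text);
        }
    }
}
```

In a real driver the same test would go inside the nextToken() loop,
comparing each token's channel against the default one before printing.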

If you want to see code,
   Grammar: www.cs.wisc.edu/~cs536-1/projects/pr2/html/Tea.g.html
   Main fn: www.cs.wisc.edu/~cs536-1/projects/pr2/html/Test.java.html

(While I'm throwing out questions, is there a better way to do the
tokenNames() function in the main class? There's a getTokenNames()
method in the generated class, but it doesn't seem to work; my
impression is that it's generated for /parsers/ to contain the tokens
they refer to.)


Johannes Luber wrote:
> Furthermore, tokens with similar starting sequences need to be treated specially, as shown on <http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs>.

This is something I don't quite understand, because I can do this:
    INT_LITERAL
        : ('0'..'9')+
        ;

   FLOAT_LITERAL
        : ('0'..'9')+ '.' ('0'..'9')+
        ;
and it seems to just work. I can even use "INT_LITERAL '.' INT_LITERAL"
as the pattern for the float.

(I'm ignoring exponential numbers.)
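
The two rules above agree on inputs such as 42 and 3.14. Judging by the
title of the linked wiki page (dot, range), the awkward inputs are
presumably ones like 1..5 or 7., where a '.' follows the digits but no
fraction does. The following hand-written sketch shows the
two-character lookahead that keeps those cases as an INT followed by
something else. It is not the code ANTLR generates and makes no claim
about how ANTLR decides; the class and method names are hypothetical.

```java
class NumberScanSketch {
    // Matches the same character range as '0'..'9' in the grammar.
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Classifies the numeric token at the start of s and returns "KIND:text".
    static String scan(String s) {
        int i = 0;
        while (i < s.length() && isDigit(s.charAt(i))) {
            i++;
        }
        if (i == 0) {
            throw new IllegalArgumentException("input does not start with a digit");
        }
        // Commit to FLOAT only if the '.' is itself followed by a digit;
        // otherwise the '.' belongs to whatever token comes next.
        if (i + 1 < s.length() && s.charAt(i) == '.' && isDigit(s.charAt(i + 1))) {
            int j = i + 1;
            while (j < s.length() && isDigit(s.charAt(j))) {
                j++;
            }
            return "FLOAT:" + s.substring(0, j);
        }
        return "INT:" + s.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(scan("42"));    // INT:42
        System.out.println(scan("3.14"));  // FLOAT:3.14
        System.out.println(scan("1..5"));  // INT:1
        System.out.println(scan("7."));    // INT:7
    }
}
```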


Thomas Brandon wrote:
> You could emit your own token in a lexer rule in ANTLR which will
> prevent ANTLR generating its own token. Something like:
> INT:
>     '0'..'9'+
>     {
>          Token t = new IntLiteralToken(input, state.type,
>                      state.channel, state.tokenStartCharIndex,
>                      getCharIndex()-1);
>          t.setLine(state.tokenStartLine);
>          t.setText(state.text);
>          t.setCharPositionInLine(state.tokenStartCharPositionInLine);
>          emit(t);
>     }
>     ;

This makes sense; thanks.


> You can easily isolate it in the parser through a parser rule that
> takes care of it like (check syntax):
> int: INT -> INT<IntLiteralNode>[$INT];
> Where the IntLiteralNode constructor takes care of parsing the text to
> get an int value. As you can see custom nodes are rather simpler to do
> than custom tokens. And I think having custom nodes rather than tokens
> would be the more standard practice as you would more often have most
> of your functionality in the AST rather than the tokens.

I'll give this some thought too. (Maybe my conceptual model is different
for not having worked with a combined lexer/parser or something like
that, or maybe for having done a lot of processing of ASTs that are
unrelated to actual parsing, but it seems more natural to me to have the
text->int conversion done outside of the AST constructor.)


> The issue of whether int range checking should be in the lexer or
> parser seems neither here nor there. It doesn't seem like something
> that should halt further processing or that will introduce any
> syntactic ambiguity so it doesn't seem like it *needs* to be in the
> lexer.

Well, it doesn't need to be in the lexer. From a compiler construction
standpoint, it seems like a toss-up to me. I'm more interested from a
pedagogical standpoint, and it's only a slight preference even then.

(You could also split up range detection so it's separate from actually
converting the number like the sample in the example Johannes Luber
linked to, and check the range in the lexer but actually convert it
later. But it seems to me the best way to check the range is to try to
convert it anyway, so this seems the least attractive.)
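
A minimal sketch of the "check the range by trying to convert" idea,
assuming the literal is meant to fit in a 32-bit Java int. The class and
method names are hypothetical; Integer.parseInt is the standard library
call and throws NumberFormatException when the digits overflow.

```java
import java.util.OptionalInt;

class IntLiteralSketch {
    // Converts the digits matched by an INT_LITERAL-style rule.
    // Returns an empty OptionalInt when the value does not fit in an int,
    // so the conversion and the range check are one and the same step.
    static OptionalInt parseIntLiteral(String digits) {
        try {
            return OptionalInt.of(Integer.parseInt(digits));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIntLiteral("2147483647").isPresent()); // true
        System.out.println(parseIntLiteral("2147483648").isPresent()); // false
    }
}
```

Whether this runs in a lexer action or in a later pass, the caller only
has to decide what to report when the result is empty.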


Evan Driscoll


