[antlr-interest] Writing (for now) just a lexer

Wed Feb 11 06:08:28 PST 2009

On Wed, Feb 11, 2009 at 7:52 PM, Evan Driscoll <driscoll at cs.wisc.edu> wrote:

> Hi all,
>
> I'm teaching a compilers class at University of Wisconsin-Madison.
> Traditionally the class has followed sort of a classic sequence of
> projects 'write a lexer', 'write a parser', etc., and in the past has
> used either JLex/Java CUP or Flex/Bison for the lexer and parser
> generator. This is my first time teaching this class, and I'm writing
> these assignments assuming the use of ANTLR instead. I don't really want
> to make major changes to the class, so I want to keep these assignments
> separate, but the combined nature of ANTLR grammars has thrown a couple
> oddities into the way this works. Anyway, this is the one I'm not really
> sure how to deal with, as I'm also new to ANTLR.
>
> The question is this: how do I store additional information in a token?
> (E.g. for the token corresponding to an int literal, how would I store
> the value as an int?)
>
> Using something like Flex, I know how to do this; just add an additional
> option in the union representing the token type. But under ANTLR, I'm
> not sure. I see "How do I use a custom token type?"
> (http://www.antlr.org/wiki/pages/viewpage.action?pageId=1844), but this
> isn't quite what I want, as I want to be able to return a subclass of
> CommonToken for just a couple particular rules.

You could emit your own token in a lexer rule, which prevents ANTLR from
generating its own token. Something like:
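INT:
    '0'..'9'+
    {    // Same steps as the default Lexer.emit(), but with a custom token class.
         // The type is passed as INT because state.type may not be assigned
         // yet while this action runs.
         Token t = new IntLiteralToken(input, INT, state.channel,
                                       state.tokenStartCharIndex, getCharIndex()-1);
         t.setLine(state.tokenStartLine);
         t.setText(state.text);
         t.setCharPositionInLine(state.tokenStartCharPositionInLine);
         emit(t);
    }
    ;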
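
For reference, here is a minimal sketch of what such a token class might
hold. IntLiteralToken is the hypothetical class named in the rule above; in a
real lexer it would extend org.antlr.runtime.CommonToken and pass the
constructor arguments through to super(...). So that the sketch compiles
without the ANTLR jar, BaseToken below is a made-up stand-in for CommonToken,
and its two-argument constructor is an assumption for illustration only.

```java
// Made-up stand-in for org.antlr.runtime.CommonToken, so this compiles alone.
class BaseToken {
    private final int type;
    private String text;

    BaseToken(int type, String text) {
        this.type = type;
        this.text = text;
    }

    int getType() { return type; }
    String getText() { return text; }
    void setText(String text) { this.text = text; }
}

// Hypothetical token subclass that exposes the literal's value as an int.
class IntLiteralToken extends BaseToken {
    IntLiteralToken(int type, String text) {
        super(type, text);
    }

    // Converts on demand, so it still works if the text is set after construction.
    // Integer.parseInt throws NumberFormatException when the literal does not fit in an int.
    int getValue() {
        return Integer.parseInt(getText());
    }
}
```

The conversion is done lazily here because the rule above calls setText()
only after the token has been constructed.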

>
>
> The couple grammars I've looked at (for Java) don't do this, presumably
> leaving the string->integer conversion for later, but this doesn't make
> a whole lot of sense to me, to be honest. There are potentially multiple
> contexts where this sort of thing would need to be done later, while
> doing it in lexing seems cleaner. It also allows me to keep better
> consistency with the fact that I've been giving "an integer literal is
> too large" as an example of an error that could arise during lexing.
> (Not that you *couldn't* do it later.)

You can easily isolate it in the parser with a single parser rule that takes
care of it, something like (check the syntax):
intLiteral: INT -> INT<IntLiteralNode>[$INT];
Where the IntLiteralNode constructor takes care of parsing the text to get
an int value.
As you can see, custom nodes are rather simpler to do than custom tokens. I
also think custom nodes, rather than custom tokens, are the more standard
practice, since most of your functionality usually lives in the AST rather
than in the tokens.
The issue of whether int range checking should be in the lexer or parser
seems neither here nor there. It doesn't seem like something that should
halt further processing or that will introduce any syntactic ambiguity, so it
doesn't seem like it *needs* to be in the lexer.
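
To illustrate a range check that does not halt processing, here is a rough
sketch of a helper that a node constructor (or a lexer action) could call.
IntLiteralConverter, its method names, and the choice of 0 as a placeholder
value are all assumptions for illustration, not part of ANTLR. It assumes the
text matched '0'..'9'+, so the only way the conversion can fail is overflow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: converts literal text and records range errors instead of throwing.
final class IntLiteralConverter {
    private final List<String> errors = new ArrayList<>();

    // Returns the literal's value, or 0 after recording an error if it does not fit in an int.
    int convert(String text, int line) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            errors.add("line " + line + ": integer literal is too large: " + text);
            return 0; // placeholder value so later phases can keep going
        }
    }

    List<String> getErrors() {
        return errors;
    }
}
```

Because the error is only recorded, the same helper behaves identically
whether it is called while lexing or while building the tree.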

Tom.

>
>
> Evan Driscoll
>
>