[antlr-interest] Tokens

Fri Nov 27 20:50:30 PST 2009

You can, and should, override the type in the parser; this is easy to do if the parser produces a tree for a tree parser to walk. The tree grammar is then much simpler: it does not need a special id rule, which reduces the complexity (and code size) of the tree parser that walks the tree.

Don't try to do any manipulation in the lexer unless it is trivial and very deterministic, such as keywords only being keywords after some delimiter, or at the start of a line and so on.

All you need for a tree-producing parser (which is generally what you should be using) is:

id
	: ID
	| k=keywords  ->ID[$k.start]
	;

You could also use the tree node API to set the type if that feels clearer to you.

For a parser that does not produce a tree, just do (off the top of my head):

keywords
   : (
         k=A
       | k=B
        ... etc
     )
     {
         $k.setType(ID);
     }
   ;

You can deal with the token without changing its type, but unless you need to know that it was a keyword, it is probably simpler (for debugging and so on) to change its type to ID.
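
To illustrate what that action does, here is a minimal sketch in plain Java, with no ANTLR runtime involved. The Tok class, the token type numbers and the id() helper are all made up for illustration; in a real grammar ANTLR generates the token types and you would call setType on its own token object.

```java
// Toy model only: Tok stands in for an ANTLR token; it is not the ANTLR runtime class.
class KeywordAsId {
    // Hypothetical token type numbers; ANTLR generates its own.
    static final int ID = 1, APPLE = 2, PEAR = 3, ORANGE = 4;

    static final class Tok {
        int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    static boolean isKeyword(int type) {
        return type == APPLE || type == PEAR || type == ORANGE;
    }

    // Mirrors the 'id' rule: accept an ID or any keyword, and retag a keyword as ID.
    static Tok id(Tok t) {
        if (t.type == ID) return t;
        if (isKeyword(t.type)) {
            t.type = ID;   // same effect as $k.setType(ID) in the grammar action
            return t;
        }
        throw new IllegalArgumentException("expected identifier, got: " + t.text);
    }

    public static void main(String[] args) {
        Tok t = id(new Tok(APPLE, "apple"));
        System.out.println(t.text + " -> type " + t.type);   // apple -> type 1
    }
}
```

Anything downstream of id() only ever sees type ID, which is the point: later code checks for one type instead of ID plus every keyword.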

For lexers, just follow this rule: list all the known, deterministic things first, such as keywords, followed by any general rules that would otherwise match the same input:

K1 : 'K1' ;
K2 : 'K2' ;
...

ID : ('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z'| '0'..'9')* ;

Where there is no conflict between rules of indeterminate length, ANTLR will work out for itself how to distinguish the tokens, such as '/' vs '/=' and so on, though for clarity you might list the longer sequences first.

Play with the order of a few simple lexer rules and you will soon pick it up, because ANTLR gives you a warning or an error when the order is wrong. For example, with ID listed first:

ID : ('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z'| '0'..'9')* ;
K1 : 'K1';

[20:50:00] error(208): T.g:16:1: The following token definitions can never be matched because prior tokens match the same input: K1
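
The ordering rule can be sketched with a toy matcher in plain Java. This is a simplified model, not how ANTLR's generated lexer works internally (ANTLR 3 predicts the rule using lookahead); it assumes the longest match wins and ties go to the rule listed first, which gives the same answer for this K1/ID case. All the names here (RuleOrderDemo, firstToken, rules) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

class RuleOrderDemo {
    static boolean isLetter(char c) { return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'); }
    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    // Length of the prefix matched by ID : LETTER (LETTER | DIGIT)* ; 0 means no match.
    static int matchId(String s) {
        if (s.isEmpty() || !isLetter(s.charAt(0))) return 0;
        int n = 1;
        while (n < s.length() && (isLetter(s.charAt(n)) || isDigit(s.charAt(n)))) n++;
        return n;
    }

    // Length of the prefix matched by a literal rule such as K1 : 'K1' ; 0 means no match.
    static int matchLiteral(String literal, String s) {
        return s.startsWith(literal) ? literal.length() : 0;
    }

    // Assumed model: longest match wins, and a tie goes to the rule listed first.
    static String firstToken(Map<String, ToIntFunction<String>> rules, String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, ToIntFunction<String>> e : rules.entrySet()) {
            int len = e.getValue().applyAsInt(input);
            if (len > bestLen) { best = e.getKey(); bestLen = len; }
        }
        return best;
    }

    // Builds the two-rule lexer in either order (insertion order is preserved).
    static Map<String, ToIntFunction<String>> rules(boolean keywordFirst) {
        Map<String, ToIntFunction<String>> m = new LinkedHashMap<>();
        if (keywordFirst) m.put("K1", s -> matchLiteral("K1", s));
        m.put("ID", RuleOrderDemo::matchId);
        if (!keywordFirst) m.put("K1", s -> matchLiteral("K1", s));
        return m;
    }

    public static void main(String[] args) {
        System.out.println(firstToken(rules(true), "K1"));    // K1
        System.out.println(firstToken(rules(true), "K1x"));   // ID (longer match)
        System.out.println(firstToken(rules(false), "K1"));   // ID: K1 can never win
    }
}
```

With ID listed first, every input that K1 matches is matched at least as far by ID, so K1 is unreachable, which is what error(208) is reporting.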

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Kevin J. Cummings
> Sent: Friday, November 27, 2009 6:18 PM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Tokens
> 
> On 11/27/2009 08:39 PM, David-Sarah Hopwood wrote:
> > Kevin J. Cummings wrote:
> >> On 11/27/2009 06:05 PM, Ronald Sok wrote:
> >>> Kevin J. Cummings wrote:
> >>>> You are close.  What you have here is keywords as opposed to reserved
> >>>> words.  When implementing the former, you will need to do something like
> >>>> (at least this is what I do using ANTLR 2.7.7):
> >>>>
> >>>> id : ID
> >>>>    | k:keyword
> >>>>       { #k->setType(ID); }
> >>>>       // This changes the token type of a keyword to an ID
> >>>>    ;
> >>>>
> >>>> keyword
> >>>>    : APPLE | PEAR | ORANGE
> >>>>    ;
> >>>>
> >>>> someName
> >>>>    :     'Name:' id NEWLINE
> >>>>    ;
> >>>>
> >>>> You could reduce the number of productions by folding, but the principle
> >>>> of changing the token type of keywords is what is important here.  And
> >>>> you may have to find out how to do this with ANTLR 3.x.
> >>>
> >>> Ok, I tried to change this into ANTLR 3 syntax, but ran into the fact that
> >>> the result of keyword is a subtype of ParserRuleReturnScope, which
> >>
> >> Sorry, my bad, should be KEYWORD and done in the lexer, not the parser!
> >
> > That won't work because either:
> >  - KEYWORD is before APPLE, PEAR and ORANGE, in which case it always takes
> >    priority and the type of a Token will never be APPLE, PEAR or ORANGE;
> >
> >  - or, KEYWORD is after APPLE, PEAR and ORANGE, in which case those rules
> >    take priority and the type of a Token will never initially be KEYWORD.
> >    You could override it, but if you do that in a lexer rule then you
> >    don't have enough context to determine what it should be (and '$type ='
> >    can't be used in a parser rule).
> 
> Hmmm, in the context I used it in, I had to be able to tell when a
> keyword was being used as an identifier.  In ANTLR-2, I could override
> the token's type in the parser.  (Makes expression evaluation a whole
> lot easier further down the line when checking for an identifier....)
> 
> > As I said in my other followup, it's usually not necessary to change the
> > type (but you can do so using the code given in that post if you want).
> 
> --
> Kevin J. Cummings
> kjchome at rcn.com
> cummings at kjchome.homeip.net
> cummings at kjc386.framingham.ma.us
> Registered Linux User #1232 (http://counter.li.org)
> 