[antlr-interest] How to switch the recognition scope in Lexer

Wed Jun 20 06:51:42 PDT 2007

On 6/20/07, Silvester Pozarnik <silvester.pozarnik at tracetracker.com> wrote:
> >> Silvester Pozarnik wrote this on [20 June 2007 13:00]:
> >>
> >> In ANTLR 2.7.7 you could change the behaviour of the lexer so
> >> that tokens are recognized as literals in special cases by
> >> overriding the testLiteralsTable() method in the CharScanner
> >> class. How do you do the same in ANTLR 3.0 if you have a
> >> grammar such as:
> >>
> >>      grammar test;
> >>      tokens {
> >>              MYTOKEN = 'mytoken';
> >>      }
> >>      mygrammar:
> >>              MYTOKEN LPAREN IDENTIFIER RPAREN
> >>              ;
> >>
> >>      LPAREN   : '(' ;
> >>      RPAREN   : ')' ;
> >>      IDENTIFIER
> >>              : ('a'..'z' | 'A'..'Z' | '\u0080'..'\ufffe') (Letter | Digit)*;
> >>
> >>      fragment Letter
> >>              : 'a'..'z' | 'A'..'Z' | '_' |'-' |  '\u0080'..'\ufffe';
> >>
> >>      fragment Digit
> >>              : '0'..'9';
> >>
> >> The input "mytoken(mytoken)" should be valid. The first
> >> 'mytoken' should be recognized as MYTOKEN, but the second
> >> 'mytoken' has to be recognized as an IDENTIFIER. Is there a
> >> way to achieve this?
>
>
> >
> >Not to my knowledge (and this applies to V2.x too). I suspect you need
> >to change your 'mygrammar' rule:
> >
> >       mygrammar : MYTOKEN LPAREN (MYTOKEN|IDENTIFIER) RPAREN ;
> >
> >Micheal
>
> Hei Micheal,
>
> The change you proposed to the rule would not work, as the input is
> still nondeterministic when processed by the lexer ("should I recognize
> an IDENTIFIER or MYTOKEN?"). I'm not sure what takes precedence here.
> The proposed parser rule also alters the nature of the language.
>
> BR.
> Silvester Pozarnik
>

In ANTLR 3 lexers, when two rules match the same input, the rule that
is mentioned first takes precedence, and no warning is given. Literals
specified in the tokens section have precedence over explicit lexer
rules, so MYTOKEN will take precedence: every 'mytoken' in the input
reaches the parser as MYTOKEN. As far as I can see, Michael's proposed
solution should work fine for your needs. To generalise, you could do
something like:

mygrammar: MYTOKEN1 LPAREN idOrKeyword RPAREN;
idOrKeyword: (IDENTIFIER|MYTOKEN1|MYTOKEN2) {input.LT(-1).setType(IDENTIFIER);};

where MYTOKEN1, MYTOKEN2 etc. are your keywords. Wherever keywords are
allowed as identifiers, you use idOrKeyword rather than IDENTIFIER. The
action (unsure of exact syntax there) retypes the matched token as
IDENTIFIER, so later phases don't need to deal with this.
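
To make the retyping idea concrete without depending on the ANTLR
runtime, here is a minimal plain-Java sketch. Everything in it
(KeywordRetypeSketch, Tok, lex, parse) is a hypothetical stand-in for
what the generated lexer and parser would do, not ANTLR API, and the
identifier rule is simplified to Java's notion of a letter:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the generated lexer/parser; not ANTLR API.
class KeywordRetypeSketch {
    static final int MYTOKEN = 1, LPAREN = 2, RPAREN = 3, IDENTIFIER = 4;

    static final class Tok {
        int type;            // mutable so the parser can retype the token
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    // Lexer: the keyword literal always wins over IDENTIFIER.
    static List<Tok> lex(String s) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '(') { out.add(new Tok(LPAREN, "(")); i++; }
            else if (c == ')') { out.add(new Tok(RPAREN, ")")); i++; }
            else if (Character.isLetter(c)) {
                int j = i + 1;
                while (j < s.length() && isIdPart(s.charAt(j))) j++;
                String text = s.substring(i, j);
                out.add(new Tok(text.equals("mytoken") ? MYTOKEN : IDENTIFIER, text));
                i = j;
            } else {
                throw new IllegalArgumentException("unexpected character at index " + i);
            }
        }
        return out;
    }

    static boolean isIdPart(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-';
    }

    // mygrammar : MYTOKEN LPAREN idOrKeyword RPAREN ;
    static List<Tok> parse(String s) {
        List<Tok> t = lex(s);
        if (t.size() != 4) throw new IllegalStateException("expected exactly four tokens");
        expect(t.get(0), MYTOKEN);
        expect(t.get(1), LPAREN);
        idOrKeyword(t.get(2));
        expect(t.get(3), RPAREN);
        return t;
    }

    // idOrKeyword : (IDENTIFIER | MYTOKEN) { retype the token as IDENTIFIER } ;
    static void idOrKeyword(Tok tok) {
        if (tok.type != IDENTIFIER && tok.type != MYTOKEN) {
            throw new IllegalStateException("expected identifier or keyword, got '" + tok.text + "'");
        }
        tok.type = IDENTIFIER;
    }

    static void expect(Tok tok, int type) {
        if (tok.type != type) throw new IllegalStateException("unexpected token '" + tok.text + "'");
    }
}
```

Calling KeywordRetypeSketch.parse("mytoken(mytoken)") leaves the first
token typed MYTOKEN and the third retyped to IDENTIFIER, which is the
effect the action in idOrKeyword is meant to have.
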
Or you can have keywords recognised as IDENTIFIER in your lexer and
then use predicates to test the text in your parser. Something like:

mygrammar: myToken LPAREN IDENTIFIER RPAREN;
myToken: {input.LT(1).getText().equals("mytoken")}? IDENTIFIER
    {input.LT(-1).setType(MYTOKEN);};
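
The predicate variant can be sketched the same way in plain Java. Again
this is only an illustration under assumptions: PredicateKeywordSketch
and its members are hypothetical names, the lexer here never produces
MYTOKEN, and the text check plays the role of the semantic predicate:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the generated lexer/parser; not ANTLR API.
class PredicateKeywordSketch {
    static final int MYTOKEN = 1, LPAREN = 2, RPAREN = 3, IDENTIFIER = 4;

    static final class Tok {
        int type;            // mutable so the parser can retype the token
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    // Lexer: every word is an IDENTIFIER; keywords are not special here.
    static List<Tok> lex(String s) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '(') { out.add(new Tok(LPAREN, "(")); i++; }
            else if (c == ')') { out.add(new Tok(RPAREN, ")")); i++; }
            else if (Character.isLetter(c)) {
                int j = i + 1;
                while (j < s.length() && Character.isLetterOrDigit(s.charAt(j))) j++;
                out.add(new Tok(IDENTIFIER, s.substring(i, j)));
                i = j;
            } else {
                throw new IllegalArgumentException("unexpected character at index " + i);
            }
        }
        return out;
    }

    // mygrammar : myToken LPAREN IDENTIFIER RPAREN ;
    static List<Tok> parse(String s) {
        List<Tok> t = lex(s);
        if (t.size() != 4) throw new IllegalStateException("expected exactly four tokens");
        myToken(t.get(0));
        expect(t.get(1), LPAREN);
        expect(t.get(2), IDENTIFIER);
        expect(t.get(3), RPAREN);
        return t;
    }

    // myToken : { text is "mytoken" }? IDENTIFIER { retype as MYTOKEN } ;
    static void myToken(Tok tok) {
        // equals(), not ==, because Java's == compares object references.
        if (tok.type != IDENTIFIER || !"mytoken".equals(tok.text)) {
            throw new IllegalStateException("expected keyword 'mytoken', got '" + tok.text + "'");
        }
        tok.type = MYTOKEN;
    }

    static void expect(Tok tok, int type) {
        if (tok.type != type) throw new IllegalStateException("unexpected token '" + tok.text + "'");
    }
}
```

With this arrangement the decision moves from the lexer to the parser:
the same text is a keyword only in the position where the grammar asks
for one, and stays an IDENTIFIER everywhere else.
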

Tom.