[antlr-interest] How to switch the recognition scope in the Lexer

Wed Jun 20 06:12:14 PDT 2007

>> Silvester Pozarnik wrote this on [20 June 2007 13:00]:
>> 
>> In ANTLR 2.7.7 you could change the behaviour of the lexer so 
>> that tokens are recognized as literals in special cases by 
>> overriding the testLiteralsTable() method in the CharScanner 
>> class. How do you do the same in ANTLR 3.0 if you have a 
>> grammar such as:
>> 
>> 	grammar test;
>> 	tokens {
>> 		MYTOKEN = 'mytoken';
>> 	}
>> 	mygrammar
>> 		: MYTOKEN LPAREN IDENTIFIER RPAREN
>> 		;
>> 
>> 	LPAREN   : '(' ;
>> 	RPAREN   : ')' ;
>> 	IDENTIFIER 
>> 		: ('a'..'z' | 'A'..'Z' | '\u0080'..'\ufffe') (Letter | Digit)*;
>>     
>> 	fragment Letter
>> 		: 'a'..'z' | 'A'..'Z' | '_' |'-' |  '\u0080'..'\ufffe';
>> 
>> 	fragment Digit
>> 		: '0'..'9';    
>> 
>> The input "mytoken(mytoken)" should be valid. The first 
>> 'mytoken' should be recognized as MYTOKEN, but the second 
>> 'mytoken' has to be recognized as an IDENTIFIER. Is there a 
>> way to achieve this?

>
>Not to my knowledge (and this applies to V2.x too). I suspect you need to
>change your 'mygrammar' rule:
>
>	mygrammar : MYTOKEN LPAREN (MYTOKEN|IDENTIFIER) RPAREN 
>
>Micheal

Hi Micheal,

The change you propose to the rule would not work, because the lexer
still faces a nondeterministic choice ("should I recognize an IDENTIFIER
or MYTOKEN?"). I'm not sure which takes precedence here. The proposed
parser rule also alters the nature of the language. This was just an
example anyway; the more general problem is that in some languages
keywords must, under some condition (scope), be recognized as literals
(e.g. "...City=Kansas City, ... Idol=Joe Idol", etc.).
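
To make the "City=Kansas City" case concrete, here is a minimal Java
sketch of the behaviour I am after. It is not ANTLR code. The
comma-separated key=value layout, the class name ScopedKeywordSketch and
the KEYWORD/LITERAL labels are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ScopedKeywordSketch {
    // Splits "Key=Value, Key=Value" input into labelled tokens.
    // Text before '=' is labelled KEYWORD; text after it is labelled
    // LITERAL, even when it contains the same word as the keyword.
    // (The key=value layout is an assumption of this sketch.)
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String pair : input.split(",")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // not a key=value pair; ignored in this sketch
            }
            tokens.add("KEYWORD(" + pair.substring(0, eq).trim() + ")");
            tokens.add("LITERAL(" + pair.substring(eq + 1).trim() + ")");
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [KEYWORD(City), LITERAL(Kansas City), KEYWORD(Idol), LITERAL(Joe Idol)]
        System.out.println(tokenize("City=Kansas City, Idol=Joe Idol"));
    }
}
```

The point is only that the second "City" and the second "Idol" must not
come out as keywords, because they appear inside the value scope.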

In 2.7.7 you could fix this by adding the following to your lexer
definition:

class Testlexer extends Lexer;

{
  // Text of every token tested so far, most recent last
  private static List<String> ident_stack = new LinkedList<String>();

  // Test the token text against the literals table
  // Override this method to perform a different literals test
  public int testLiteralsTable(int ttype) {
    // Inside the scope: keep the incoming type and skip the literals lookup
    if (ident_stack.size() >= 1 &&
       "mygrammar".equalsIgnoreCase(
          ident_stack.get(ident_stack.size()-1) )) {
       ident_stack.add(text.toString());
       return ttype;
    }
    ident_stack.add(text.toString());
    // this is the original stuff
    hashString.setBuffer(text.getBuffer(), text.length());
    Integer literalsIndex = (Integer)literals.get(hashString);
    if (literalsIndex != null) {
      ttype = literalsIndex.intValue();
    }
    return ttype;
  }
}
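
The same idea can be tried outside ANTLR. The following is a minimal
standalone Java sketch, not the ANTLR API: the class LiteralScopeSketch,
the method testLiterals and the token type numbers are hypothetical, and
I assume here that the scope is opened by the text 'mytoken' from the
example grammar and that only identifier-like tokens are tested against
the table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LiteralScopeSketch {
    // Hypothetical token type numbers, chosen only for this sketch.
    public static final int IDENTIFIER = 1;
    public static final int MYTOKEN = 2;

    // Stand-in for the lexer's literals table (keyword text -> token type).
    private final Map<String, Integer> literals = new HashMap<>();
    // Text of every token tested so far, most recent last.
    private final List<String> seen = new ArrayList<>();

    public LiteralScopeSketch() {
        literals.put("mytoken", MYTOKEN);
    }

    // Skips the literals lookup when the previously tested token
    // was the text that opens the scope (assumed to be "mytoken").
    public int testLiterals(String text, int ttype) {
        boolean inScope = !seen.isEmpty()
                && "mytoken".equalsIgnoreCase(seen.get(seen.size() - 1));
        seen.add(text);
        if (inScope) {
            return ttype; // keep the incoming type, e.g. IDENTIFIER
        }
        Integer literalType = literals.get(text);
        return literalType != null ? literalType : ttype;
    }

    public static void main(String[] args) {
        LiteralScopeSketch lexer = new LiteralScopeSketch();
        // Input "mytoken(mytoken)": the parentheses are assumed not to be tested.
        System.out.println(lexer.testLiterals("mytoken", IDENTIFIER)); // prints 2 (MYTOKEN)
        System.out.println(lexer.testLiterals("mytoken", IDENTIFIER)); // prints 1 (IDENTIFIER)
    }
}
```

Note that the scope test looks only at the previously tested text, so in
this sketch any token that follows a 'mytoken' is kept as-is, including
one that follows the inner 'mytoken'.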

I could of course redefine the rule as:

mygrammar : MYTOKEN LPAREN STRINGVALUE RPAREN;
...
STRINGVALUE
	:	'\'' ( ~('\''|'\\') )* '\'' 
	;

But then I have to change the already established syntax of my language.
Any help?

BR.
Silvester Pozarnik