[antlr-interest] How to switch the recognition scope in Lexer

Jim Idle jimi at temporal-wave.com
Wed Jun 20 07:16:18 PDT 2007


Silvester,

Michael was correct - the precedence thing you are asking about is
really just a question about what the lexer will return.

So: 

1) Remember that the lexer and parser have no connection or interaction.
The lexer is just there to tokenize, and if you have it set up so that
'MYTOKEN' always returns token MYTOKEN, then it always will, regardless
of which parser rule is looking for a token at that point.
2) So, construct your lexer with all the keywords and things placed in
this order (some of this is arbitrary but if you just always use this
order you should be fine):

Keywords
Operators etc
STRING_LITERALS
Identifiers

3) Now create a parser rule that lists all the keywords.
4) Create a parser rule keyw_or_id that matches either that keyword
rule or IDENTIFIER.
5) Wherever a keyword is allowed to be used as an identifier, use the
parser rule keyw_or_id rather than IDENTIFIER.
6) Watch out for places where this causes ambiguity (unfortunately a
number of languages are like this) and solve with predicates.
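
Roughly, steps 2 to 5 might look like the following for your example.
This is only an untested sketch: the rule name 'keyword' is my own
placeholder, MYTOKEN is written as an explicit lexer rule instead of a
tokens{} entry, and the rest is taken from your grammar.

    grammar test;

    mygrammar  : MYTOKEN LPAREN keyw_or_id RPAREN ;

    // 3) every keyword that may also appear as an identifier
    keyword    : MYTOKEN ;

    // 4) either a keyword or a real identifier
    keyw_or_id : keyword | IDENTIFIER ;

    // 2) lexer rules: keywords first, identifiers last
    MYTOKEN    : 'mytoken' ;
    LPAREN     : '(' ;
    RPAREN     : ')' ;
    IDENTIFIER : ('a'..'z' | 'A'..'Z' | '\u0080'..'\ufffe')
                 (Letter | Digit)* ;

    fragment Letter
               : 'a'..'z' | 'A'..'Z' | '_' | '-' | '\u0080'..'\ufffe' ;
    fragment Digit
               : '0'..'9' ;

With that, "mytoken(mytoken)" should come out of the lexer as MYTOKEN
LPAREN MYTOKEN RPAREN, and the parser accepts the second MYTOKEN
through keyw_or_id.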


It can be tricky, but it works.

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of Silvester Pozarnik
> Sent: Wednesday, June 20, 2007 6:12 AM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] How to switch the recognition scope in
> Lexer
> 
> >> Silvester Pozarnik wrote this on [20 June 2007 13:00]:
> >>
> >> In antlr 2.7.7 you could change the behaviour of the Lexer so
> >> that tokens are recognized as literals in special cases by
> >> overriding the testLiteralsTable() method in the CharScanner
> >> class. How do you do the same in antlr 3.0 if you have a
> >> grammar such as:
> >>
> >> 	grammar test;
> >> 	tokens {
> >> 		MYTOKEN = 'mytoken';
> >> 	}
> >> 	mygrammar
> >> 		: MYTOKEN LPAREN IDENTIFIER RPAREN
> >> 		;
> >>
> >> 	LPAREN   : '(' ;
> >> 	RPAREN   : ')' ;
> >> 	IDENTIFIER
> >> 		: ('a'..'z' | 'A'..'Z' | '\u0080'..'\ufffe')
> >> 		  (Letter | Digit)* ;
> >>
> >> 	fragment Letter
> >> 		: 'a'..'z' | 'A'..'Z' | '_' |'-' |  '\u0080'..'\ufffe';
> >>
> >> 	fragment Digit
> >> 		: '0'..'9';
> >>
> >> The input "mytoken(mytoken)" should be valid. The first
> >> 'mytoken' should be recognized as MYTOKEN, but the second
> >> 'mytoken' has to be recognized as an IDENTIFIER. Is there a
> >> way to achieve this?
> 
> 
> >
> >Not to my knowledge (and this applies to V2.x too). I suspect you
> >need to change your 'mygrammar' rule:
> >
> >	mygrammar : MYTOKEN LPAREN (MYTOKEN|IDENTIFIER) RPAREN
> >
> >Micheal
> 
> Hei Micheal,
> 
> The way you proposed to change the rule would not work, as it is still
> nondeterministic when processed by the lexer ("should I recognize an
> IDENTIFIER or MYTOKEN?"). I'm not sure what takes precedence here. The
> proposed parser rule also alters the nature of the language. This was
> just an example anyway - the more general problem is that in some
> languages keywords must, under some condition (scope), be recognized
> as literals (e.g. "...City=Kansas City, ... Idol=Joe Idol", etc.).
> 
> In 2.7.7 you could fix this by adding to your lexer definition:
> 
> class Testlexer extends Lexer;
> 
> {
>   private static List<String> ident_stack = new LinkedList<String>();
> 
>   // Test the token text against the literals table
>   // Override this method to perform a different literals test
>   public int testLiteralsTable(int ttype) {
>     if (ident_stack.size() >= 1 &&
>        "mygrammar".compareToIgnoreCase(
>           ident_stack.get(ident_stack.size()-1) ) == 0) {
>        ident_stack.add(text.toString());
>        return ttype;
>     }
>     ident_stack.add(text.toString());
>     // this is the original stuff
>     hashString.setBuffer(text.getBuffer(), text.length());
>     Integer literalsIndex = (Integer)literals.get(hashString);
>     if (literalsIndex != null) {
>       ttype = literalsIndex.intValue();
>     }
>     return ttype;
>   }
> }
> 
> 
> I could of course redefine a rule as:
> 
> mygrammar : MYTOKEN LPAREN STRINGVALUE RPAREN;
> ...
> STRINGVALUE
> 	:	'\'' ( ~('\''|'\\') )* '\''
> 	;
> 
> But then I have to change the already established syntax of my
> language.
> Any help?
> 
> BR.
> Silvester Pozarnik

