[antlr-interest] Good practice for grammar with translated keywords

Thu Mar 12 08:17:38 PDT 2009

Olivier THIERRY wrote:
> Hi,
>
> I need to write a grammar for which keywords will be translated in
> english, french, spanish, ...
> Then I use StringTemplate to transform this language to Groovy script.
>
> For example I would have the following statement in english :
>
> IF (i = 0) THEN
>
> And the following in french :
>
> SI (i = 0) ALORS
>
> To do this I thought about writing :
> - many lexer grammar for keywords (i.e. translated tokens), one lexer
> grammar for each language
> - one lexer grammar for not translated tokens
> - one parser grammar that would import the not translated tokens lexer
> grammar and one of the translated tokens lexer grammar.
>
> Actually only the first lexer grammar is language specific, the other
> ones are common.
> But I can't find the right way to do this since tokens have to be
> imported in parser grammar. So it means you will have a parser grammar
> for each language.
>
> I also thought about using or statements in keywords tokens
> definition. Something like that : IF : 'IF' | 'SI';
> But it means you could mix languages, something like : IF (i=0) ALORS
>
> If anyone had the same need, how did he achieve this ?
>   
One way is to hand craft your lexer. This can then use a table of 
keywords, which you can load according to the current language settings. 
A reasonable way to see how to do this is to generate a lexer for just a 
small rule set, then see what it inherits from and what methods it 
implements etc.
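
To make the table-driven idea concrete, here is a minimal sketch of a hand-written lexer that classifies words through a keyword table. It is plain Java and does not extend any ANTLR class, so the names HandLexer and Tok and the token type numbers are assumptions for illustration only; a real version would derive from whatever your generated lexer inherits from.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hand-written lexer; not an ANTLR class.
final class HandLexer {
    // Made-up token type numbers (ANTLR would normally generate these).
    static final int ID = 1, IF = 2, THEN = 3, OTHER = 4;

    // Minimal token: a type plus the text that was matched.
    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    // Keyword text -> token type, loaded for the current language.
    private final Map<String, Integer> keywords;

    HandLexer(Map<String, Integer> keywords) {
        this.keywords = new HashMap<>(keywords);
    }

    List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // skip whitespace
            } else if (Character.isLetter(c)) {
                // Scan a word, then decide whether it is a keyword or an ID.
                int start = i;
                while (i < input.length()
                        && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;
                }
                String word = input.substring(start, i);
                Integer type = keywords.get(word);
                out.add(new Tok(type != null ? type : ID, word));
            } else {
                // Everything else is a single-character token in this sketch.
                out.add(new Tok(OTHER, String.valueOf(c)));
                i++;
            }
        }
        return out;
    }
}
```

With a French table (SI -> IF, ALORS -> THEN), tokenizing "SI (i = 0) ALORS" gives IF as the first token type and THEN as the last, while i comes back as a plain ID.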

Another way (and probably easier for you in this situation), assuming 
that there are no complications with lexical significance, is to not 
specify keywords in the lexer at all. Instead, add action code to your ID 
rule that looks up the matched text and changes the token type if it is 
a keyword in the current language. Something like this:

// Define token symbols for use in tables and parser
//
fragment IF:;
fragment THEN:;

ID : ('a'..'z' | 'A'..'Z' /* plus e acute and so on, if keywords can use them */)
     ('a'..'z' | 'A'..'Z' | '0'..'9')*
    {
       $type = checkKeyword($text);
    }
    ;

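@lexer::members
{
   // Keyword text -> token type. Initialize and install this according
   // to the current language before lexing starts.
   java.util.Map<String, Integer> keywords =
       new java.util.HashMap<String, Integer>();

   int checkKeyword(String id)
   {
        // Look up the text in the map for the current language.
        // If found, return the token type in the map; if not, return ID.
        //
        Integer type = keywords.get(id);
        return type != null ? type : ID;
   }
}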

Though I show this inline with the lexer, the best way is to have the 
lexer inherit from a superclass and place the code and table 
initializations in the superclass. You will then have something like:

lexer grammar mylexer;

options {

    superClass     = AbstractMyLexer;

}
....
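import org.antlr.runtime.CharStream;
import org.antlr.runtime.RecognizerSharedState;

// Abstract because the generated lexer supplies the rule methods.
public abstract class AbstractMyLexer extends org.antlr.runtime.Lexer {

    protected AbstractMyLexer() {
    }

    protected AbstractMyLexer(CharStream input) {
        super(input);
    }

    protected AbstractMyLexer(CharStream input, RecognizerSharedState state) {
        super(input, state);
    }

    // ... initialize your tables in the constructors above ...
    // ... implement support methods such as checkKeyword ...
}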

Now, you program your parser with the ENGLISH token names (or French if 
you prefer, of course), but the text of the token will always be the 
definition in the current language. So you can use the symbolic name for 
parsing and error lookups, but the token text for error messages.

Initialize the HashMaps so that their values are always IF or THEN etc, 
but their keys are the token text for the current language:

toktab_fr {
  'SI'    : IF,
  'ALORS' : THEN
}

toktab_en {
  'IF'   : IF,
  'THEN' : THEN
}
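
In Java, the two tables above could be built along these lines. This is only a sketch: the class name KeywordTables, the language codes "en" and "fr", and the token type numbers are assumptions for illustration, and in a real lexer the constants would come from the generated code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of ANTLR.
final class KeywordTables {
    // Made-up token type numbers standing in for the generated constants.
    static final int ID = 4, IF = 5, THEN = 6;

    // Build the keyword table for one language. The values are always
    // the same symbolic types; only the keys change per language.
    static Map<String, Integer> forLanguage(String lang) {
        Map<String, Integer> table = new HashMap<>();
        if ("fr".equals(lang)) {
            table.put("SI", IF);
            table.put("ALORS", THEN);
        } else if ("en".equals(lang)) {
            table.put("IF", IF);
            table.put("THEN", THEN);
        } else {
            throw new IllegalArgumentException(
                "No keyword table for language: " + lang);
        }
        return table;
    }

    // Keyword type if the text is in the table, otherwise ID.
    static int typeOf(Map<String, Integer> table, String text) {
        return table.getOrDefault(text, ID);
    }
}
```

Because only one table is installed at a time, a word such as ALORS is classified as a plain ID when the English table is loaded, so a mixed statement like IF (i=0) ALORS should be rejected by the parser rather than silently accepted.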

Hope that helps,

Jim
PS: You will probably find the superclass stuff easiest if you are not 
familiar with lexers or implementing ANTLR lexers by hand.
