[antlr-interest] Good practice for grammar with translated keywords

Thu Mar 12 08:25:46 PDT 2009

2009/3/12 Jim Idle <jimi at temporal-wave.com>:
> Olivier THIERRY wrote:
>> Hi,
>>
>> I need to write a grammar whose keywords will be translated into
>> English, French, Spanish, ...
>> Then I use StringTemplate to transform this language to Groovy script.
>>
>> For example, I would have the following statement in English:
>>
>> IF (i = 0) THEN
>>
>> And the following in French:
>>
>> SI (i = 0) ALORS
>>
>> To do this I thought about writing :
>> - several lexer grammars for keywords (i.e. translated tokens), one
>> per language
>> - one lexer grammar for untranslated tokens
>> - one parser grammar that would import the untranslated-tokens lexer
>> grammar and one of the translated-tokens lexer grammars.
>>
>> Actually only the first lexer grammar is language specific, the other
>> ones are common.
>> But I can't find the right way to do this since tokens have to be
>> imported in parser grammar. So it means you will have a parser grammar
>> for each language.
>>
>> I also thought about using alternatives in the keyword token
>> definitions. Something like: IF : 'IF' | 'SI';
>> But that means you could mix languages, e.g.: IF (i=0) ALORS
>>
>> If anyone has had the same need, how did you achieve this?
>>
> One way is to hand craft your lexer. This can then use a table of
> keywords, which you can load according to the current language settings.
> A reasonable way to see how to do this is to generate a lexer for just a
> small rule set, then see what it inherits from and what methods it
> implements etc.
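
To make the keyword-table idea concrete, here is a minimal sketch of a hand-written scanner in plain Java. It is only an illustration under stated assumptions: the class HandLexerSketch, the Tok type, and the token type numbers are all hypothetical, and the class does not extend or implement any ANTLR runtime type, which a real hand-crafted ANTLR lexer would need to do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone scanner; it does not extend any ANTLR class.
class HandLexerSketch {
    // Illustrative token type numbers (assumed, not taken from a generated lexer).
    static final int ID = 1, IF = 2, THEN = 3, OTHER = 4;

    // One scanned token: its symbolic type plus the source text.
    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    // Keyword table for the current language: keyword text -> token type.
    private final Map<String, Integer> keywords;

    HandLexerSketch(Map<String, Integer> keywords) {
        this.keywords = keywords;
    }

    List<Tok> scan(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // skip whitespace
            } else if (Character.isLetter(c)) {
                // Read a whole word, then decide whether it is a keyword or an ID.
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;
                }
                String word = input.substring(start, i);
                out.add(new Tok(keywords.getOrDefault(word, ID), word));
            } else {
                // Anything else becomes a one-character token in this sketch.
                out.add(new Tok(OTHER, String.valueOf(c)));
                i++;
            }
        }
        return out;
    }
}
```

Loading a different map before scanning is all it takes to switch languages here; the scanning loop itself never mentions a keyword.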
>
> One other way (and probably easier for you in this situation) assuming
> that there are not complications with lexical significance, is to not
> specify keywords in the lexer at all, but add action code to your ID
> rule that looks up the text that looks like it is an identifier and
> changes the token type if it is a keyword in the current language.
> Something like this:
>
>
> // Define token symbols for use in tables and parser
> //
> fragment IF:;
> fragment THEN:;
>
> ID : ('a'..'z' | 'A'..'Z' /* plus Unicode characters such as e acute
>                              if keywords can use them */)
>      ('a'..'z' | 'A'..'Z' | '0'..'9')*
>    {
>       $type = checkKeyword($text);
>    }
>    ;
>
>
> @lexer::members
> {
>   int checkKeyword(String id)
>   {
>        // Look up text in a HashMap that you have initialized and
>        // installed according to the current language.
>        // If found, return the token type in the map, if not, return ID
>        //
>   }
> }
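
One possible body for checkKeyword, written as plain Java so it can be tried outside a grammar. This is a sketch under assumptions: the class KeywordCheckSketch, the addKeyword helper, and the token type numbers are hypothetical; in a generated lexer the ID, IF and THEN constants would come from ANTLR rather than being declared by hand.

```java
import java.util.HashMap;
import java.util.Map;

class KeywordCheckSketch {
    // Illustrative token type numbers (assumed); ANTLR would generate the real ones.
    static final int ID = 4, IF = 5, THEN = 6;

    // Keyword text in the current language -> token type.
    private final Map<String, Integer> keywords = new HashMap<>();

    // Hypothetical helper used to install the table for the current language.
    void addKeyword(String text, int type) {
        keywords.put(text, type);
    }

    // Returns the keyword's token type if the text is in the table, otherwise ID.
    int checkKeyword(String id) {
        Integer type = keywords.get(id);
        return type != null ? type : ID;
    }
}
```

With "SI" and "ALORS" installed, checkKeyword("SI") yields IF, while any other identifier-shaped text falls through to ID.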
>
> Though I show this inline with the lexer, the best way is to have the
> lexer inherit from a superclass and place the code and table
> initializations in the superclass. You will then have something like:
>
> lexer grammar mylexer;
>
> options {
>
>    superClass     = AbstractMyLexer;
>
> }
> ....
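> import org.antlr.runtime.CharStream;
> import org.antlr.runtime.RecognizerSharedState;
>
> // Abstract because org.antlr.runtime.Lexer leaves mTokens() to the
> // generated lexer.
> public abstract class AbstractMyLexer extends org.antlr.runtime.Lexer {
>
>    protected AbstractMyLexer() {
>    }
>
>    protected AbstractMyLexer(CharStream input) {
>        super(input);
>    }
>
>    protected AbstractMyLexer(CharStream input, RecognizerSharedState state) {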
>        super(input, state);
>    }
>
>  .... initialize your tables in the constructors above..
>  ... implement support methods...
>
> Now, you program your parser with the ENGLISH token names (or French if
> you prefer, of course), but the text of the token will always be the
> definition in the current language, so you can use the symbolic name for
> parsing and error lookups, but the token text for error messages.
>
> Initialize the HashMaps so that their values are always IF or THEN etc,
> but their keys are the token text for the current language:
>
> toktab_fr {
>  'SI' : IF,
>  'ALORS' : THEN
> }
>
> toktab_en {
>  'IF' : IF,
>  'THEN' : THEN
> }
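
The two tables above could be built in Java roughly as follows. This is a sketch only: the class KeywordTablesSketch, the tableFor and typeOf helpers, the "fr"/"en" language codes, and the token type numbers are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class KeywordTablesSketch {
    // Illustrative token type numbers (assumed); ANTLR would generate the real ones.
    static final int ID = 4, IF = 5, THEN = 6;

    // Builds the table for one language: keys are that language's keyword
    // text, values are always the same symbolic token types.
    static Map<String, Integer> tableFor(String language) {
        Map<String, Integer> table = new HashMap<>();
        switch (language) {
            case "fr":
                table.put("SI", IF);
                table.put("ALORS", THEN);
                break;
            case "en":
                table.put("IF", IF);
                table.put("THEN", THEN);
                break;
            default:
                throw new IllegalArgumentException("no keyword table for " + language);
        }
        return table;
    }

    // Token type of a word when only the given language's table is installed.
    static int typeOf(String word, String language) {
        return tableFor(language).getOrDefault(word, ID);
    }
}
```

Because only one table is installed at a time, ALORS in an English program comes back as a plain ID, so a parser expecting THEN would reject the mixed form IF (i=0) ALORS.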
>
> Hope that helps,
>
> Jim
> PS: You will probably find the superclass stuff easiest if you are not
> familiar with lexers or implementing ANTLR lexers by hand.
>
> Jim
>

Thanks a lot for your suggestions. I will now try to understand them
better, since I am quite a newbie with ANTLR.

Regards,

Olivier