[antlr-interest] Good practice for grammar with translated keywords

Olivier THIERRY olivier.thierry at gmail.com
Thu Mar 12 09:55:20 PDT 2009


2009/3/12 Olivier THIERRY <olivier.thierry at gmail.com>:
> 2009/3/12 Jim Idle <jimi at temporal-wave.com>:
>> Olivier THIERRY wrote:
>>> Hi,
>>>
>>> I need to write a grammar whose keywords will be translated into
>>> English, French, Spanish, ...
>>> I then use StringTemplate to transform this language into a Groovy script.
>>>
>>> For example, I would have the following statement in English:
>>>
>>> IF (i = 0) THEN
>>>
>>> And the following in French:
>>>
>>> SI (i = 0) ALORS
>>>
>>> To do this I thought about writing:
>>> - several lexer grammars for the keywords (i.e. the translated tokens),
>>> one lexer grammar per language
>>> - one lexer grammar for the untranslated tokens
>>> - one parser grammar that would import the untranslated-tokens lexer
>>> grammar and one of the translated-tokens lexer grammars.
>>>
>>> Only the first kind of lexer grammar is language specific; the other
>>> grammars are common.
>>> But I can't find the right way to do this, since the tokens have to be
>>> imported into the parser grammar, which means one parser grammar
>>> for each language.
>>>
>>> I also thought about using alternatives in the keyword token
>>> definitions, something like: IF : 'IF' | 'SI';
>>> But that means you could mix languages, as in: IF (i=0) ALORS
>>>
>>> If anyone has had the same need, how did you solve it?
>>>
>> One way is to hand craft your lexer. This can then use a table of
>> keywords, which you can load according to the current language settings.
>> A reasonable way to see how to do this is to generate a lexer for just a
>> small rule set, then see what it inherits from and what methods it
>> implements etc.
>>
>> Another way (and probably easier for you in this situation), assuming
>> that there are no complications with lexical significance, is to not
>> specify keywords in the lexer at all, but to add action code to your ID
>> rule that looks up the text that looks like an identifier and
>> changes the token type if it is a keyword in the current language.
>> Something like this:
>>
>>
>> // Define token symbols for use in tables and parser
>> //
>> fragment IF:;
>> fragment THEN:;
>>
>> // Add Unicode letters such as '\u00E9' (e acute) to the first set if
>> // keywords can use them.
>> ID : ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')*
>>    {
>>       $type = checkKeyword($text);
>>    }
>>    ;
>>
>>
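>> @lexer::members
>> {
>>   // Keyword text -> token type, initialized and installed according to
>>   // the current language (the field name "keywords" is illustrative).
>>   java.util.Map<String, Integer> keywords = new java.util.HashMap<String, Integer>();
>>
>>   int checkKeyword(String id)
>>   {
>>        // If the text is found in the table, return the token type stored
>>        // there; if not, return ID.
>>        Integer type = keywords.get(id);
>>        return type != null ? type : ID;
>>   }
>> }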
>>
>> Though I show this inline with the lexer, the best way is to have the
>> lexer inherit from a superclass and place the code and table
>> initializations in the superclass. You will then have something like:
>>
>> lexer grammar mylexer;
>>
>> options {
>>
>>    superClass     = AbstractMyLexer;
>>
>> }
>> ....
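>> import org.antlr.runtime.CharStream;
>> import org.antlr.runtime.Lexer;
>> import org.antlr.runtime.RecognizerSharedState;
>>
>> // Declared abstract because org.antlr.runtime.Lexer leaves mTokens()
>> // for the generated lexer to implement.
>> public abstract class AbstractMyLexer extends Lexer {
>>
>>    protected AbstractMyLexer() {
>>    }
>>
>>    protected AbstractMyLexer(CharStream input) {
>>        super(input);
>>    }
>>
>>    protected AbstractMyLexer(CharStream input, RecognizerSharedState state) {
>>        super(input, state);
>>    }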
>>
>>  .... initialize your tables in the constructors above..
>>  ... implement support methods...
>>
>> Now, you program your parser with the ENGLISH token names (or French if
>> you prefer of course), but the text of the token will always be the
>> definition in the current language, so you can use the symbolic name for
>> parsing and error lookups, but the token text for error messages.
>>
>> Initialize the HashMaps so that their values are always IF or THEN etc,
>> but their keys are the token text for the current language:
>>
>> toktab_fr {
>>   'SI'    : IF,
>>   'ALORS' : THEN
>> }
>>
>> toktab_en {
>>   'IF'   : IF,
>>   'THEN' : THEN
>> }
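
The two tables above are pseudocode. A minimal Java sketch of the same idea
follows, assuming the token types are plain int constants as in an
ANTLR-generated lexer. The class name KeywordTables, the method tableFor, the
language codes "en"/"fr" and the numeric values of ID, IF and THEN are all
stand-ins invented for this example, so that it runs without the ANTLR runtime.

```java
import java.util.HashMap;
import java.util.Map;

public class KeywordTables {
    // Stand-in token types; in a real lexer these constants are generated by ANTLR.
    public static final int ID = 4;
    public static final int IF = 5;
    public static final int THEN = 6;

    // Build the keyword table for one language ("fr" or, by default, English).
    // Keys are the keyword text in that language, values are always IF, THEN, ...
    public static Map<String, Integer> tableFor(String language) {
        Map<String, Integer> table = new HashMap<String, Integer>();
        if ("fr".equals(language)) {
            table.put("SI", IF);
            table.put("ALORS", THEN);
        } else {
            table.put("IF", IF);
            table.put("THEN", THEN);
        }
        return table;
    }

    // Return the keyword's token type, or ID when the text is not a keyword
    // in the given table.
    public static int checkKeyword(Map<String, Integer> table, String text) {
        Integer type = table.get(text);
        return type != null ? type : ID;
    }

    public static void main(String[] args) {
        Map<String, Integer> fr = tableFor("fr");
        System.out.println(checkKeyword(fr, "SI") == IF);   // French keyword
        System.out.println(checkKeyword(fr, "IF") == ID);   // plain identifier in French mode
    }
}
```

Because only one table is installed at a time, a French source cannot mix in
English keywords: "IF" is simply an identifier while the French table is active.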
>>
>> Hope that helps,
>>
>> Jim
>> PS: You will probably find the superclass stuff easiest if you are not
>> familiar with lexers or implementing ANTLR lexers by hand.
>>
>> Jim
>>
>
> Thanks a lot for your suggestions. I will now try to understand better
> since I am quite a newbie with Antlr.
>
> Regards,
>
> Olivier
>

I tried it and it works great!
Note that I wrote all the code in the @members section instead of using a
superclass for the lexer, because the superclass does not compile: it lacks
the constants defined for the tokens in the generated lexer.
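
For the record, one possible way around the missing constants, sketched here
without the ANTLR runtime: keep the superclass free of any token names and let
the subclass, which does own the constants, register the keywords. The class
names KeywordLexerBase and FrenchDemoLexer, the methods addKeyword and typeOf,
and the numeric token values are hypothetical stand-ins; in a real grammar the
registration code would presumably still sit in @members, and I have not
checked this against a generated lexer.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the hand-written superclass. It never names IF, THEN or ID,
// so it compiles on its own, without the generated token constants.
abstract class KeywordLexerBase {
    private final Map<String, Integer> keywords = new HashMap<String, Integer>();

    // Called by the subclass to register one keyword of the current language.
    protected void addKeyword(String text, int tokenType) {
        keywords.put(text, tokenType);
    }

    // Return the registered token type, or defaultType if text is not a keyword.
    protected int checkKeyword(String text, int defaultType) {
        Integer type = keywords.get(text);
        return type != null ? type : defaultType;
    }
}

// Stand-in for the generated lexer, which is where the constants live.
class FrenchDemoLexer extends KeywordLexerBase {
    static final int ID = 4, IF = 5, THEN = 6;

    FrenchDemoLexer() {
        addKeyword("SI", IF);
        addKeyword("ALORS", THEN);
    }

    // Plays the role of the action in the ID rule.
    int typeOf(String text) {
        return checkKeyword(text, ID);
    }
}

public class SuperclassDemo {
    public static void main(String[] args) {
        FrenchDemoLexer lexer = new FrenchDemoLexer();
        System.out.println(lexer.typeOf("ALORS") == FrenchDemoLexer.THEN);
        System.out.println(lexer.typeOf("foo") == FrenchDemoLexer.ID);
    }
}
```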

Thanks a lot for your help.

Olivier