[antlr-interest] Q: Advice on localizing lexer

David-Sarah Hopwood david-sarah at jacaranda.org
Sun Jun 14 13:28:38 PDT 2009


C. Mundi wrote:
> I have created a very simple DSL with antlr for pseudo natural language.
> Nothing special.
> 
> It currently recognizes the usual identifiers:
> 
>                            ID  :    ('a'..'z'|'A'..'Z'|'_')
>                                     ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
> 
> and now I have a Japanese user who wants both identifiers and keywords
> localized in his native language.  The platform is .NET or Mono, and the
> input stream supports UTF-16.  I'd like to solve the problem once for all
> languages and not just Japanese and English.
> 
> Localizing the keywords should be simple enough.  They're different but
> fixed for each language.
> 
> The identifiers are trickier.  I need to exclude any members of the
> (localized) whitespace or ordinal number sets.  So at first I thought
> 
>                            ID  :    ~( WS | DIGIT | KSEP )+  ~( WS | KSEP )*
> 
>                            WS :     <localized list of whitespace codepoints>
>                            DIGIT :  <localized list of ordinal number codepoints>
>                            KSEP:    <localized thousands separator>

Note that localizing thousands and decimal separators is quite problematic,
because of the swapped usage of ',' and '.' in English-speaking vs European
locales, and because ',' may conflict with list separators. Even if you
don't need non-integers or lists now, you might need them in future.

I would advise sticking with the English-speaking usage here (as essentially
all programming languages/DSLs do), but allowing '_' as an internationalised
thousands separator. Localising it is not necessarily doing any favours
to non-English-speaking users, if the result is inconsistent with other
[non-natural] languages they are used to, or if it causes scripts to break
when changing locales.
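
If you do go that way, a rule along these lines (just a sketch of mine,
not part of the grammar below) keeps ',' and '.' out of integer literals
and treats '_' purely as a visual separator:

// Sketch only: accepts e.g. 1_000_000. A production rule would probably
// also want to reject a trailing '_'.
DecimalInteger
  : '0'..'9' ('0'..'9' | '_')*
  ;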

> This turns out to be very naive, and I see this getting ugly fast.  Already
> I have to localize the DSL keywords so there's no way around writing
> multiple lexers.  So far I have only two languages: English and Japanese.
> But if this catches on, other users will want their own.  I'd like to
> minimize the number of lexers I need to maintain or at least maximize code
> reuse between them.

Again, I would recommend internationalisation over localisation.
Just use the syntax for identifiers defined by Unicode (or by Java;
they're almost the same).
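If you're curious where the two definitions disagree, the JDK will tell
you directly; a throwaway snippet like this (mine, unrelated to the
grammar below) prints both classifications for a few sample characters:

public class IdentCompare {
  public static void main(String[] args) {
    // ASCII letter, underscore, dollar, digit, a CJK ideograph (U+6F22),
    // and ZWNJ (U+200C).
    char[] samples = { 'a', '_', '$', '0', '\u6F22', '\u200C' };
    for (char c : samples) {
      System.out.printf("U+%04X java: %b/%b  unicode: %b/%b%n",
          (int) c,
          Character.isJavaIdentifierStart(c), Character.isJavaIdentifierPart(c),
          Character.isUnicodeIdentifierStart(c), Character.isUnicodeIdentifierPart(c));
    }
  }
}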

The following code is adapted from Patrick Hulsmeijer's grammar for
ECMAScript 3, simplified a little. The original was BSD-licensed.
It has ordinary lexer rules for ASCII characters, but for non-ASCII
characters it uses semantic predicates that call Java's built-in
character classification methods (assuming the target language is Java).
This produces a much smaller and more efficient lexer, as well as a
simpler source grammar.

Note that with this approach, any non-ASCII keywords will lex as
identifiers.
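One way to deal with that, and to localise the keywords at the same time,
is to lex everything as Identifier and then retype tokens from a
per-locale keyword table. Roughly like this hypothetical ANTLR 3 token
filter (not part of Patrick's grammar; the token type values would come
from whatever your generated lexer defines):

import org.antlr.runtime.Token;
import org.antlr.runtime.TokenSource;
import java.util.Map;

/** Retypes Identifier tokens whose text appears in a localized keyword
 *  table, e.g. mapping Japanese keyword spellings to the same token types
 *  the English keywords use. */
public class LocalizedKeywordFilter implements TokenSource {
  private final TokenSource delegate;
  private final Map<String, Integer> keywords;

  public LocalizedKeywordFilter(TokenSource delegate, Map<String, Integer> keywords) {
    this.delegate = delegate;
    this.keywords = keywords;
  }

  public Token nextToken() {
    Token t = delegate.nextToken();
    Integer type = keywords.get(t.getText());
    if (type != null) {
      t.setType(type);  // turn the identifier back into a keyword token
    }
    return t;
  }

  public String getSourceName() {
    return delegate.getSourceName();
  }
}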



@lexer::members {
  private static boolean isIdentifierStart(int c) {
    return Character.isJavaIdentifierStart(c);
  }

  private static boolean isIdentifierPart(int c) {
    // ZWNJ (\u200C) and ZWJ (\u200D) are also needed for some languages.
    return Character.isJavaIdentifierPart(c) || c == 0x200C || c == 0x200D;
  }
}

fragment IdentifierStartASCII
  : 'a'..'z' | 'A'..'Z' | '_'
  ;

fragment IdentifierPart
  : IdentifierStartASCII
  | '0'..'9'
  // Any other character is accepted iff the Java predicate above says it
  // can continue an identifier.
  | { isIdentifierPart(input.LA(1)) }? { matchAny(); }
  ;

// This generates mIdentifierRest() used below.
fragment IdentifierRest
  : IdentifierPart*
  ;

Identifier
  : IdentifierStartASCII IdentifierRest
  // Non-ASCII start: check the next character with the predicate, consume
  // it, then lex the remainder of the identifier.
  | { if (!isIdentifierStart(input.LA(1))) throw new NoViableAltException();
      matchAny(); mIdentifierRest(); }
  ;
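
A minimal driver, assuming the grammar above has been generated into a
lexer class -- I'll call it MyDSLLexer here, the name is made up -- might
look like:

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.Token;

public class LexDemo {
  public static void main(String[] args) {
    // A single identifier, since the sketch above defines no whitespace rule.
    MyDSLLexer lexer = new MyDSLLexer(new ANTLRStringStream("\u6F22\u5B57_abc123"));
    for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
      System.out.println(t);
    }
  }
}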

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com