[antlr-interest] Q: Advice on localizing lexer
David-Sarah Hopwood
david-sarah at jacaranda.org
Sun Jun 14 13:28:38 PDT 2009
C. Mundi wrote:
> I have created a very simple DSL with antlr for pseudo natural language.
> Nothing special.
>
> It currently recognizes the usual identifiers:
>
> ID : ('a'..'z'|'A'..'Z'|'_')
> ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
>
> and now I have a Japanese user who wants both identifiers and keywords
> localized in his native language. The platform is .NET or Mono, and the
> input stream supports UTF-16. I'd like to solve the problem once for all
> languages and not just Japanese and English.
>
> Localizing the keywords should be simple enough. They're different but
> fixed for each language.
>
> The identifiers are trickier. I need to exclude any members of the
> (localized) whitespace or ordinal number sets. So at first I thought
>
> ID : ~( WS | DIGIT | KSEP )+ ~( WS | KSEP )*
>
> WS : <localized list of whitespace codepoints>
> DIGIT : <localized list of ordinal number codepoints>
> KSEP: <localized thousands separator>
Note that localizing thousands and decimal separators is quite problematic,
because of the swapped usage of ',' and '.' in English-speaking vs European
locales, and because ',' may conflict with list separators. Even if you
don't need non-integers or lists now, you might need them in future.
I would advise sticking with the English-speaking usage here (as essentially
all programming languages/DSLs do), but allowing '_' as an internationalised
thousands separator. Localising it is not necessarily doing any favours
to non-English-speaking users, if the result is inconsistent with other
[non-natural] languages they are used to, or if it causes scripts to break
when changing locales.
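For example, a rule along these lines would do it (a sketch only; the
rule name and the choice to strip separators in the lexer are mine, not
part of your DSL, and it doesn't police separator placement, so "1_"
is accepted):

Int
    : '0'..'9' ('0'..'9' | '_')*
      // Strip separators so the token text is a plain number,
      // e.g. "1_000_000" becomes "1000000".
      { setText(getText().replace("_", "")); }
    ;

The token text then reads the same in every locale, so scripts keep
working when users switch languages.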
> This turns out to be very naive, and I see this getting ugly fast. Already
> I have to localize the DSL keywords so there's no way around writing
> multiple lexers. So far I have only two languages: English and Japanese.
> But if this catches on, other users will want their own. I'd like to
> minimize the number of lexers I need to maintain or at least maximize code
> reuse between them.
Again, I would recommend internationalisation over localisation.
Just use the syntax for identifiers defined by Unicode (or by Java;
they're almost the same).
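The differences are minor; for instance, Java additionally treats
currency symbols such as '$' as identifier starts. A quick way to see
this (assuming a Java target; this snippet is mine, not from the
grammar below):

public class IdentCheck {
    public static void main(String[] args) {
        // '$' is a Java identifier start but not a Unicode identifier start.
        System.out.println(Character.isJavaIdentifierStart('$'));     // true
        System.out.println(Character.isUnicodeIdentifierStart('$'));  // false
        // CJK ideographs qualify under both definitions.
        System.out.println(Character.isJavaIdentifierStart('日'));    // true
        System.out.println(Character.isUnicodeIdentifierStart('日')); // true
    }
}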
The following code is adapted from Patrick Hulsmeijer's grammar for
ECMAScript 3, simplified a little. The original was BSD-licensed.
It has ordinary lexer rules for ASCII characters, but for non-ASCII
characters it uses semantic predicates that call Java's built-in
character classification methods (assuming the target language is Java).
This produces a much smaller and more efficient lexer, as well as a
simpler source grammar.
Note that with this approach, any non-ASCII keywords will lex as
identifiers; one way to handle that is sketched after the grammar below.
@lexer::members {
    private static boolean isIdentifierStart(int c) {
        return Character.isJavaIdentifierStart(c);
    }

    private static boolean isIdentifierPart(int c) {
        // ZWNJ (\u200C) and ZWJ (\u200D) are also needed for some languages.
        return Character.isJavaIdentifierPart(c) || c == 0x200C || c == 0x200D;
    }
}
fragment IdentifierStartASCII
    : 'a'..'z' | 'A'..'Z' | '_'
    ;

fragment IdentifierPart
    : IdentifierStartASCII
    | '0'..'9'
      // Non-ASCII continuation: accept any character the Java predicate allows.
    | { isIdentifierPart(input.LA(1)) }? { matchAny(); }
    ;

// This generates mIdentifierRest(), which is called below.
fragment IdentifierRest
    : IdentifierPart*
    ;

Identifier
    : IdentifierStartASCII IdentifierRest
      // Non-ASCII start: validate with the Java predicate, then consume the
      // character and continue with the shared rest-of-identifier rule.
    | { if (!isIdentifierStart(input.LA(1))) throw new NoViableAltException();
        matchAny(); mIdentifierRest(); }
    ;
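Since localized keywords lex as identifiers here, one way to support
them without forking the lexer (a sketch of my own, not part of
Hulsmeijer's grammar; the IF token and the Japanese string are
hypothetical) is to keep a per-locale keyword table and retag the token
once an identifier has matched:

// Add to the existing @lexer::members section:
private static final java.util.Map<String, Integer> KEYWORDS =
    new java.util.HashMap<String, Integer>();
static {
    KEYWORDS.put("if",   IF);  // English keyword
    KEYWORDS.put("もし", IF);  // hypothetical Japanese equivalent
}

Identifier
    : IdentifierStartASCII IdentifierRest
      { Integer kw = KEYWORDS.get(getText());
        if (kw != null) $type = kw; }
    | { if (!isIdentifierStart(input.LA(1))) throw new NoViableAltException();
        matchAny(); mIdentifierRest();
        // Same retagging on the non-ASCII alternative.
        Integer kw = KEYWORDS.get(getText());
        if (kw != null) $type = kw; }
    ;

That keeps the grammar itself locale-independent: only the keyword
table changes per language, which should maximize the code reuse
you're after.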
--
David-Sarah Hopwood ⚥ http://davidsarah.livejournal.com