[antlr-interest] Q: Advice on localizing lexer

C. Mundi cmundi at gmail.com
Sat Jun 13 22:12:54 PDT 2009


Hi.

Has anyone here faced or solved this problem before?  I am a novice with
Unicode, to say nothing of human languages.

I have created a very simple DSL with ANTLR for a pseudo-natural language.
Nothing special.

It currently recognizes the usual identifiers:

                                ID  :    ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')* ;

and now I have a Japanese user who wants both identifiers and keywords
localized in his native language.  The platform is .NET or Mono, and the
input stream supports UTF-16.  I'd like to solve the problem once for all
languages and not just Japanese and English.

Localizing the keywords should be simple enough.  They're different but
fixed for each language.
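
A minimal sketch of what "different but fixed for each language" could look like, written in Python rather than ANTLR purely for illustration. The keyword spellings, the token names, and the `classify_word` helper are all hypothetical; the actual DSL keywords are not shown in this post.

```python
# Hypothetical per-locale keyword tables. The real DSL keywords are not
# given in this post; these entries only illustrate the shape of the data.
KEYWORDS = {
    "en": {"if": "IF", "then": "THEN", "else": "ELSE"},
    "ja": {"もし": "IF", "ならば": "THEN", "さもなければ": "ELSE"},
}


def classify_word(word, locale):
    """Return the keyword token type for `word` in `locale`, or "ID" otherwise."""
    return KEYWORDS[locale].get(word, "ID")
```

With a table like this, every locale maps onto the same token types, so only the table would differ per language and whatever consumes the tokens could stay shared.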

The identifiers are trickier.  I need to exclude any members of the
(localized) whitespace or ordinal number sets.  So at first I thought

                               ID  :    ~( WS | DIGIT | KSEP )+  ~( WS | KSEP )*


                                WS    :  <localized list of whitespace codepoints>
                                DIGIT :  <localized list of ordinal number codepoints>
                                KSEP  :  <localized thousands separator>
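
The rule above can be sketched as a character-class check, again in Python rather than ANTLR and only as an illustration. It assumes that Unicode properties can stand in for the per-language whitespace and digit lists; the `KSEP` set and the function names are hypothetical, since the real separator list is not given here.

```python
import unicodedata

# Assumption: a small illustrative set of separator characters.
# The real per-locale thousands separators are not listed in this post.
KSEP = {",", ".", "\u3001", "\uff0c"}


def is_ws(ch):
    # str.isspace() is Unicode-aware (it accepts U+3000 IDEOGRAPHIC SPACE).
    return ch.isspace()


def is_digit(ch):
    # "Nd" is the Unicode category for decimal digits in any script,
    # e.g. ASCII 0-9 and fullwidth digits.
    return unicodedata.category(ch) == "Nd"


def is_identifier(text):
    """Mirror ID: first char not WS/DIGIT/KSEP, remaining chars not WS/KSEP."""
    if not text:
        return False
    first = text[0]
    if is_ws(first) or is_digit(first) or first in KSEP:
        return False
    return not any(is_ws(ch) or ch in KSEP for ch in text[1:])
```

One limitation of this approach: kanji numerals such as 三 have category "Lo" (letter), not "Nd", so a check based on general categories alone would still accept them as the first character of an identifier.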

This turns out to be very naive, and I see this getting ugly fast.  Already
I have to localize the DSL keywords so there's no way around writing
multiple lexers.  So far I have only two languages: English and Japanese.
But if this catches on, other users will want their own.  I'd like to
minimize the number of lexers I need to maintain or at least maximize code
reuse between them.

I figure this question must come up for DSLs pretty regularly.  Although we
more or less accept using a subset of Latin characters -- and usually just
ASCII -- for *general* purpose programming, the use case for DSLs almost
begs for localized identifiers and keywords.  The users in this case are
ordinary business people, not programmers.

Any advice?

Thanks,
-CM