[antlr-interest] Q: Advice on localizing lexer

C. Mundi cmundi at gmail.com
Sun Jun 14 16:01:50 PDT 2009


Thanks for your detailed reply.  I apologize for "top posting" but you make
some great points and I want to reply to a couple of them below...

On Sun, Jun 14, 2009 at 1:28 PM, David-Sarah Hopwood <david-sarah at jacaranda.org> wrote:

>
> Note that localizing thousands and decimal separators is quite problematic,
> because of the swapped usage of ',' and '.' in English-speaking vs European
> locales, and because ',' may conflict with list separators. Even if you
> don't need non-integers or lists now, you might need them in future.


This actually is the easiest part.


> I would advise sticking with the English-speaking usage here (as essentially
> all programming languages/DSLs do), but allowing '_' as an internationalised
> thousands separator. Localising it is not necessarily doing any favours
> to non-English-speaking users, if the result is inconsistent with other
> [non-natural] languages they are used to, or if it causes scripts to break
> when changing locales.
>
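
(To make the '_' suggestion concrete, I read it as a number rule roughly along
these lines; this is only my own sketch, with made-up rule names, in ANTLR 3
syntax:)

    Number
        : Digit ( Digit | '_' )*   // '_' tolerated after the first digit;
                                   // a real rule would be stricter and would
                                   // strip the separators before converting
                                   // the token text to a value
        ;

    fragment Digit
        : '0'..'9'
        ;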

I'm trying to convince my users to stick with the English-speaking usage, as
you suggest, but some customers can be quite "persistent," you know.

I believe (proof omitted) that if I can localize, then I also have all the
information needed to write a language translator, such that a runnable
script in one dialect would be both runnable and semantically equivalent in
all dialects. The only real requirement is that the mapping between the
localized token representations be one-to-one. So what really makes this
problem interesting (at least to me) is that I'm really trying to teach a
lexer some semantics.
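
To make the one-to-one point concrete (with made-up names, and in Java since
that is our target language): if each dialect's keyword table is injective in
both directions, the translator can rewrite a script keyword by keyword in
either direction. A minimal sketch of such a table:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical keyword table for one dialect.  The check in add() enforces
    // the one-to-one property that makes translation reversible.
    public class KeywordDialect {
        private final Map<String, String> toLocal   = new HashMap<String, String>();
        private final Map<String, String> toEnglish = new HashMap<String, String>();

        public void add(String english, String local) {
            if (toLocal.containsKey(english) || toEnglish.containsKey(local)) {
                throw new IllegalArgumentException("keyword mapping must be one-to-one");
            }
            toLocal.put(english, local);
            toEnglish.put(local, english);
        }

        public String localize(String englishKeyword)   { return toLocal.get(englishKeyword); }
        public String canonicalize(String localKeyword) { return toEnglish.get(localKeyword); }
    }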

> Again, I would recommend internationalisation over localisation.
> Just use the syntax for identifiers defined by Unicode (or by Java;
> they're almost the same).
>

This is probably the most practical thing to do, I agree.


> The following code is adapted from Patrick Hulsmeijer's grammar for
> ECMAScript 3, simplified a little. The original was BSD-licensed.
> It has ordinary lexer rules for ASCII characters, but for non-ASCII
> characters it uses semantic predicates that call Java's built-in
> character classification methods (assuming the target language is Java).
> This produces a much smaller and more efficient lexer, as well as a
> simpler source grammar.


Thanks!
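
For the archive, here is my own rough sketch of that technique as I understand
it, in ANTLR 3 syntax with a Java target. The rule and helper names below are
mine and purely illustrative; this is not the BSD-licensed code from Patrick's
grammar:

    @lexer::members {
        // Thin wrappers over Java's built-in character classification.
        private boolean isIdentifierStartUnicode(int ch) {
            return Character.isJavaIdentifierStart(ch);
        }
        private boolean isIdentifierPartUnicode(int ch) {
            return Character.isJavaIdentifierPart(ch);
        }
    }

    fragment IdentifierStartASCII
        : 'a'..'z' | 'A'..'Z' | '$' | '_'
        ;

    fragment IdentifierPart
        : IdentifierStartASCII
        | '0'..'9'
        // Non-ASCII continuation characters are accepted via a semantic
        // predicate plus matchAny() instead of huge Unicode character ranges.
        | { isIdentifierPartUnicode(input.LA(1)) }? { matchAny(); }
        ;

    Identifier
        : IdentifierStartASCII IdentifierPart*
        | { isIdentifierStartUnicode(input.LA(1)) }? { matchAny(); } IdentifierPart*
        ;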


> Note that with this approach, any non-ASCII keywords will lex as
> identifiers.


Yes.  I thought of that right away.  That can be dealt with by lexer
precedence rules.
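
For example (a sketch only, with made-up keyword spellings): when two lexer
rules can match the same text, ANTLR resolves the tie in favour of the rule
defined first, so the localized keyword rules just need to be listed before
the Identifier rule:

    SI   : 'si' ;           // e.g. a localized 'if'
    FUER : 'f\u00FCr' ;     // a keyword containing a non-ASCII character

    Identifier              // defined after the keywords, as sketched above
        : IdentifierStartASCII IdentifierPart*
        | { isIdentifierStartUnicode(input.LA(1)) }? { matchAny(); } IdentifierPart*
        ;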

Thank you very much again for your insight. I'm trying to get my users to
accept internationalized identifiers and labels, and to keep the keywords,
numerics, and tuple/list identifiers fixed to the English versions.

Carlos