[antlr-interest] case sensitivity for ANTLR v3 lexers

Tue May 16 12:00:41 PDT 2006

Ter:

Just my 2 cents, FWIW...

I don't think Antlr should concern itself with any of this.  Keep things as
simple as possible, and only do exact ordinal comparisons of strings.

Anyone who needs a case-insensitive lexer should implement it by overriding
an appropriate method in the lexer and using whatever comparison suits their
particular needs.  No single upper- or lower-casing function is correct for
every locale, so why even go down that road?  Keep the core of Antlr simple,
and just provide extensibility points (with samples) where appropriate.
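
As a rough illustration of such an extensibility point, here is a minimal
C++ sketch.  BasicLexer, sameChar() and CaseInsensitiveLexer are made-up
names for this example, not part of any ANTLR runtime, and std::towlower is
only one possible comparison (its result depends on the current C locale).

```cpp
#include <cstddef>
#include <cwctype>
#include <string>
#include <utility>

// Hypothetical lexer core (NOT ANTLR's API): it only does exact ordinal
// comparison, but routes every comparison through one virtual hook.
class BasicLexer {
public:
    explicit BasicLexer(std::wstring input) : input_(std::move(input)) {}
    virtual ~BasicLexer() = default;

    // Try to match `literal` at the current position; consume it on success.
    // The input itself is never modified, so token text keeps its original case.
    bool match(const std::wstring& literal) {
        if (input_.size() - pos_ < literal.size()) return false;
        for (std::size_t i = 0; i < literal.size(); ++i) {
            if (!sameChar(input_[pos_ + i], literal[i])) return false;
        }
        pos_ += literal.size();
        return true;
    }

    std::size_t position() const { return pos_; }

protected:
    // Extensibility point: the default is an exact ordinal comparison.
    virtual bool sameChar(wchar_t a, wchar_t b) const { return a == b; }

private:
    std::wstring input_;
    std::size_t pos_ = 0;
};

// User-side subclass: picks whatever comparison fits the application.
class CaseInsensitiveLexer : public BasicLexer {
public:
    using BasicLexer::BasicLexer;

protected:
    bool sameChar(wchar_t a, wchar_t b) const override {
        return std::towlower(a) == std::towlower(b);
    }
};
```

With this shape the core never has to know about locales; an application
that needs different folding rules overrides sameChar() with its own.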

This is probably a codegen issue more than a core Antlr issue, but one of
the biggest frustrations for me in Antlr 2.x is that the whole thing assumes
8-bit characters and strings.  There are hard-coded references to string,
stream, char, LPSTR, cout, etc. throughout the generated code as well as the
runtime code.  These should be defines or typedefs, so generating a Unicode
parser (UTF-16) would be as simple as doing something like '#define
ANTLR_STRING wstring', '#define ANTLR_CHAR wchar_t', and so on.
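
A minimal sketch of that idea, using typedefs rather than bare #defines for
the types.  ANTLR_CHAR, ANTLR_STRING, ANTLR_TEXT and ANTLR_USE_WIDE_CHARS
are hypothetical names; Antlr 2.x does not define them.  Note also that
wchar_t only holds UTF-16 code units on platforms where it is 16 bits wide
(Windows, for example).

```cpp
#include <cstddef>
#include <string>

// Hypothetical configuration switch: define ANTLR_USE_WIDE_CHARS before
// this point to build everything against wchar_t instead of char.
#ifdef ANTLR_USE_WIDE_CHARS
typedef wchar_t ANTLR_CHAR;
#define ANTLR_TEXT(s) L##s   // wide literal: ANTLR_TEXT("x") becomes L"x"
#else
typedef char ANTLR_CHAR;
#define ANTLR_TEXT(s) s      // narrow literal, passed through unchanged
#endif

// The string type follows the character type automatically.
typedef std::basic_string<ANTLR_CHAR> ANTLR_STRING;

// Runtime or generated code written against the aliases compiles
// unchanged for either character width.
inline std::size_t countChar(const ANTLR_STRING& text, ANTLR_CHAR c) {
    std::size_t n = 0;
    for (ANTLR_CHAR ch : text) {
        if (ch == c) ++n;
    }
    return n;
}
```

Literals in generated code would then be wrapped in ANTLR_TEXT(...) so they
pick up the right width as well.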

Another problem is the hard-coded English (ANSI) strings in error messages
and the hard-coded references to cout.  Please abstract these so that they
can be overridden: error messages could then be localized, and output could
go somewhere other than an ANSI console.  It's a big world out there, and
modern applications need to support Unicode and easy localization.

Don

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org 
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Terence Parr
> Sent: Tuesday, May 16, 2006 2:28 PM
> To: ANTLR Interest
> Subject: Re: [antlr-interest] case sensitivity for ANTLR v3 lexers
> 
> 
> On May 16, 2006, at 10:58 AM, Terence Parr wrote:
> 
> >
> > On May 16, 2006, at 10:50 AM, Martin Probst wrote:
> >
> >>> Soon we will need case insensitive lexing for v3.  I am hoping to 
> >>> leave the input stream stuff alone and just subclass Lexer as 
> >>> CaseInsensitiveLexer, which overrides match() methods.  Then alter
> >>> code gen for char set matching (because it's generated inline).
> >>>
> >>> The tokens would have the unmolested input chars.
> >>>
> >>> Does this sound right?
> >>
> >> No idea, but did you think about internationalization issues? I mean,
> >> in European languages there is a clear, defined concept of upper case
> >> and lower case. However I think there are some asian languages etc
> >> where this is not exactly true, and
> >> java.lang.String#equalsIgnoreCase() doesn't get it right as far as I
> >> know. Maybe provide an overridable (ouch) method of some kind?
> >
> > If I override match(char c) so that it uses Character.toUpperCase() or
> > whatever, it should be ok I think.
> 
> We should also probably let people set the locale for the 
> uppercasing, right?
> 
> Ter
>