[antlr-interest] case sensitivity for ANTLR v3 lexers

Wed May 24 13:02:28 PDT 2006

I believe that the ICU package is about as close as can be on that
issue. But in general, covering the simpler languages that Java handles
naturally would be enough. Very special cases can be hand crafted by the
implementer, but most cases are not like this.

The C runtime for ANTLR3 uses UTF32 internally and expects any character
stream to provide its own conversion from the original input character
set (the input itself is preserved in its original format, however). I
have provided latin-1 and will provide a few other input streams for
people to use as templates for anything else. Typedefs are used for
pointers so that you can specifically say if the source is 8-bit
characters, UTF32 etc. (ANTLR3_foobar and so on).
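
To make the stream idea above concrete, here is a minimal sketch, written in Java purely for illustration. It is not the C runtime's code: the names CodePointStream and Latin1Stream are hypothetical. It assumes only what is described above, namely that the stream hands out 32-bit character values while the original bytes stay untouched.

```java
import java.nio.charset.StandardCharsets;

class Latin1StreamDemo {
    /** Hypothetical interface: one 32-bit code point per input position. */
    interface CodePointStream {
        int size();
        int codePointAt(int index);
    }

    /** Hypothetical latin-1 stream: converts on demand, keeps the raw bytes. */
    static final class Latin1Stream implements CodePointStream {
        private final byte[] raw; // original input, never rewritten

        Latin1Stream(byte[] raw) { this.raw = raw; }

        public int size() { return raw.length; }

        // Latin-1 byte values map directly onto U+0000..U+00FF,
        // so widening the unsigned byte is the whole conversion.
        public int codePointAt(int index) { return raw[index] & 0xFF; }

        byte[] originalBytes() { return raw; }
    }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        Latin1Stream stream = new Latin1Stream(bytes);
        System.out.println(Integer.toHexString(stream.codePointAt(3))); // e9
    }
}
```

A stream for another encoding would implement the same interface with its own decoding step, which is the "template" role mentioned above.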

Also, don't forget that though we may want a case insensitive match, the
original case of the matching text is probably what is required for the
token.
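
A small Java sketch of that point: the comparison is done on case-folded copies of the characters, but the token text is sliced from the untouched input. The Token class and matchKeyword helper are hypothetical names for illustration, not ANTLR API.

```java
import java.util.Locale;

class CaseInsensitiveMatchDemo {
    /** Hypothetical token: text is kept exactly as it appeared in the input. */
    static final class Token {
        final String type;
        final String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /**
     * Tries to match keyword at position pos, ignoring case.
     * Returns a token carrying the original input slice, or null on mismatch.
     */
    static Token matchKeyword(String input, int pos, String keyword) {
        int end = pos + keyword.length();
        if (end > input.length()) {
            return null;
        }
        for (int i = 0; i < keyword.length(); i++) {
            // Fold only the values being compared; the input is not modified.
            char in = Character.toUpperCase(input.charAt(pos + i));
            char kw = Character.toUpperCase(keyword.charAt(i));
            if (in != kw) {
                return null;
            }
        }
        return new Token(keyword.toUpperCase(Locale.ROOT), input.substring(pos, end));
    }

    public static void main(String[] args) {
        Token t = matchKeyword("SeLeCt x", 0, "select");
        System.out.println(t.type + " -> " + t.text); // SELECT -> SeLeCt
    }
}
```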

I have not finished error handling for the C runtime yet, but it will be
easy to override the output methods and so on, since they will receive an
error message code which you can use to do whatever you like.
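
As a rough illustration of that design (in Java rather than C, and with made-up names such as ErrorCode and reportError), the recognizer passes an error code to one overridable method, and the application decides what to do with it:

```java
import java.util.ArrayList;
import java.util.List;

class ErrorReportingDemo {
    /** Hypothetical error codes; a real runtime would define its own set. */
    enum ErrorCode { NO_VIABLE_ALTERNATIVE, MISMATCHED_CHAR }

    /** Base recognizer with a single overridable output method. */
    static class Recognizer {
        // Default behaviour: plain message on the console.
        void reportError(ErrorCode code, int line, int column) {
            System.err.println("line " + line + ":" + column + " error " + code);
        }

        // Stand-in for real lexing; it only simulates one failure.
        void lex() {
            reportError(ErrorCode.MISMATCHED_CHAR, 1, 5);
        }
    }

    /** Application subclass: collects errors instead of printing them. */
    static class CollectingRecognizer extends Recognizer {
        final List<String> messages = new ArrayList<>();

        @Override
        void reportError(ErrorCode code, int line, int column) {
            // The code could equally be used as a key into localized messages.
            messages.add(code + "@" + line + ":" + column);
        }
    }

    public static void main(String[] args) {
        CollectingRecognizer r = new CollectingRecognizer();
        r.lex();
        System.out.println(r.messages); // [MISMATCHED_CHAR@1:5]
    }
}
```

Because the runtime passes a code rather than a finished English string, the override is free to localize the text or send it somewhere other than a console.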

Jim

-----Original Message-----
From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-

> If you need a case-insensitive lexer, the end user should implement that by
overriding an appropriate method in the lexer and use whatever comparison is
appropriate for their particular needs.  There is no universal upper or
lower-casing function that is appropriate for every possible locale, so why
even go down that road?   Keep the core of Antlr simple, and just provide
extensibility points (with samples) where appropriate.

This is probably a codegen issue more than a core Antlr issue, but one of
the biggest frustrations for me in Antlr 2.x is that the whole thing assumes
8-bit characters and strings.  There are hard-coded references to string,
stream, char, LPSTR, cout, etc. throughout the generated code as well as the
runtime code.  These should be defines or typedefs, so generating a Unicode
parser (UTF-16) would be as simple as doing something like '#define
ANTLR_STRING wstring', '#define ANTLR_CHAR wchar_t', and so on.

Another problem is the various hard-coded ANSI, English strings in error
messages, and hard coded references to cout.  Please abstract anything like
this so that it can be overridden, so error messages can be localized, and
other output mechanisms can be used other than an ANSI console.  It's a big
world out there and modern applications today need to support Unicode and
easy localization.

Don

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org 
> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Terence Parr
> Sent: Tuesday, May 16, 2006 2:28 PM
> To: ANTLR Interest
> Subject: Re: [antlr-interest] case sensitivity for ANTLR v3 lexers
> 
> 
> On May 16, 2006, at 10:58 AM, Terence Parr wrote:
> 
> >
> > On May 16, 2006, at 10:50 AM, Martin Probst wrote:
> >
> >>> Soon we will need case insensitive lexing for v3.  I am hoping to 
> >>> leave the input stream stuff alone and just subclass Lexer as 
> >>> CaseInsensitiveLexer, which overrides match() methods.  Then alter
> >>> code gen for char set matching (because it's generated inline).
> >>>
> >>> The tokens would have the unmolested input chars.
> >>>
> >>> Does this sound right?
> >>
> >> No idea, but did you think about internationalization issues? I mean,
> >> in European languages there is a clear, defined concept of upper case
> >> and lower case. However I think there are some Asian languages etc
> >> where this is not exactly true, and
> >> java.lang.String#equalsIgnoreCase() doesn't get it right as far as I
> >> know. Maybe provide an overridable (ouch) method of some kind?
> >
> > If I override match(char c) so that it uses Character.toUpperCase() or
> > whatever, it should be ok I think.
> 
> We should also probably let people set the locale for the 
> uppercasing, right?
> 
> Ter
>
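
The locale question raised in the quoted thread can be sketched in Java as follows. This is only an illustration under stated assumptions: matchIgnoreCase and upper are hypothetical helpers, not ANTLR methods. Note that Character.toUpperCase() takes no locale, so the sketch goes through String.toUpperCase(Locale) instead. Turkish is used because its casing rules for 'i' differ from the locale-neutral ones.

```java
import java.util.Locale;

class LocaleCaseFolding {
    /** Hypothetical helper: upper-cases one code point under the given locale. */
    static String upper(int codePoint, Locale locale) {
        return new String(Character.toChars(codePoint)).toUpperCase(locale);
    }

    /** True if input and expected are equal once both are upper-cased. */
    static boolean matchIgnoreCase(int input, int expected, Locale locale) {
        return upper(input, locale).equals(upper(expected, locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Locale-neutral rules: 'i' upper-cases to 'I'.
        System.out.println(matchIgnoreCase('i', 'I', Locale.ROOT)); // true
        // Turkish rules: 'i' upper-cases to U+0130 (dotted capital I).
        System.out.println(matchIgnoreCase('i', 'I', turkish));     // false
    }
}
```

A lexer subclass could call such a helper from its overridden match() with whatever locale the application sets, while the token keeps the original input characters.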