[antlr-interest] case sensitivity for ANTLR v3 lexers

Tue May 16 12:53:18 PDT 2006

Terence Parr wrote:
> 
> On May 16, 2006, at 10:50 AM, Martin Probst wrote:
> 
>>> Soon we will need case insensitive lexing for v3. I am hoping to
>>> leave the input stream stuff alone and just subclass Lexer as
>>> CaseInsensitiveLexer, which overrides match() methods. Then alter
>>> code gen for char set matching (because it's generated inline).
>>>
>>> The tokens would have the unmolested input chars.
>>>
>>> Does this sound right?
>>
>>
>> No idea, but did you think about internationalization issues? I mean,
>> in European languages there is a clear, defined concept of upper case
>> and lower case. However I think there are some Asian languages etc
>> where this is not exactly true, and
>> java.lang.String#equalsIgnoreCase() doesn't get it right as far as I
>> know. Maybe provide an overridable (ouch) method of some kind?
> 
> 
> If I override match(char c) so that it uses Character.toUpperCase() or
> whatever, it should be ok I think.

IIRC, Unicode itself defines the case mappings, so *if* (and it's a big
if) the version of Java in use supports them correctly, then
toUpperCase() with an appropriate locale should work nicely. Just a very
big if.
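
To make the "big if" concrete, here is a rough sketch of a
per-character comparison of the kind a match(char c) override might
use. CaseFold and sameIgnoringCase are made-up names for illustration,
not anything in ANTLR. Character.toUpperCase(char) uses the Unicode
simple mappings and ignores the locale, whereas String.toUpperCase(Locale)
is locale-sensitive and can change the string length, which is where
the surprises tend to come from.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of ANTLR.
final class CaseFold {
    // Compares two chars ignoring case, one char at a time.
    // Character.toUpperCase/toLowerCase do not consult the default locale.
    static boolean sameIgnoringCase(char a, char b) {
        if (a == b) {
            return true;
        }
        char ua = Character.toUpperCase(a);
        char ub = Character.toUpperCase(b);
        // Also compare lower-case forms: some characters only pair up
        // in one direction (e.g. U+0130 lower-cases to 'i').
        return ua == ub
            || Character.toLowerCase(ua) == Character.toLowerCase(ub);
    }

    public static void main(String[] args) {
        System.out.println(sameIgnoringCase('a', 'A'));

        // Locale-sensitive String mapping: Turkish 'i' upper-cases to
        // dotted capital I (U+0130), not 'I'.
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println("i".toUpperCase(turkish).equals("\u0130"));

        // One-to-many mapping: sharp s is unchanged per char, but the
        // String version expands it to "SS".
        System.out.println(Character.toUpperCase('\u00DF') == '\u00DF');
        System.out.println("\u00DF".toUpperCase(Locale.ROOT));
    }
}
```

The point of keeping the comparison per character is that the input
itself is never rewritten, so the token text stays as typed.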

And to avoid posting twice, I think case-insensitivity is really a good
idea, as it's not unusual to have to work from very imprecise docs.
I've certainly been there before. Plus it's not inconceivable
that people will use ANTLR for some very loose grammars and parsing
tasks and want to match "Abstract:", "ABSTRACT:" or "abstract:", for
example - another case from my own experience, although I wasn't using
ANTLR or anything similar at the time, being young and foolish.
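
For the "Abstract:" case, a minimal sketch in plain Java (no ANTLR
involved) might look like the following. KeywordMatch and matchKeyword
are hypothetical names. It relies on String.regionMatches with
ignoreCase set to true, and returns the text exactly as it appeared in
the input, in the spirit of keeping the token chars unmolested.

```java
// Hypothetical helper for illustration only; not part of ANTLR.
final class KeywordMatch {
    // Returns the matched prefix exactly as written in the input,
    // or null if the input does not start with the keyword
    // (compared ignoring case).
    static String matchKeyword(String input, String keyword) {
        // regionMatches returns false if the input is too short,
        // so the substring call below is always in range.
        if (input.regionMatches(true, 0, keyword, 0, keyword.length())) {
            return input.substring(0, keyword.length());
        }
        return null;
    }

    public static void main(String[] args) {
        String[] samples = { "Abstract: one", "ABSTRACT: two", "abstract: three" };
        for (String s : samples) {
            System.out.println(matchKeyword(s, "abstract:"));
        }
    }
}
```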

Sam