[antlr-interest] New Guy Question...

Wed Jun 8 22:44:07 PDT 2011

Note that matching in terms of UPPER case is generally a bad idea. Some languages have characters that never appear at the start of a word. Because upper case has come to be used primarily to mark the start of words in selected contexts, such characters need not have a proper mapping to upper case. The German ß is the best-known such character among languages with Latin-based character sets (its full upper-case mapping is the two-letter "SS"), but it is not the only example. However, if a language has a notion of case, there is always a mapping to lower case, and for simple case folding that is to be preferred.
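
To make the point concrete, here is a minimal sketch in Python rather than ANTLR. It is not ANTLR's API: the class FoldedLookaheadStream, its la() and match() methods, and the helper simple_lower() are hypothetical names invented for illustration. The stream folds only its lookahead to lower case and leaves the stored text alone, so matched keywords keep their original spelling.

```python
# Hypothetical sketch: none of these names come from ANTLR.

def simple_lower(ch):
    """Lower-case one character; keep it unchanged if the mapping is not one-to-one."""
    folded = ch.lower()
    return folded if len(folded) == 1 else ch


class FoldedLookaheadStream:
    """Character stream whose lookahead is lower-cased while the stored text is untouched."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def la(self, i=1):
        """Return the folded character i positions ahead, or '' at end of input."""
        index = self.pos + i - 1
        return simple_lower(self.text[index]) if index < len(self.text) else ""

    def match(self, keyword):
        """Match a lower-case keyword; return the original spelling, or None if no match."""
        for offset, expected in enumerate(keyword, start=1):
            if self.la(offset) != expected:
                return None
        start = self.pos
        self.pos += len(keyword)
        return self.text[start:self.pos]


# Upper-casing the sharp s changes the length of the text; lower-casing does not.
assert "ß".upper() == "SS"
assert "ß".lower() == "ß"

stream = FoldedLookaheadStream("SeLeCt name")
print(stream.match("select"))  # SeLeCt
```

With this arrangement the keywords would be written in lower case in the lexer, and a character such as ß passes through the lookahead unchanged instead of expanding into two letters.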

In many ways the problem of dealing with case is similar to the problem of dealing with normalization, where the same character can be represented by more than one combination of code points. As part of its work on normalization, the Unicode Consortium recommended a couple of straightforward means of dealing with identifier uniqueness in programming languages. These are covered in "Unicode Standard Annex #31, Unicode Identifier and Pattern Syntax":
http://www.unicode.org/reports/tr31/
These have a straightforward implementation in terms of the Unicode character property tables, and it is a small matter of programming to implement their lexical classes for identifiers.
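
As a rough illustration of how little code this takes, here is a sketch in Python. It relies on two assumptions: that Python's own str.isidentifier() (which follows the XID_Start / XID_Continue properties, with the underscore added) is an acceptable stand-in for the annex's identifier classes, and that NFKC followed by case folding is a reasonable approximation of the comparison key the annex describes. The function names is_identifier and identifier_key are made up for this example.

```python
import unicodedata


def is_identifier(name):
    """Check a name against Python's identifier rule (XID_Start / XID_Continue plus '_')."""
    return name.isidentifier()


def identifier_key(name, case_insensitive=False):
    """Return a key such that equivalent spellings of an identifier compare equal."""
    key = unicodedata.normalize("NFKC", name)
    if case_insensitive:
        # Fold case, then normalize again because folding can undo normalization.
        key = unicodedata.normalize("NFKC", key.casefold())
    return key


composed = "caf\u00e9"      # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent
print(composed == decomposed)                                   # False
print(identifier_key(composed) == identifier_key(decomposed))   # True
print(identifier_key("Straße", True) == identifier_key("STRASSE", True))  # True
```

A symbol table would then store identifier_key(name) as the lookup key and keep the original spelling for error messages.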

On Jun 6, 2011, at 4:56 PM, Jim Idle wrote:

> No, that is not correct, please look at the WIKI article. The input stream
> merely MATCHES in upper case, it does NOT change the input stream itself,
> hence both the keywords and anything else are case preserved when you ask
> for their text; that is the whole point of doing it that way. Then you
> specify the tokens in the lexer using upper case only and it has the side
> effect of simplifying the lexer rules as well as not creating a method
> call to match every letter of every keyword (which is a bad idea even with
> JIT inlining).
> 
> Jim
> 
>> -----Original Message-----
>> From: antlr-interest-bounces at antlr.org
>> [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Douglas Godfrey
>> Sent: Monday, June 06, 2011 12:41 PM
>> To: Marco Hunsicker
>> Cc: antlr-interest at antlr.org
>> Subject: Re: [antlr-interest] New Guy Question...
>> 
>> When you implement case insensitive keywords, you may still want case
>> sensitive identifiers.
>> If the input stream does case folding, you can't use case sensitive
>> identifiers.
>> 
>> On Sun, Jun 5, 2011 at 5:58 PM, Marco Hunsicker <devel at hunsicker.de>
>> wrote:
>> 
>>>> You have to handle case insensitivity the hard way:
>>>> 
>>>> fragment A
>>>>     :    'A' | 'a';
>>>> 
>>>> [...]
>>> 
>>> I don't think it's a necessity to do it this way. Actually, I think it
>>> would be better using a specialized input stream that does any
>>> necessary transformation. Your mileage may vary ;)
>>> 
>>> Cheers,
>>> 
>>> Marco
>>> 
>>> List: http://www.antlr.org/mailman/listinfo/antlr-interest
>>> Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address