[antlr-interest] Case sensitivity warnings

Sun Mar 28 11:32:52 PST 2004

Hi all,

As you can see, I'm working heavily with ANTLR. While trying to build a grammar for Unicode-encoded files I came across a general problem that might particularly interest Terence et al.

As you surely know, Unicode input may contain a lot of different characters from many different cultures. Many Unicode code points, however, are not yet assigned, even in the BMP. A correct alphabet therefore cannot simply be '\u0000'..'\ufffe'; it is rather complex and looks like this in my grammar:

protected
  UNICODE_LETTER:
      '\u0041' .. '\u005A' | '\u0061' .. '\u007A' | '\u00AA' .. '\u00AA' | '\u00B5' .. '\u00B5' |
      '\u00BA' .. '\u00BA' | '\u00C0' .. '\u00D6' | '\u00D8' .. '\u00F6' | '\u00F8' .. '\u01F5' |
      '\u01FA' .. '\u0217' | '\u0250' .. '\u02A8' | '\u02B0' .. '\u02B8' | '\u02BB' .. '\u02C1' |
      '\u02D0' .. '\u02D1' | '\u02E0' .. '\u02E4' | '\u037A' .. '\u037A' | '\u0386' .. '\u0386' |
      '\u0388' .. '\u038A' | '\u038C' .. '\u038C' | '\u038E' .. '\u03A1' | '\u03A3' .. '\u03CE' |
      '\u03D0' .. '\u03D6' | '\u03DA' .. '\u03DA' | '\u03DC' .. '\u03DC' | '\u03DE' .. '\u03DE' |
      '\u03E0' .. '\u03E0' | '\u03E2' .. '\u03F3' | '\u0401' .. '\u040C' | '\u040E' .. '\u044F' |
...
      '\uFFC2' .. '\uFFC7' | '\uFFCA' .. '\uFFCF' | '\uFFD2' .. '\uFFD7' | '\uFFDA' .. '\uFFDC';
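
For illustration only (this is not how the rule above was produced): a minimal Python sketch that derives such ranges from the Unicode character database instead of typing them by hand. The names letter_ranges and as_antlr_alternatives are hypothetical, and "letter" is assumed to mean any general category starting with "L". The result depends on the Unicode version shipped with the Python interpreter, so it will not match the table above exactly.

```python
import unicodedata


def letter_ranges(first, last):
    """Return inclusive (start, end) code point ranges of letters in [first, last]."""
    ranges = []
    start = None
    for cp in range(first, last + 1):
        # Assumption: a "letter" is any code point in category Lu, Ll, Lt, Lm or Lo.
        is_letter = unicodedata.category(chr(cp)).startswith("L")
        if is_letter and start is None:
            start = cp                      # a new run of letters begins
        elif not is_letter and start is not None:
            ranges.append((start, cp - 1))  # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, last))        # close a run that reaches the upper bound
    return ranges


def as_antlr_alternatives(ranges):
    """Format ranges in the style of the lexer rule above."""
    return " | ".join("'\\u%04X' .. '\\u%04X'" % r for r in ranges)


if __name__ == "__main__":
    # Latin-1 letters: U+00D7 (multiplication sign) and U+00F7 (division sign)
    # are not letters, which is why the rule has gaps at those positions.
    print(as_antlr_alternatives(letter_ranges(0x00C0, 0x00FF)))
```

Running it over the whole BMP (0x0000 to 0xFFFF) gives a list of alternatives that can be pasted into a protected rule, subject to the caveats above.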

This rule, however, triggers the warning about using only lower case characters when case sensitivity is switched off. I think it is time to abolish this message and the check behind it. For the ASCII range of letters it is easy to determine whether a rule includes both upper and lower case letters. Things become much more difficult, however, once you consider all Unicode characters and their culturally correct casing, and sometimes the mapping is not even one-to-one. Consider for instance the German letter ß (sharp s), which has no upper case equivalent, only the alternative 'ss'. Similar cases exist for Greek and many other languages (keyword: case folding).
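
To make the difficulty concrete, here is a small Python sketch (Python is used only for illustration; the helper name case_partner is hypothetical). It looks for a single character that maps back and forth with the given one under Python's built-in upper() and lower(), which is roughly what a naive upper/lower check assumes always exists.

```python
def case_partner(ch):
    """Return the single character that round-trips with ch under case mapping, or None."""
    for mapped, back in ((ch.upper(), str.lower), (ch.lower(), str.upper)):
        # A usable partner must differ from ch, be exactly one character long,
        # and map back to ch again.
        if mapped != ch and len(mapped) == 1 and back(mapped) == ch:
            return mapped
    return None


if __name__ == "__main__":
    for ch in ("a", "\u00df", "\u03c3", "\u03c2", "\u00b5"):
        print("U+%04X" % ord(ch), repr(case_partner(ch)))
```

With Python's default mappings, 'a' pairs with 'A' and Greek σ with Σ, while ß (upper-cases to the two letters 'SS'), final sigma ς (Σ lower-cases to σ, not back to ς) and the micro sign µ from the rule above have no such partner.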

You *could* implement case checking correctly, e.g. by using IBM's ICU library, but I think the result is not worth the effort (at least not just for issuing the warning message). In the long run, though, you won't get around using this or a similar library if you want correct Unicode support with case sensitivity switched off, because you need it to case-fold the input characters in a culturally correct way.
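
As a rough sketch of what case-folding the input means, the following Python snippet compares per-character lower-casing with full case folding. It uses Python's built-in str.casefold() (Python 3.3 or later) as a stand-in for a library such as ICU; this is the default, locale-independent folding only, so culture-specific rules (Turkish dotless i, for example) are not covered. The function names are hypothetical.

```python
def caseless_equal_simple(a, b):
    """Compare after plain lower-casing, as a simple lexer might do."""
    return a.lower() == b.lower()


def caseless_equal_folded(a, b):
    """Compare after full Unicode case folding."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    a, b = "Stra\u00dfe", "STRASSE"
    print(caseless_equal_simple(a, b))   # lower() keeps the sharp s, so the strings differ
    print(caseless_equal_folded(a, b))   # casefold() turns the sharp s into 'ss'
    # Folding can change the length of the input, which matters for a lexer
    # that folds one character at a time.
    print(len("\u00df"), len("\u00df".casefold()))
```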

Mike
--
www.soft-gems.net
