[antlr-interest] Unicode handling

Wed Apr 21 15:08:41 PDT 2004

My project's source files are Unicode, and we are using Antlr to 
generate the lexer, parser and compiler in C++.

It seems from the doc that Antlr isn't really ready to deal with the full 
complement of Unicode characters.  I found references to problems with 
EOF (integer -1, typecast to 0xFFFF as a character), problems with 
character sets (getting very large), and it seems to assume that 
Unicode characters are only 16 bits (which is no longer true).
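
To make the EOF problem concrete, here is a minimal C++ sketch. It assumes a 16-bit character type (char16_t is used purely for illustration) and a hypothetical kEof sentinel; neither is taken from Antlr's generated code. It only shows why an integer -1 cannot be told apart from U+FFFF once it is narrowed to 16 bits.

```cpp
// Hypothetical EOF sentinel for illustration; not copied from Antlr's sources.
constexpr int kEof = -1;

// Narrow an int "character or EOF" value to a 16-bit character type.
constexpr char16_t narrow_to_16(int c) {
    return static_cast<char16_t>(c);
}

// After narrowing, EOF is indistinguishable from the code point U+FFFF,
// so a character set that contains U+FFFF would also match end of input.
static_assert(narrow_to_16(kEof) == static_cast<char16_t>(0xFFFF),
              "EOF collides with U+FFFF once narrowed to 16 bits");
```

Keeping the character type wider than the largest code point (or, as proposed here, working on bytes, where -1 collides with nothing in 0x00..0xFF if it is held in an int) avoids the collision.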

So, rather than try to work around or fix these problems, I intend to 
make my tool chain work with UTF-8 encoded source.  (This is especially 
easy for us, since the process feeding the source stream already 
normalizes the incoming character set to UTF-8.)

Instead of parsing Unicode:

NAME_START_CHAR:
     ':' | 'A'..'Z' | '_' | 'a'..'z'
     | '\u00C0'..'\u00D6'
     | '\u00D8'..'\u00F6'
     | '\u00F8'..'\u02FF'
     | '\u0370'..'\u037D'
     | '\u037F'..'\u1FFF'
     | '\u200C'..'\u200D'
     | '\u2070'..'\u218F'
     | '\u2C00'..'\u2FEF'
     | '\u3001'..'\uD7FF'
     | '\uF900'..'\uFDCF'
     | '\uFDF0'..'\uFFFD'
     | '\u10000'..'\uEFFFF'	// won't work in Antlr as it can't handle these Unicode chars
     ;

We'd be parsing the UTF-8 encoded version of these characters:

NAME_START_CHAR:
     ':' | 'A'..'Z' | '_' | 'a'..'z'
     | '\u00C3' '\u0080'..'\u0096'           // '\u00C0'-'\u00D6'
     | '\u00C3' '\u0098'..'\u00B6'           // '\u00D8'-'\u00F6'
     | '\u00C3' '\u00B8'..'\u00BF'           // '\u00F8'-'\u00FF'
     | '\u00C4'..'\u00CB' '\u0080'..'\u00BF' // '\u0100'-'\u02FF'
     | '\u00CD' '\u00B0'..'\u00BD'           // '\u0370'-'\u037D'
     | '\u00CD' '\u00BF'                     // '\u037F'
     | '\u00CE'..'\u00DF' '\u0080'..'\u00BF' // '\u0380'-'\u07FF'
     | '\u00E0' '\u00A0'..'\u00BF' '\u0080'..'\u00BF'    // '\u0800'-'\u0FFF'
     | '\u00E1' '\u0080'..'\u00BF' '\u0080'..'\u00BF'    // '\u1000'-'\u1FFF'
     ... and so on ...
     ;
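
The byte values in each alternative can be derived mechanically from the code point ranges. The following is a small C++ sketch of a standard UTF-8 encoder for working out or checking the endpoints; utf8_encode is a hypothetical helper name, not part of Antlr or of our tool chain.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point (U+0000..U+10FFFF, excluding surrogates)
// as UTF-8. Returns an empty string for values that cannot be encoded.
// utf8_encode is a hypothetical helper name used only for this sketch.
inline std::string utf8_encode(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                                   // 1 byte: US-ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                           // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                         // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates: reject
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                       // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, utf8_encode(0x00C0) gives the bytes C3 80 and utf8_encode(0x00D6) gives C3 96, the endpoints of the first non-ASCII alternative. utf8_encode(0x10000) gives F0 90 80 80, so the characters beyond 16 bits simply become four-byte alternatives.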

Does anyone see any pitfalls to this other than increasing the look 
ahead for the lexer?  Since in our source language all the meaningful 
punctuation is in the visible US-ASCII range, the only place where the 
difference between parsing Unicode characters and UTF-8 encoded Unicode 
characters would matter is in things like the NAME token production.
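
On the look ahead question, a rough C++ sketch, assuming the input is well-formed UTF-8: the lead byte alone fixes the length of the sequence, so one character never needs more than four bytes. utf8_sequence_length is a hypothetical helper, not an Antlr function, and how this maps onto Antlr's own look ahead setting is not something this sketch verifies.

```cpp
// Number of bytes in the UTF-8 sequence introduced by `lead`, or 0 if
// `lead` cannot start a sequence (a continuation byte or an invalid value).
// utf8_sequence_length is a hypothetical helper used only for this sketch.
constexpr int utf8_sequence_length(unsigned char lead) {
    return lead < 0x80 ? 1      // 0xxxxxxx: US-ASCII
         : lead < 0xC2 ? 0      // continuation byte, or overlong lead C0/C1
         : lead < 0xE0 ? 2      // 110xxxxx
         : lead < 0xF0 ? 3      // 1110xxxx
         : lead < 0xF5 ? 4      // 11110xxx, up to U+10FFFF
         : 0;                   // F5..FF never appear in UTF-8
}
```

Every byte of a multi-byte sequence is 0x80 or above, so US-ASCII punctuation can never show up in the middle of an encoded character.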

This seems much better to me than extending the C++ support 
with some Unicode library (like IBM's ICU or some such).

- Mark

Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/
