[antlr-interest] Unicode handling

Thu Apr 22 14:39:11 PDT 2004

On Apr 22, 2004, at 1:28 PM, Mike Lischke wrote:
> But for Unicode there is a general character class to identify whether 
> a character belongs to an identifier in the programming language sense 
> (http://www.unicode.org/reports/tr31/).
That report, and the Unicode identifier properties it describes, are 
really meant as starting points.  The character classes it defines 
could be predefined in (or imported into) an Antlr grammar, but, as the 
report itself suggests, each language would then need to compose its 
own NAME_START and NAME_CHAR rules from them.  Alas, since Antlr 
doesn't have set exclusion (e.g. "NAME_CHAR: 
UNICODE_IDENTIFIER_EXTEND - UNICODE_NUMBER;"), it might be hard to 
compose what one wants.
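
To make the set-exclusion idea concrete, here is a rough sketch in Python 
rather than Antlr.  The names is_identifier_extend, is_number and 
is_name_char are made up for illustration, and the category lists are 
only my approximation of the classes in the report, not its exact 
definitions:

```python
import unicodedata

# Assumption: approximate "identifier extend" with general categories
# (letters, letter numbers, marks, decimal digits, connector punctuation).
IDENTIFIER_EXTEND_CATEGORIES = {
    "Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc",
}


def is_identifier_extend(ch):
    """Hypothetical stand-in for UNICODE_IDENTIFIER_EXTEND."""
    return unicodedata.category(ch) in IDENTIFIER_EXTEND_CATEGORIES


def is_number(ch):
    """Hypothetical stand-in for UNICODE_NUMBER: any N* category."""
    return unicodedata.category(ch).startswith("N")


def is_name_char(ch):
    """NAME_CHAR = UNICODE_IDENTIFIER_EXTEND - UNICODE_NUMBER."""
    return is_identifier_extend(ch) and not is_number(ch)
```

With predicates, exclusion is just "and not".  In a grammar one would 
presumably want the lexer generator to do the same subtraction on the 
character sets themselves when it builds its tables.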

> Providing some predefined lexer rules (in the native lexer language) 
> for such standard things like identifiers and numbers would simplify 
> life for the grammar writer and would improve performance because 
> there can be native access in the lexer to test characters for their 
> class.
I like the idea of either a) making a UnicodeLexer that one can 
inherit from, with these rules already in it, or b) adding some sort of 
syntax for selecting sets of characters based on Unicode properties.  
Something like U<Alphabetic> or U<Hex_Digit>, which would match any 
character that has the named property.
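
As a rough sketch of what option b) might resolve to, here is some 
Python that turns a U<...> token into a character test.  The U<...> 
syntax is only my suggestion above, and property_matcher and PROPERTIES 
are made-up names.  Python's str.isalpha covers only the letter 
categories, so it is an approximation of Alphabetic, and the Hex_Digit 
ranges are written out by hand:

```python
import re

# Hex_Digit: ASCII 0-9, A-F, a-f plus their fullwidth forms.
HEX_DIGIT_RANGES = [
    (0x0030, 0x0039), (0x0041, 0x0046), (0x0061, 0x0066),
    (0xFF10, 0xFF19), (0xFF21, 0xFF26), (0xFF41, 0xFF46),
]


def _is_hex_digit(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HEX_DIGIT_RANGES)


# Hypothetical table from property name to a per-character test.
PROPERTIES = {
    "Alphabetic": str.isalpha,  # approximation: letter categories only
    "Hex_Digit": _is_hex_digit,
}

_TOKEN = re.compile(r"U<(\w+)>")


def property_matcher(token):
    """Return the character test named by a token such as 'U<Hex_Digit>'."""
    m = _TOKEN.fullmatch(token)
    if m is None or m.group(1) not in PROPERTIES:
        raise ValueError("unknown property token: " + token)
    return PROPERTIES[m.group(1)]
```

For example, property_matcher("U<Hex_Digit>") gives a test that accepts 
"f" and rejects "g".  A real implementation would presumably load the 
full tables from the Unicode Character Database rather than 
hand-written ranges.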

	- Mark

Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/
