[antlr-interest] Unicode handling
Mark Lentczner
markl at glyphic.com
Thu Apr 22 14:39:11 PDT 2004
On Apr 22, 2004, at 1:28 PM, Mike Lischke wrote:
> But for Unicode there is a general character class to identify whether
> a character belongs to an identifier in the programming language sense
> (http://www.unicode.org/reports/tr31/).
That report, and the Unicode identifier properties it defines, are
really meant as starting points. The various character classes it
creates could be predefined in (or imported into) an Antlr grammar,
but, as the report suggests, specific languages would then need to
compose their own NAME_START and NAME_CHAR rules from them. Alas,
since Antlr doesn't have set exclusion (e.g.: "NAME_CHAR:
UNICODE_IDENTIFIER_EXTEND - UNICODE_NUMBER;"), it might be hard to
compose what one wants.
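
To make the set-exclusion idea concrete, here is a minimal sketch in Python
rather than Antlr. The function names mirror the rule names above, but the
category choices are assumptions made for illustration; they are not the
definitions from TR31, and nothing here is Antlr syntax.

```python
import unicodedata

# Hypothetical stand-in for UNICODE_IDENTIFIER_EXTEND. The category list
# is an assumption for this sketch, not the TR31 definition.
def is_identifier_extend(ch: str) -> bool:
    # Letters, letter numbers, combining marks, decimal digits,
    # and connector punctuation (e.g. underscore).
    return unicodedata.category(ch) in {
        "Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc",
    }

# Hypothetical stand-in for UNICODE_NUMBER: any "N*" general category.
def is_number(ch: str) -> bool:
    return unicodedata.category(ch).startswith("N")

# Set exclusion expressed as a predicate:
# NAME_CHAR = UNICODE_IDENTIFIER_EXTEND - UNICODE_NUMBER
def is_name_char(ch: str) -> bool:
    return is_identifier_extend(ch) and not is_number(ch)
```

With predicates, exclusion is just "in A and not in B", which is the
composition step a grammar without a set-difference operator makes awkward.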
> Providing some predefined lexer rules (in the native lexer language)
> for such standard things like identifiers and numbers would simplify
> life for the grammar writer and would improve performance because
> there can be native access in the lexer to test characters for their
> class.
I like the idea of either a) making a UnicodeLexer that one can
inherit from, which would have these rules in it, or b) adding some
sort of syntax for selecting sets of characters based on the Unicode
properties. Something like U<Alphabetic> or U<Hex_Digit>, which would
match all characters that have the property.
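
A rough sketch of option (b), again in Python and purely illustrative: the
`U` function, the `PROPERTIES` table, and `match_run` are hypothetical names,
not anything Antlr provides. Hex_Digit is listed as the ASCII hex digits plus
their fullwidth forms; Alphabetic is only approximated here by general
category, since Python's unicodedata module does not expose that property
directly.

```python
import unicodedata

# Hex_Digit: ASCII hex digits plus their fullwidth counterparts.
_HEX_RANGES = [
    (0x0030, 0x0039), (0x0041, 0x0046), (0x0061, 0x0066),
    (0xFF10, 0xFF19), (0xFF21, 0xFF26), (0xFF41, 0xFF46),
]

def _hex_digit(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _HEX_RANGES)

def _alphabetic(ch: str) -> bool:
    # Approximation (assumption): letters and letter numbers only. The
    # real Alphabetic property also covers Other_Alphabetic characters.
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Hypothetical registry mapping property names to predicates.
PROPERTIES = {"Hex_Digit": _hex_digit, "Alphabetic": _alphabetic}

def U(name: str):
    """Hypothetical analogue of U<name>: look up a character predicate."""
    return PROPERTIES[name]

def match_run(pred, text: str, start: int = 0) -> str:
    """Return the longest run from `start` whose characters satisfy pred."""
    end = start
    while end < len(text) and pred(text[end]):
        end += 1
    return text[start:end]
```

For example, `match_run(U("Hex_Digit"), "1aFg")` consumes "1aF" and stops at
"g". A UnicodeLexer base class, option (a), could expose the same predicates
as inherited rules instead of new syntax.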
- Mark
Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/