[antlr-interest] Unicode handling

Thu Apr 22 13:28:11 PDT 2004

> Were there only a general consensus on what these sets should 
> be.  One could pre-define (or allow to be imported) classes 
> based on the Unicode properties lists.  (And indeed, these 
> could perhaps be internally optimized to use some Unicode 
> library rather than tests/switches/bitsets.)  Even in our
> project, where we want a great deal of conformance with XML,
> we have to very carefully choose our Identifier grammar and
> can't just use the Name production from XML (1.0 or 1.1).

My impression is that most of the time there *is* agreement about what constitutes an identifier. Those who need a
different definition can still spell it out the classic way in the lexer. For everyone else, Unicode already defines a
general character class that tells whether a character can be part of an identifier in the programming-language sense
(http://www.unicode.org/reports/tr31/). Numbers are a bit more difficult, because the Western (actually Arabic) digits
are not enough to write a number in every script, so you would need character classes here and certainly in other
cases too. Providing some predefined lexer rules (written in the lexer's native language) for standard things such as
identifiers and numbers would simplify life for the grammar writer, and would improve performance because the lexer
could then test a character's class natively.
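
To make that a bit more concrete, here is a rough Java sketch of what such native class tests might look like. It is
only an illustration and not anything ANTLR provides: the class and method names are made up, it leans on the JDK's
Character methods (which only approximate the TR31 identifier definition), and the number parser ignores overflow.

```java
// Hypothetical helper class; none of these names come from ANTLR.
final class UnicodeClassSketch {

    // True if s is non-empty, starts with an identifier-start character
    // and continues with identifier-part characters, as judged by the JDK.
    static boolean isIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isUnicodeIdentifierStart(first)) {
            return false;
        }
        // Walk by code point so characters outside the BMP are handled too.
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isUnicodeIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Parses a run of decimal digits from any script.
    // Character.digit maps every Unicode decimal digit to its value 0..9
    // and returns -1 for anything else. Overflow is not checked here.
    static long parseDecimal(String s) {
        if (s.isEmpty()) {
            throw new NumberFormatException("empty input");
        }
        long value = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int d = Character.digit(cp, 10);
            if (d < 0) {
                throw new NumberFormatException("not a decimal digit at index " + i);
            }
            value = value * 10 + d;
            i += Character.charCount(cp);
        }
        return value;
    }
}
```

With this, UnicodeClassSketch.isIdentifier accepts names written with non-ASCII letters and rejects one that starts
with a digit, and UnicodeClassSketch.parseDecimal reads Arabic-Indic or Devanagari digits the same way as "42". A
predefined lexer rule could delegate to tests like these instead of expanding into large bitsets.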

Mike
--
www.soft-gems.net
