[antlr-interest] Unicode handling
Mark Lentczner
markl at glyphic.com
Thu Apr 22 13:11:55 PDT 2004
> Hmm, handling of UTF-8 is more expensive when you need to process it
> (case folding, composition etc.) so this transformation format is not
> recommended to be used that way.
Well, that decision depends on a large number of factors and
varies greatly between environments. For my current project,
UTF-8 works very well as an internal format for Unicode character
strings. Of course, the external view the system presents is
completely Unicode clean and makes no reference to the in-memory
encoding.
> In the case of antlr though it seems to fit well except that defining
> identifiers might be a bit unintuitive.
This could be automated in Antlr itself. See my suggestion in my post
of a few minutes ago. Of course, I'm not sure how many people using
Antlr and C++ would want to take the approach of specifying a lexer over
Unicode, but generating a lexer that takes UTF-8. On the other hand, given
the problems of handling Unicode in C++ (given the lack of a standard
way to do it, and the difficulty of incorporating external Unicode
libraries), perhaps this is how most people would want to do it. For
me, I see no practical alternative.
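
To make the translation step concrete: a generator that accepts a grammar written over Unicode but emits a lexer over UTF-8 would have to turn each code point in the grammar into the byte sequence the generated lexer matches. The following is only a sketch of that one step, not anything Antlr does today; the name encode_utf8 is made up for the illustration.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Antlr): encode one Unicode code point
// as UTF-8. Returns an empty string for surrogates and for values above
// U+10FFFF, neither of which is a valid character.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogate code points are rejected
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

A range of code points in the grammar generally becomes several alternatives of byte ranges in the generated lexer, since the range can cross the 1-, 2-, 3- and 4-byte boundaries; the sketch covers only the single-character case.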
> For antlr, though, it would be great if there were some more generic
> support. Identifier start and middle chars, numbers/digits etc. could
> be predefined instead to have them to declare them over and over again
> in lexer grammars.
If only there were a general consensus on what these sets should be. One
could pre-define (or allow to be imported) classes based on the Unicode
property lists. (And indeed, these could perhaps be internally
optimized to use some Unicode library rather than
tests/switches/bitsets.) Even in our project, where we want a great
deal of conformance with XML, we have to very carefully choose our
Identifier grammar and can't just use the Name production from XML (1.0
or 1.1).
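
As a rough sketch of what a predefined class might boil down to when it is not backed by a Unicode library: a sorted table of code point ranges searched with a binary search. The names CodePointRange, kIdentStart and is_ident_start are invented for this example, and the table is a tiny illustrative subset (ASCII letters, underscore, Latin-1 letters), not a real Unicode property list and not a recommendation for what an identifier should be.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

// Hypothetical types and names, for illustration only.
struct CodePointRange {
    std::uint32_t first;
    std::uint32_t last;  // inclusive
};

// Illustrative subset only. Must stay sorted and non-overlapping,
// because the lookup below relies on binary search.
static const CodePointRange kIdentStart[] = {
    {0x0041, 0x005A},  // A-Z
    {0x005F, 0x005F},  // _
    {0x0061, 0x007A},  // a-z
    {0x00C0, 0x00D6},  // Latin-1 letters before the multiplication sign
    {0x00D8, 0x00F6},  // Latin-1 letters between U+00D7 and U+00F7
    {0x00F8, 0x00FF},  // Latin-1 letters after the division sign
};

bool is_ident_start(std::uint32_t cp) {
    const CodePointRange* begin = std::begin(kIdentStart);
    const CodePointRange* end = std::end(kIdentStart);
    // Find the first range whose upper bound is not below cp.
    const CodePointRange* it = std::lower_bound(
        begin, end, cp,
        [](const CodePointRange& r, std::uint32_t value) {
            return r.last < value;
        });
    // cp is in the class only if that range also starts at or before it.
    return it != end && it->first <= cp;
}
```

Each project would supply its own table, which is exactly where the lack of consensus shows up: the lookup is the same for everyone, the ranges are not.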
- Mark