[antlr-interest] Unicode handling

Mark Lentczner markl at glyphic.com
Thu Apr 22 13:11:55 PDT 2004


> Hmm, handling of UTF-8 is more expensive when you need to process it 
> (case folding, composition etc.) so this transformation format is not 
> recommended to be used that way.
Well, that decision depends on quite a large number of factors and 
varies greatly between environments.  For my current project, 
UTF-8 works very well as an internal format for Unicode character 
strings.  Of course, the external view the system presents is 
completely Unicode clean and makes no reference to the in-memory 
encoding.
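
To make that concrete, here is a minimal sketch of the idea: strings are 
stored as UTF-8 bytes, and callers only ever see code points.  The 
function name decode_utf8 is hypothetical (it is not from my project or 
from Antlr), and the error handling is deliberately simplistic: a bad 
byte becomes U+FFFD, and overlong forms and surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: turn UTF-8 bytes (the in-memory format) into
// Unicode code points (the external view).
// Assumption: malformed input yields U+FFFD for each offending byte;
// overlong encodings and surrogates are NOT checked in this sketch.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte determines the sequence length and the first payload bits.
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray continuation or invalid lead byte

        if (i + len > bytes.size()) { out.push_back(0xFFFD); ++i; continue; }  // truncated sequence

        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }  // continuation bytes must be 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this, the five bytes of "a", U+00E9 and U+20AC decode to three code 
points, and nothing outside the function needs to know the storage is 
UTF-8.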

> In the case of antlr though it seems to fit well except that defining 
> identifiers might be a bit unintuitive.
This could be automated in Antlr itself.  See my suggestion in my post 
of a few minutes ago.  Of course, I'm not sure how many people using 
Antlr and C++ would want to take the approach of specifying a lexer over 
Unicode, but generating a lexer that takes UTF-8.  On the other hand, given 
the problems of handling Unicode in C++ (given the lack of a standard 
way to do it, and the difficulty of incorporating external Unicode 
libraries), perhaps this is how most people would want to do it.  For 
me, I see no practical alternative.
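
As a rough illustration of the translation step such a generator would 
need, the sketch below maps one Unicode code point from the grammar to 
the UTF-8 byte sequence the generated lexer would actually match.  The 
name encode_utf8 is hypothetical; this is not Antlr code, and a real 
generator would also have to split code point ranges into byte-level 
alternatives, which is not shown here.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumption: surrogates (U+D800..U+DFFF) and values above U+10FFFF
// are rejected by returning an empty string.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return out;  // surrogates are not valid scalar values
    }
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;  // empty if cp was out of range
}
```

So a grammar literal for U+20AC would become a match on the three bytes 
E2 82 AC in the generated lexer, while the grammar author only ever 
writes the code point.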

> For antlr, though, it would be great if there were some more generic 
> support. Identifier start and middle chars, numbers/digits etc. could 
> be predefined instead to have them to declare them over and over again 
> in lexer grammars.
If only there were a general consensus on what these sets should be.  One 
could pre-define (or allow to be imported) classes based on the Unicode 
property lists.  (And indeed, these could perhaps be internally 
optimized to use some Unicode library rather than 
tests/switches/bitsets.)  Even in our project, where we want a great 
deal of conformance with XML, we have to very carefully choose our 
Identifier grammar and can't just use the Name production from XML (1.0 
or 1.1).
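
For what it's worth, a predefined class could be as simple as a sorted 
range table with a binary search, roughly as sketched below.  The names 
(CodePointRange, kIdentStart, is_ident_start) are hypothetical, and the 
table is only an illustrative fragment covering ASCII letters, the 
underscore, and the Latin-1 letters.  It is not the Unicode property 
data, the XML Name production, or the set our project uses.

```cpp
#include <algorithm>
#include <iterator>

// Inclusive range of code points.
struct CodePointRange {
    char32_t first;
    char32_t last;
};

// ILLUSTRATIVE table only: must stay sorted and non-overlapping.
// A real table would be generated from whichever definition of
// "identifier start" a project settles on.
static const CodePointRange kIdentStart[] = {
    {0x0041, 0x005A},  // A-Z
    {0x005F, 0x005F},  // underscore
    {0x0061, 0x007A},  // a-z
    {0x00C0, 0x00D6},  // Latin-1 letters, skipping U+00D7 (multiplication sign)
    {0x00D8, 0x00F6},  // Latin-1 letters, skipping U+00F7 (division sign)
    {0x00F8, 0x00FF},
};

// Binary search: find the first range whose upper bound is >= cp,
// then check that cp is not below that range's lower bound.
bool is_ident_start(char32_t cp) {
    const CodePointRange* begin = std::begin(kIdentStart);
    const CodePointRange* end = std::end(kIdentStart);
    const CodePointRange* it = std::lower_bound(
        begin, end, cp,
        [](const CodePointRange& r, char32_t value) { return r.last < value; });
    return it != end && it->first <= cp;
}
```

Swapping in a different table changes the identifier definition without 
touching the lookup, which is about as far as a shared predefined class 
can go while people disagree on the sets themselves.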

- Mark



 