[antlr-interest] Lexing C-style strings - problems matching characters not in vocab

Sat Feb 25 17:52:06 PST 2006

I recently had to convert IBM EBCDIC character sets to
Java's Unicode. I implemented a
java.nio.charset.spi.CharsetProvider that used
Unicode's Character Mapping Markup Language (CharMapML)
to map characters (see
http://www.unicode.org/unicode/reports/tr22/).
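
For anyone who hasn't written one, the rough shape of such a provider
is sketched below. This is not my actual code, just a minimal
decode-only illustration. The class names, the charset name
"X-SKETCH-EBCDIC" and the static decode() helper are made up for the
example, and only four byte values (taken from EBCDIC code page 037)
are filled in; a real table would be loaded from a CharMapML file.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.spi.CharsetProvider;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

/** Hypothetical provider exposing one table-driven, decode-only charset. */
final class SketchCharsetProvider extends CharsetProvider {

    /** Made-up charset name used only by this sketch. */
    static final String NAME = "X-SKETCH-EBCDIC";

    private final Charset charset = new TableCharset();

    @Override
    public Charset charsetForName(String name) {
        // Returning null tells the caller this provider does not know the name.
        return NAME.equalsIgnoreCase(name) ? charset : null;
    }

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(charset).iterator();
    }

    /** Convenience helper (an assumption of this sketch, not part of the SPI). */
    static String decode(byte[] bytes) {
        Charset cs = new SketchCharsetProvider().charsetForName(NAME);
        return cs.decode(ByteBuffer.wrap(bytes)).toString();
    }

    static final class TableCharset extends Charset {
        // One entry per byte value; unmapped bytes decode to U+FFFD.
        private static final char[] TABLE = new char[256];
        static {
            Arrays.fill(TABLE, '\uFFFD');
            TABLE[0x40] = ' ';   // code page 037: space
            TABLE[0xC1] = 'A';   // code page 037: A
            TABLE[0xC8] = 'H';   // code page 037: H
            TABLE[0xC9] = 'I';   // code page 037: I
        }

        TableCharset() {
            super(NAME, null);   // no aliases
        }

        @Override
        public boolean contains(Charset cs) {
            return cs.equals(this);
        }

        @Override
        public boolean canEncode() {
            return false;        // this sketch only decodes
        }

        @Override
        public CharsetEncoder newEncoder() {
            throw new UnsupportedOperationException("decode-only sketch");
        }

        @Override
        public CharsetDecoder newDecoder() {
            // One byte in, one char out.
            return new CharsetDecoder(this, 1.0f, 1.0f) {
                @Override
                protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                    while (in.hasRemaining()) {
                        if (!out.hasRemaining()) {
                            return CoderResult.OVERFLOW;
                        }
                        out.put(TABLE[in.get() & 0xFF]);
                    }
                    return CoderResult.UNDERFLOW;
                }
            };
        }
    }

    public static void main(String[] args) {
        byte[] ebcdic = { (byte) 0xC8, (byte) 0xC9 };
        System.out.println(decode(ebcdic)); // prints "HI"
    }
}
```

To make the charset visible to Charset.forName(), the provider class
would need to be public and listed in a
META-INF/services/java.nio.charset.spi.CharsetProvider file on the
classpath.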

If this helps, I'd be happy to post code.

Regards,
Jeff

--- Martin Probst <mail at martin-probst.com> wrote:

> Hi,
> 
> > I don't think my STRING rule will match characters such as £, ©, ¼
> > and so on. What do I do about this? Add them explicitly to the
> > expression? I can't go through the entire Unicode spec adding every
> > character to my rule - it would be huge.
> >
> > I looked at Scanning Unicode Characters in the docs, but this only
> > refers to 16-bit Unicode characters - what do I do for characters
> > outside this arbitrary limit?
> 
> Well, to Java, 16-bit is all of Unicode. In Java, characters outside
> the 16-bit range are represented by surrogate pairs, as far as I know.
> As someone else said, you can use the charVocabulary option to include
> more characters. If you use C++, it's going to be more difficult though.
> 
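
To illustrate the surrogate-pair point, here is a small sketch (the
class name SurrogateDemo and the choice of U+1D11E are just for the
example). It shows that a single code point above U+FFFF occupies two
Java chars, which is what a lexer working on 16-bit chars would see.

```java
/** Illustrative only: how Java stores a code point outside the 16-bit range. */
class SurrogateDemo {

    /** Counts Unicode code points rather than UTF-16 code units. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF does not fit in a single 16-bit char.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());                              // prints 2
        System.out.println(codePointCount(clef));                       // prints 1
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));  // prints true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));   // prints true
    }
}
```
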
> Writing down all Unicode characters actually isn't that horrible. You
> can use character ranges and end up with something sensible. I once ran
> into a problem, though, when entering the character ranges from the XML
> standard - the ANTLR-generated arrays got too big for the Java compiler
> and class file format :-(
> 
> Martin
> 
> 

=========
Jeff Barnes
(206)245-6100

Few things are impossible to diligence and skill.
--- Samuel Johnson (Rasselas Chap. xii.)