[antlr-interest] Lexing C-style strings - problems matching characters not in vocab
Jeff Barnes
jbarnesweb at yahoo.com
Sat Feb 25 17:52:06 PST 2006
I recently had to convert text in IBM EBCDIC character sets to
Java's Unicode strings. I implemented a
java.nio.charset.spi.CharsetProvider that used Unicode's Character
Mapping Markup Language (CharMapML) to map characters (see
http://www.unicode.org/unicode/reports/tr22/).
If this helps, I'd be happy to post code.
Regards,
Jeff
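
For readers who want to see the shape of this approach, here is a
minimal sketch of a table-driven charset exposed through a
CharsetProvider. It is not the code offered above. The charset name
X-EBCDIC-SKETCH, the class names, and the tiny hand-written table
(space, A-Z and 0-9 at their usual EBCDIC code points) are assumptions
for illustration; a real table would be generated from a CharMapML
file. Unmapped bytes are simply turned into U+FFFD, and encoding is
left out.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.spi.CharsetProvider;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

class EbcdicSketch {
    static final String NAME = "X-EBCDIC-SKETCH";   // hypothetical charset name

    // Byte-to-char table; a real one would be generated from a CharMapML file.
    static final char[] TABLE = new char[256];
    static {
        Arrays.fill(TABLE, (char) 0xFFFD);          // unmapped bytes become U+FFFD
        TABLE[0x40] = ' ';
        for (int i = 0; i < 9; i++) {
            TABLE[0xC1 + i] = (char) ('A' + i);     // A..I
            TABLE[0xD1 + i] = (char) ('J' + i);     // J..R
        }
        for (int i = 0; i < 8; i++)  TABLE[0xE2 + i] = (char) ('S' + i);  // S..Z
        for (int i = 0; i < 10; i++) TABLE[0xF0 + i] = (char) ('0' + i);  // 0..9
    }

    static class SketchCharset extends Charset {
        SketchCharset() { super(NAME, null); }      // no aliases
        public boolean contains(Charset cs) { return cs instanceof SketchCharset; }
        public CharsetDecoder newDecoder() {
            // One byte in, one char out.
            return new CharsetDecoder(this, 1.0f, 1.0f) {
                protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                    while (in.hasRemaining()) {
                        if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                        out.put(TABLE[in.get() & 0xFF]);
                    }
                    return CoderResult.UNDERFLOW;
                }
            };
        }
        public CharsetEncoder newEncoder() {
            throw new UnsupportedOperationException("decode-only sketch");
        }
    }

    static class Provider extends CharsetProvider {
        public Iterator<Charset> charsets() {
            return Collections.<Charset>singletonList(new SketchCharset()).iterator();
        }
        public Charset charsetForName(String name) {
            return NAME.equalsIgnoreCase(name) ? new SketchCharset() : null;
        }
    }

    // Looks the charset up through the provider directly, so the sketch
    // runs without any service registration.
    static String decode(byte[] bytes) {
        Charset cs = new Provider().charsetForName(NAME);
        return cs.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] ebcdic = { (byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3, (byte) 0xD6 };
        System.out.println(decode(ebcdic));         // the bytes spell HELLO
    }
}
```

To make Charset.forName() find such a provider, its class name would
be listed in a META-INF/services/java.nio.charset.spi.CharsetProvider
file on the classpath.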
--- Martin Probst <mail at martin-probst.com> wrote:
> Hi,
>
> > I don't think my STRING rule will match characters such as £, ©, ¼
> > and so on. What do I do about this? Add them explicitly to the
> > expression? I can't go through the entire Unicode specs adding every
> > character to my rule - it would be huge.
> >
> > I looked at Scanning Unicode Characters in the docs, but this only
> > refers to 16-bit Unicode characters - what do I do for characters
> > outside this arbitrary limit?
>
> Well, to Java, 16 bits is all of Unicode: a Java char is 16 bits
> wide, and characters outside of the 16-bit range are represented by
> surrogate pairs, as far as I know. As someone else said, you can use
> the charVocabulary option to include more characters. If you use
> C++, it's going to be more difficult though.
>
> Writing down all Unicode characters actually isn't that horrible.
> You can use character ranges and end up with something sensible. I
> once ran into a problem though when entering the character ranges
> from the XML standard - the ANTLR-generated arrays got too big for
> the Java compiler and the class file format :-(
>
> Martin
>
>
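
To illustrate the two points in the quoted reply, here is a small Java
sketch (not ANTLR grammar code). It shows that a character outside the
16-bit range occupies two char units (a surrogate pair), and that
"writing down" a large character set as ranges stays compact. The
ranges are the Char production from XML 1.0; the class and method
names are made up for this example.

```java
class XmlCharRanges {
    // Char production from XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Walks the string by code point, so a surrogate pair is tested
    // once as a whole character rather than as two separate chars.
    static boolean allXmlChars(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXmlChar(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies outside the 16-bit range.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // char units
        System.out.println(clef.codePointCount(0, clef.length()));  // characters
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));
        // Pound sign, copyright sign and one-quarter all fit in a single char.
        System.out.println(allXmlChars("\u00A3\u00A9\u00BC" + clef));
    }
}
```

Note that £, © and ¼ (U+00A3, U+00A9, U+00BC) are well inside the
16-bit range, so only characters above U+FFFF need the surrogate
handling shown here.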
=========
Jeff Barnes
(206)245-6100
Few things are impossible to diligence and skill.
--- Samuel Johnson (Rasselas Chap. xii.)