[antlr-interest] Unicode character classes
David Holroyd
dave at badgers-in-foil.co.uk
Sun Mar 4 05:34:04 PST 2007
On Sun, Mar 04, 2007 at 02:15:42PM +0100, Johannes Luber wrote:
> is it possible to specify Unicode character classes like Zs, Lu, Ll, Lt,
> Lm, Lo, Nl and others without having to spell out every
> single character (as
> http://www.fileformat.info/info/unicode/category/Lu/list.htm shows, many
> characters aren't in a range)? If not, this would be a useful addition
> to ANTLR. In that regard, it seems one can't create arbitrary sets of
> tokens and exclude other arbitrary tokens from them. Or does someone
> know a way?
I want to have a 'Unicode identifier'[1] as required by the ECMAScript
spec, and had thought I might write some code that uses ICU[2] to find all
the characters in the relevant Unicode classes and mechanically build
the ANTLR token definitions.
I haven't got around to implementing that yet though...
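For what it's worth, here is a rough sketch of what such a generator might
look like. It is only an illustration under several assumptions: it uses
Python's standard unicodedata module as a stand-in for ICU (so the results
follow whatever Unicode version that Python build ships), it stops at U+FFFF,
and the helper names category_ranges and antlr_fragment, as well as the rule
name UNICODE_LETTER, are made up for this example. The output is meant to be
ANTLR 3 style fragment-rule text built from '\uXXXX' literals and '..' ranges.

```python
import unicodedata


def category_ranges(categories, max_code_point=0xFFFF):
    """Collect contiguous (first, last) code point ranges whose Unicode
    general category is in `categories`. Limited to the BMP by default
    (an assumption of this sketch)."""
    ranges = []
    start = None
    for cp in range(max_code_point + 1):
        in_class = unicodedata.category(chr(cp)) in categories
        if in_class and start is None:
            start = cp                      # a new run begins here
        elif not in_class and start is not None:
            ranges.append((start, cp - 1))  # the run ended at the previous cp
            start = None
    if start is not None:                   # run extends to the upper limit
        ranges.append((start, max_code_point))
    return ranges


def antlr_fragment(name, categories):
    """Render the ranges as the text of a fragment rule (hypothetical helper)."""
    def lit(cp):
        return "'\\u%04X'" % cp

    alts = []
    for first, last in category_ranges(categories):
        if first == last:
            alts.append(lit(first))
        else:
            alts.append("%s..%s" % (lit(first), lit(last)))
    return "fragment %s\n    : %s\n    ;" % (name, "\n    | ".join(alts))


if __name__ == "__main__":
    # The letter categories mentioned in the question above.
    print(antlr_fragment("UNICODE_LETTER", {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}))
```

Merging adjacent code points into ranges keeps the generated rule far shorter
than listing every character, which is the main point of generating it
mechanically. The printed text would still need to be pasted into (or
included from) the grammar by hand.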
[1] http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf Sect 5.15
[2] http://icu.sourceforge.net/
--
http://david.holroyd.me.uk/