[antlr-interest] Unicode character classes

David Holroyd dave at badgers-in-foil.co.uk
Sun Mar 4 05:34:04 PST 2007


On Sun, Mar 04, 2007 at 02:15:42PM +0100, Johannes Luber wrote:
> Is it possible to specify Unicode character classes like Zs, Lu, Ll, Lt,
> Lm, Lo, Nl and others without having to spell out every single
> character (as
> http://www.fileformat.info/info/unicode/category/Lu/list.htm shows, many
> characters aren't in a range)? If not, this would be a useful addition
> to ANTLR. In that regard, it seems one can't create arbitrary sets of
> tokens and exclude other arbitrary tokens from them. Or does someone
> know a way?

I want to have a 'Unicode identifier'[1] as required by the ECMAScript
spec, and had thought I might write some code that uses ICU[2] to find all
the characters in the relevant Unicode classes and mechanically build
the ANTLR token definitions.

I haven't got around to implementing that yet though...
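
For what it's worth, a minimal sketch of the mechanical generation might
look like the following. It is only an illustration under stated
assumptions: it uses Python's standard unicodedata module rather than ICU,
it covers only the Basic Multilingual Plane, and the helper names
(category_ranges, antlr_fragment) and the rule name UNICODE_LETTER are
made up for the example. The resulting ranges depend on the Unicode
version bundled with the Python interpreter.

```python
import unicodedata


def category_ranges(categories, limit=0x10000):
    """Return sorted (first, last) code point ranges below `limit`
    whose Unicode general category is in `categories`."""
    ranges = []
    start = prev = None
    for cp in range(limit):
        if unicodedata.category(chr(cp)) in categories:
            if start is None:
                start = cp          # open a new run
            prev = cp
        elif start is not None:
            ranges.append((start, prev))  # close the current run
            start = None
    if start is not None:
        ranges.append((start, prev))
    return ranges


def antlr_fragment(rule_name, categories):
    """Format the ranges as an ANTLR 3 style lexer fragment rule
    (hypothetical rule name supplied by the caller)."""
    alts = []
    for first, last in category_ranges(categories):
        if first == last:
            alts.append("'\\u%04X'" % first)
        else:
            alts.append("'\\u%04X'..'\\u%04X'" % (first, last))
    return "fragment %s\n    : %s\n    ;" % (rule_name, "\n    | ".join(alts))


if __name__ == "__main__":
    # Letter categories mentioned in the question above.
    print(antlr_fragment("UNICODE_LETTER", {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}))
```

Running the script prints one fragment rule whose alternatives are single
characters or '\uXXXX'..'\uXXXX' ranges, which could then be pasted into
(or included from) the lexer grammar. Merging adjacent code points into
ranges keeps the generated rule much shorter than listing every character.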


[1] http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf Sect 5.15
[2] http://icu.sourceforge.net/


-- 
http://david.holroyd.me.uk/

