[antlr-interest] Unicode in Lexer

Sun Aug 24 09:59:14 PDT 2008

I'm trying to build a lexer, and I'd like to include Unicode
characters.  For example, I'd like to be able to do

NUMERIC: INTEGER | FLOAT | ... | IRRATIONAL | INFINITE;

...

IRRATIONAL: PI | ...;

PI: 'π';   // greek lowercase pi: Unicode U+03C0

INFINITE: SIGN? '∞';  // the infinity symbol: Unicode U+221E

ANTLR gives me the following error when I try to build a lexer for that
last rule:
[12:45:36] MyLexer.g:35:17: expecting ''', found '∞'
 at org.antlr.tool.ANTLRLexer.nextToken(ANTLRLexer.java:321)
 ...

How do I represent Unicode characters in the lexer?
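
One possible workaround, sketched under an assumption rather than confirmed: if the tool is misreading the encoding of the grammar file, the literals could be written as Java-style Unicode escapes (a backslash, the letter u, and four hex digits) so that the grammar stays pure ASCII. Whether ANTLR accepts such escapes inside quoted literals is an assumption here. The class and method names below are made up for illustration; the code only prints the escape for each character mentioned in this post.

```java
public class UnicodeEscapes {
    // Builds the Java-style escape text for one BMP character.
    // (Hypothetical helper, not part of ANTLR.)
    static String escape(char c) {
        return String.format("\\u%04X", (int) c);
    }

    public static void main(String[] args) {
        // The characters are given as escapes so this source file stays ASCII:
        // pi, infinity, square root, e with acute accent.
        char[] symbols = { '\u03C0', '\u221E', '\u221A', '\u00E9' };
        for (char c : symbols) {
            System.out.println(escape(c));
        }
    }
}
```

If the assumption holds, the PI rule would use '\u03C0' as its literal and the INFINITE rule would use '\u221E'.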

In addition to these constants, I'd also like to be able to allow
arbitrary Unicode characters in certain rules.  For example, where
Java has something like:

ID: 'a'..'z' ('a'..'z' | 'A'..'Z' | '_')*;

I'd like to match any of the following (and infinitely many others):

'√' // square root: Unicode U+221A
'résumé'  // "resume" with properly accented e's: U+00E9
etc.

Has anyone built rules that allow arbitrary code points?  Or better
yet, some specific subsets?
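
To make the "specific subsets" idea concrete, here is a minimal sketch in plain Java (not ANTLR) of one way to decide which code points an identifier rule might admit. The policy is an assumption chosen only to cover the examples above: letters, the underscore, and characters in the Unicode math-symbol category. The names `IdentifierChars`, `isIdChar`, and `isIdentifier` are hypothetical.

```java
public class IdentifierChars {
    // Assumed policy: letters, '_', and Unicode math symbols count as identifier characters.
    static boolean isIdChar(int codePoint) {
        return Character.isLetter(codePoint)
            || codePoint == '_'
            || Character.getType(codePoint) == Character.MATH_SYMBOL;
    }

    // Walks the string by code point (not by char) so that characters
    // outside the BMP are examined as a single unit.
    static boolean isIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isIdChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

Under this policy both the square root sign and the accented word are accepted. One caveat: the math-symbol category also contains ASCII operators such as '+' and '=', so a real rule would probably need to exclude those explicitly.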

I found a link to http://www.antlr.org/doc/lexer.html#unicode, but
that page no longer exists.

Thanks,
Gaius