[antlr-interest] Unicode in Lexer
Gaius Centus Novus
gaius.c.novus at gmail.com
Sun Aug 24 09:59:14 PDT 2008
I'm trying to build a lexer, and I'd like to include Unicode
characters. For example, I'd like to be able to write:
NUMERIC: INTEGER | FLOAT | ... | IRRATIONAL | INFINITE;
IRRATIONAL: PI | ...;
PI: 'π'; // greek lowercase pi: Unicode U+03C0
INFINITE: SIGN? '∞'; // the infinity symbol: Unicode U+221E
ANTLR gives me the following error when I try to build a lexer from that grammar:
[12:45:36] MyLexer.g:35:17: expecting ''', found '∞'
How do I represent Unicode characters in the lexer?
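One form that may be worth trying, assuming ANTLR 3 accepts Java-style \uXXXX escapes inside character literals, is to spell the two symbols by code point instead of pasting them into the grammar file. This is only a sketch of the rules above, not a confirmed fix for the error, and the SIGN definition is a hypothetical stand-in because the original does not show it:

```antlr
// Sketch only: assumes Java-style \uXXXX escapes are accepted in literals.
PI       : '\u03C0' ;         // Greek lowercase pi, U+03C0
INFINITE : SIGN? '\u221E' ;   // infinity symbol, U+221E

// Hypothetical definition; SIGN is referenced but not shown in the original.
fragment SIGN : '+' | '-' ;
```

Writing the characters as escapes keeps the grammar file pure ASCII, so the result should not depend on the encoding the tool uses to read the file.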
In addition to these constants, I'd also like to be able to allow
arbitrary Unicode characters in certain rules. For example, where
Java has something like:
ID: 'a'..'z' ('a'..'z' | 'A'..'Z' | '_')*;
I'd like to match any of the following (and infinitely many others):
'√' // square root: Unicode U+221A
'résumé' // "résumé" with properly accented e's: Unicode U+00E9
Has anyone built rules that allow arbitrary code points? Or better
yet, some specific subsets?
I found a link to http://www.antlr.org/doc/lexer.html#unicode, but
that page no longer exists.
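For the identifier case, a minimal sketch of the kind of rule in question might use escaped character ranges. It assumes the same \uXXXX escape syntax also works on both sides of a '..' range. The rule names ID_START and ID_PART are hypothetical, and the ranges cover only the Latin-1 letters and the square-root sign from the examples above, not a complete Unicode letter set:

```antlr
// Sketch only: illustrative ranges, not a full Unicode identifier definition.
ID : ID_START ID_PART* ;

fragment ID_START
    : 'a'..'z'
    | 'A'..'Z'
    | '\u00C0'..'\u00D6'   // Latin-1 letters, stopping before U+00D7 (multiplication sign)
    | '\u00D8'..'\u00F6'   // includes U+00E9 (e with acute), stopping before U+00F7 (division sign)
    | '\u00F8'..'\u00FF'
    | '\u221A'             // square root
    ;

fragment ID_PART : ID_START | '_' ;
```

Under these assumptions, both '√' and 'résumé' would match ID, and further subsets could be added as extra ranges in ID_START.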