[antlr-interest] Unicode in Lexer

Sun Aug 24 10:57:43 PDT 2008

Gaius Centus Novus wrote:
> I'm trying to build a lexer, and I'd like to include Unicode
> characters.  For example, I'd like to be able to do
> 
> NUMERIC: INTEGER | FLOAT | ... | IRRATIONAL | INFINITE;
> 
> ...
> 
> IRRATIONAL: PI | ...;
> 
> PI: 'π';   // greek lowercase pi: Unicode U+03C0
> 
> INFINITE = SIGN? '∞';  // the infinity symbol: Unicode U+221E
> 
> Antlr gives me the following error on trying to build a lexer for that
> last rule:
> [12:45:36] MyLexer.g:35:17: expecting ''', found '∞'
>  at org.antlr.tool.ANTLRLexer.nextToken(ANTLRLexer.java:321)
>  ...
> 
> How do I represent unicode characters in the lexer?
> 
> In addition to these constants, I'd also like to be able to allow
> arbitrary Unicode characters in certain rules.  For example, where
> Java has something like:
> 
> ID: 'a'..'z' ('a'..'z' | 'A'..'Z' | '_')*
> 
> I'd like to match any of the following (and infinitely many others):
> 
> '√' // square root: Unicode U+221A
> 'résumé'  //resume with properly accented e's: U+00E9
> etc.
> 
> Has anyone built rules that allow arbitrary code points?  Or better
> yet, some specific subsets?
> 
> I found a link to http://www.antlr.org/doc/lexer.html#unicode, but
> that page no longer exists.
> 
> Thanks,
> Gaius

Well, so far the ANTLR tool itself reads grammar files only as 7-bit
ASCII, so it can't handle Unicode characters typed literally into the
grammar. The generated lexers, however, can recognize Unicode characters
natively when you specify them with the '\uXXXX' escape. There is no
direct support for characters beyond '\uFFFF' (code points U+10000 and
up), since a Java char is only 16 bits wide, but you can write rules
that scan for the surrogate char range and translate the pairs later
(if your target is e.g. C#).
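
As a rough, untested sketch of how the rules from the question might
look with escapes (ANTLR 3 syntax assumed; the names SQRT, LETTER,
SUPPLEMENTARY, HIGH_SURROGATE and LOW_SURROGATE are made up for
illustration, and the Latin-1 ranges are only one possible subset of
letters):

```antlr
PI       : '\u03C0' ;           // greek lowercase pi
INFINITE : SIGN? '\u221E' ;     // infinity symbol
SQRT     : '\u221A' ;           // square root sign

// Identifier that also accepts accented Latin-1 letters, e.g. résumé
ID : LETTER (LETTER | '_')* ;

fragment LETTER
    : 'a'..'z'
    | 'A'..'Z'
    | '\u00C0'..'\u00D6'        // skips U+00D7 (multiplication sign)
    | '\u00D8'..'\u00F6'        // skips U+00F7 (division sign)
    | '\u00F8'..'\u00FF'
    ;

fragment SIGN : '+' | '-' ;

// Characters above U+FFFF arrive as two 16-bit chars (a surrogate pair);
// match the pair here and combine it into one code point later.
SUPPLEMENTARY : HIGH_SURROGATE LOW_SURROGATE ;

fragment HIGH_SURROGATE : '\uD800'..'\uDBFF' ;
fragment LOW_SURROGATE  : '\uDC00'..'\uDFFF' ;
```

Widening LETTER with further '\uXXXX'..'\uXXXX' ranges is how you would
admit other specific subsets of code points.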

Johannes