[antlr-interest] About literal supports unicode

Sat Jun 13 08:21:12 PDT 2009

Ha Luong wrote:
> Dear all,
> 
> I tried to use the following grammar to accept a Unicode string:
> // modified from T.g in the example source of the ANTLR book
> grammar T;
> options {
>     language=Java;
> }
> @members {
> String s;
> }
> r : ID '#' {s = $ID.text; System.out.println("found "+s);} ;
> ID: ('a'..'z'|'\u00e0')+ ; //\u00e0
> WS: (' '|'\n'|'\r')+ {skip();} ; // ignore whitespace
> 
> and ran these commands in Cygwin:
> java org.antlr.Tool T.g
> javac *.java
> 
> If I test the literal 'a', it works:
> java Test
> a #
> ^Z
> found a
> 
> but with the literal 'à', I get an error:
> java Test
> à
> #
> ^Z
> line 1:0 no viable alternative at character 'à'
> line 2:0 missing ID at '#'
> found <missing ID>

The question that immediately occurs is whether your 'à' is actually a 
single precomposed U+00E0, or a U+0061 followed by a combining U+0300. 
Not sure whether this would make a difference (the two sequences are 
canonically equivalent in Unicode, but the standard seems a little 
foggy as to whether implementations should treat them as identical), 
but it is worth checking what the lexer actually receives.
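
One way to check is to dump the code points of the input. The following 
is only a rough sketch using the standard JDK (java.text.Normalizer, 
Java 6 or later); the class name CodePointDump and the helper describe() 
are made up for illustration and are not part of ANTLR or the book's 
example. It does not touch the grammar or the Test driver, it just shows 
how the two spellings of 'à' differ and how NFC normalization folds the 
decomposed one into U+00E0.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of ANTLR: prints the code points of a string.
class CodePointDump {
    // Format each code point of s as U+XXXX, separated by single spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u00e0";  // single code point
        String decomposed = "a\u0300";  // 'a' plus combining grave accent

        System.out.println(describe(precomposed));
        System.out.println(describe(decomposed));

        // The two strings look the same on screen but are not equal as char sequences.
        System.out.println(precomposed.equals(decomposed));

        // NFC composes the base letter and the combining mark into one code point.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(describe(nfc));
    }
}
```

If the text reaching the lexer turns out to be neither of these 
sequences, the terminal or input encoding would be the next thing to 
look at rather than the grammar.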

Sam