[antlr-interest] About literal supports unicode
Sam Barnett-Cormack
s.barnett-cormack at lancaster.ac.uk
Sat Jun 13 08:21:12 PDT 2009
Ha Luong wrote:
> Dear all,
>
> I tried to use the following grammar to accept a Unicode string:
> //modify T.g in the example source of ANTLR book
> grammar T;
> options {
> language=Java;
> }
> @members {
> String s;
> }
> r : ID '#' {s = $ID.text; System.out.println("found "+s);} ;
> ID: ('a'..'z'|'\u00e0')+ ; //\u00e0
> WS: (' '|'\n'|'\r')+ {skip();} ; // ignore whitespace
>
> and ran these commands in Cygwin:
> java org.antlr.Tool T.g
> javac *.java
>
> If I test the literal 'a', it is ok
> java Test
> a #
> ^Z
> found a
>
> but with the literal 'à', I get an error:
> java Test
> à
> #
> ^Z
> line 1:0 no viable alternative at character 'à'
> line 2:0 missing ID at '#'
> found <missing ID>
The question that immediately occurs is whether your 'à' is actually a
single U+00E0, or is it U+0061 followed by U+0300 (an 'a' plus a
combining grave accent)? Not sure whether this would make a
difference (the standard seems a little foggy as to whether
implementations should consider them identical), but it's a question
that comes immediately to mind.
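
One way to check which form you have is to dump the code points of the
input and, if needed, normalize it to the precomposed form before it
reaches the lexer. The following is only a rough sketch: the class name
CodePointCheck and its helper methods are made up for illustration and
are not part of ANTLR or of the Test harness from the book. It assumes
Java 6 or later for java.text.Normalizer.

```java
import java.text.Normalizer;

public class CodePointCheck {
    // Hypothetical helper: list each code point of s as U+XXXX.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Hypothetical helper: compose base letter + combining mark
    // into the single precomposed code point where one exists (NFC).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e0";   // single code point
        String decomposed = "a\u0300";   // 'a' + combining grave accent

        System.out.println(describe(precomposed));
        System.out.println(describe(decomposed));
        // The two strings differ until the second one is normalized.
        System.out.println(precomposed.equals(decomposed));
        System.out.println(precomposed.equals(toNfc(decomposed)));
    }
}
```

If describe() shows something other than either of these two forms for
the text you typed, the input is probably being decoded with a different
character encoding than you expect, which would be a separate thing to
look into.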
Sam