[antlr-interest] Parsing unicode specifications

Tue Jul 20 15:10:18 PDT 2010

Hi,

I've been trying to use the string grammar that ANTLRWorks provides as a
template, to let users specify Unicode and escaped characters in
ASCII-only input.

What I don't understand is that all the escape sequences simply end up in
the STRING token as ordinary characters, which is no different from not
bothering with the escape sequence and Unicode rules at all.

The grammar that ANTLRWorks generates if you tick the string box is this:

grammar testString;

string : STRING ;

STRING
    :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
    ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

But this much simpler grammar seems to produce identical results, a STRING
token that simply contains whatever is between the double quotes:

grammar testString2;

string : STRING ;

STRING
    :  '"' ~('"')* '"'  ;

So my question is: how do I actually do something with those escape and
Unicode fragments, so that I end up with a string containing the decoded
characters rather than the raw text that was between the double quotes?
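
To make the goal concrete, here is a rough sketch of the kind of
post-processing I imagine is needed, assuming the Java target. The class
and method names (StringUnescaper, unescape) are hypothetical and not part
of ANTLR. The sketch assumes the lexer has already validated the token
against the STRING rule above, so it does no error checking of its own.

```java
// Hypothetical helper, not part of ANTLR: decodes the raw text of a STRING
// token (as matched by the grammar above) into the characters it denotes.
final class StringUnescaper {
    private StringUnescaper() {}

    /** tokenText is the full token text, including both double quotes. */
    static String unescape(String tokenText) {
        // Drop the opening and closing quotes matched by the STRING rule.
        String body = tokenText.substring(1, tokenText.length() - 1);
        StringBuilder out = new StringBuilder(body.length());
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\') {
                out.append(c);          // ordinary character, copy as-is
                continue;
            }
            char e = body.charAt(i++);  // character following the backslash
            switch (e) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case '"': case '\'': case '\\': out.append(e); break;
                case 'u':
                    // UNICODE_ESC: exactly four hex digits follow the 'u'.
                    out.append((char) Integer.parseInt(body.substring(i, i + 4), 16));
                    i += 4;
                    break;
                default: {
                    // OCTAL_ESC: up to three digits if the first is 0..3,
                    // otherwise up to two. Assumes e is an octal digit.
                    int start = i - 1;
                    int maxLen = (e <= '3') ? 3 : 2;
                    int end = start;
                    while (end < body.length() && end - start < maxLen
                            && body.charAt(end) >= '0' && body.charAt(end) <= '7') {
                        end++;
                    }
                    out.append((char) Integer.parseInt(body.substring(start, end), 8));
                    i = end;
                }
            }
        }
        return out.toString();
    }
}
```

If something like this is the right approach, I believe it could be called
from an action on the parser rule, for example:

    string returns [String value]
        : STRING { $value = StringUnescaper.unescape($STRING.text); }
        ;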

Regards,

Matt Palmer.