[antlr-interest] Lexer escape conversions

Fri Feb 23 18:31:35 PST 2007

I've been experimenting with using ANTLR to create parsers for escape
sequences. The test grammar I'm using is composed of words delimited by
whitespace. A word is composed of upper/lower case letters and escape
sequences. An escape is the '@' character followed by 'a', 'b', or 'c',
which represent the characters '@', '#', and '$' respectively.
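
To pin down the conversion I'm after, here is a minimal plain-Java sketch of
the mapping, independent of ANTLR. The class and method names (EscapeDecoder,
decode) are made up for illustration, and the sketch does not check that the
remaining characters are letters the way the grammar does.

```java
public class EscapeDecoder {
    // Escape flag as described above: '@' introduces an escape.
    private static final char ESCAPE_FLAG = '@';

    /** Returns the word with @a, @b and @c replaced by '@', '#' and '$'. */
    public static String decode(String word) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c != ESCAPE_FLAG) {
                out.append(c);          // ordinary character, copy through
                continue;
            }
            if (i + 1 >= word.length()) {
                throw new IllegalArgumentException(
                        "Dangling escape flag at index " + i);
            }
            char code = word.charAt(++i);   // consume the escape code
            switch (code) {
                case 'a': out.append('@'); break;
                case 'b': out.append('#'); break;
                case 'c': out.append('$'); break;
                default:
                    throw new IllegalArgumentException(
                            "Unknown escape @" + code + " at index " + (i - 1));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("ju@bmps"));  // prints "ju#mps"
    }
}
```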

This is easy, but clumsy, to accomplish in the parser as follows:
----------
/*
  * Parser Rules
  */

// List the words and drop the whitespace.
content
@init {System.out.println("Words:");}
     :   (
             w1=word             {System.out.println("  " + $w1.text);}
             (Whitespace w2=word {System.out.println("  " + $w2.text);})*
         )?
     ;

// Collect letters and escapes, but not whitespace, into words.
word : (Letters | Escape)+;

/*
  * Lexer Rules
  */

Letters    : Letter+;
Whitespace : (' ' | '\t' | '\u000B' | '\f' | '\r' | '\n')+;
Escape
     :    EscapeFlag (
              'a'  {text = "@";}
          |   'b'  {text = "#";}
          |   'c'  {text = "$";}
          )
     ;

fragment EscapeFlag : '@';
fragment Letter     : 'a'..'z' | 'A'..'Z';
----------

It would be cleaner to use the lexer to combine letters and converted
escapes into word tokens. Since an escape sequence can then no longer be
a token in its own right, it must be a fragment that returns the
converted value. The Word token rule can then append Letters text and
Escape values into a single string. The parser just gets Word tokens
that already include the converted escape values. The following should
work:

----------
/*
  * Parser Rules
  */

content
@init {System.out.println("Words:");}
     :   (
             Word  {System.out.println("  " + $Word.text);}
         )*
     ;

/*
  * Lexer Rules
  */

// Gather sequences of letters and escapes between whitespace
// into words.
Word
@init {text = "";}
     :   (
             Letters  {text += $Letters.text;}
         |   Escape   {text += $Escape.value;}
         )+
     ;
Whitespace
     :   (' ' | '\t' | '\u000B' | '\f' | '\r' | '\n')+
         {$channel = HIDDEN;}
     ;

fragment Escape
returns [char value]
     :   EscapeFlag (
             'a'  {$value = '@';}
         |   'b'  {$value = '#';}
         |   'c'  {$value = '$';}
         )
     ;
fragment EscapeFlag : '@';
fragment Letters    : Letter+;
fragment Letter     : 'a'..'z' | 'A'..'Z';
----------

However, the generated lexer code for the Word token never references
the returned $Escape.value character. Instead, it creates a new token of
invalid type for the input substring matched by the Escape fragment and
appends the token itself to the text String. For an input word
"ju@bmps", you get "ju[@-1,24:25='@b',<0>,0:-1]ps" instead of "ju#mps".

----------
case 2 :
     // C:\\Data\\Projects\\Java\\Antlr\\Examples\\Exp3.g:30:5: Escape
     {
     int Escape2Start = getCharIndex();
     mEscape();
     Token Escape2 = new CommonToken(input, Token.INVALID_TOKEN_TYPE,
             Token.DEFAULT_CHANNEL, Escape2Start, getCharIndex()-1);
     text += Escape2;

     }
     break;
----------
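
The bracketed text in the output is what Java string concatenation produces
when the right-hand side is an object: `text += Escape2` appends
Escape2.toString(). A minimal sketch of the difference, using a made-up
stand-in class (FakeToken is hypothetical, not ANTLR's CommonToken, and its
toString() format is invented):

```java
public class ConcatDemo {
    // Hypothetical stand-in for a token object; NOT ANTLR's CommonToken.
    static final class FakeToken {
        private final String matched;

        FakeToken(String matched) {
            this.matched = matched;
        }

        @Override
        public String toString() {
            return "[token '" + matched + "']";
        }
    }

    // Mirrors the shape of the generated code: appending the object
    // itself implicitly calls its toString().
    public static String appendToken(String text, FakeToken token) {
        text += token;
        return text;
    }

    // What the grammar action was meant to do: append the converted char.
    public static String appendValue(String text, char value) {
        text += value;
        return text;
    }

    public static void main(String[] args) {
        System.out.println(appendToken("ju", new FakeToken("@b")) + "mps");
        // prints "ju[token '@b']mps"
        System.out.println(appendValue("ju", '#') + "mps");
        // prints "ju#mps"
    }
}
```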

Weird and oogly. Can anyone shed some light on what I'm doing wrong,
whether there is a better way to do this, or whether this is a bug?

-- 
--------------------------------------------------------
"Any sufficiently over-complicated magic is indistinguishable from 
technology." -- Llelan D.