[antlr-interest] Lexer escape conversions
Curtis Clauson
NOSPAM at TheSnakePitDev.com
Fri Feb 23 18:31:35 PST 2007
I've been experimenting with using ANTLR to create parsers that handle
escape sequences. My test grammar consists of words delimited by
whitespace. A word is made up of upper- and lower-case letters and
escape sequences. An escape is the '@' character followed by 'a', 'b',
or 'c', which represent the characters '@', '#', and '$' respectively.
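For reference, the conversion described above can be sketched in plain
Java, independent of ANTLR. This is only an illustration of the intended
result for a single word; the class and method names (EscapeDecoder,
decode) are made up for this sketch, and the error handling for a
malformed escape is an assumption, since the grammar simply would not
match such input.

```java
// Plain-Java sketch of the intended escape conversion; not ANTLR output.
// EscapeDecoder and decode are hypothetical names used for illustration.
final class EscapeDecoder {
    static final char ESCAPE_FLAG = '@';

    /**
     * Replaces "@a", "@b", "@c" with '@', '#', '$'.
     * Any other character is copied unchanged (the grammar itself only
     * allows letters there).
     */
    static String decode(String word) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c != ESCAPE_FLAG) {
                out.append(c);
                continue;
            }
            // Assumption: a flag with nothing after it is reported as an error.
            if (i + 1 >= word.length()) {
                throw new IllegalArgumentException(
                    "dangling escape flag at index " + i);
            }
            char code = word.charAt(++i);
            switch (code) {
                case 'a': out.append('@'); break;
                case 'b': out.append('#'); break;
                case 'c': out.append('$'); break;
                default:
                    throw new IllegalArgumentException(
                        "unknown escape @" + code);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("ju@bmps")); // prints ju#mps
    }
}
```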
Doing the conversion in the parser is easy, but clumsy:
----------
/*
 * Parser Rules
 */

// List the words and drop the whitespace.
content
@init {System.out.println("Words:");}
    : (
        w1=word {System.out.println(" " + $w1.text);}
        (Whitespace w2=word {System.out.println(" " + $w2.text);})*
      )?
    ;

// Collect letters and escapes, but not whitespace, into words.
word : (Letters | Escape)+;

/*
 * Lexer Rules
 */

Letters : Letter+;

Whitespace : (' ' | '\t' | '\u000B' | '\f' | '\r' | '\n')+;

Escape
    : EscapeFlag (
        'a' {text = "@";}
      | 'b' {text = "#";}
      | 'c' {text = "$";}
      )
    ;

fragment EscapeFlag : '@';
fragment Letter : 'a'..'z' | 'A'..'Z';
----------
It would be cleaner to use the lexer to combine letters and converted
escapes into word tokens. Since an Escape sequence cannot be a token, it
must be a fragment that returns the converted value. The Word token rule
can then append Letters text and Escape values into a single string. The
parser just gets Word tokens that already include the converted escape
values. The following should work:
----------
/*
 * Parser Rules
 */

content
@init {System.out.println("Words:");}
    : (
        Word {System.out.println(" " + $Word.text);}
      )*
    ;

/*
 * Lexer Rules
 */

// Gather sequences of letters and escapes between whitespace
// into words.
Word
@init {text = "";}
    : (
        Letters {text += $Letters.text;}
      | Escape {text += $Escape.value;}
      )+
    ;

Whitespace
    : (' ' | '\t' | '\u000B' | '\f' | '\r' | '\n')+
      {$channel = HIDDEN;}
    ;

fragment Escape
returns [char value]
    : EscapeFlag (
        'a' {$value = '@';}
      | 'b' {$value = '#';}
      | 'c' {$value = '$';}
      )
    ;

fragment EscapeFlag : '@';
fragment Letters : Letter+;
fragment Letter : 'a'..'z' | 'A'..'Z';
----------
However, the generated lexer code for the Word token does not reference
the returned $Escape.value character. Instead, it creates a new invalid
token for the Escape fragment's input substring and appends the token
itself to the text String. For the input word "ju@bmps", you get
"ju[@-1,24:25='@b',<0>,0:-1]ps" instead of "ju#mps".
----------
case 2 :
    // C:\\Data\\Projects\\Java\\Antlr\\Examples\\Exp3.g:30:5: Escape
    {
    int Escape2Start = getCharIndex();
    mEscape();
    Token Escape2 = new CommonToken(input, Token.INVALID_TOKEN_TYPE,
        Token.DEFAULT_CHANNEL, Escape2Start, getCharIndex()-1);
    text += Escape2;
    }
    break;
----------
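The bracketed text in the output follows from how Java handles
`text += Escape2` when Escape2 is an object rather than a char: string
concatenation appends the object's toString() result. The sketch below
reproduces that effect in isolation. FakeToken is a hypothetical
stand-in, not the ANTLR runtime class, and its toString() merely
imitates the format seen in the output above.

```java
// Minimal sketch of the concatenation effect; not ANTLR-generated code.
final class ConcatDemo {
    // Hypothetical stand-in for a token object. The toString() format
    // only imitates the text observed in the lexer output.
    static final class FakeToken {
        final String matched;
        final int start;
        final int stop;

        FakeToken(String matched, int start, int stop) {
            this.matched = matched;
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return "[@-1," + start + ":" + stop + "='" + matched + "',<0>,0:-1]";
        }
    }

    // What the generated code effectively does: append the token object,
    // which Java converts with toString().
    static String appendToken(String text, Object token) {
        return text + token;
    }

    // What the grammar intended: append the converted character.
    static String appendValue(String text, char value) {
        return text + value;
    }

    public static void main(String[] args) {
        System.out.println(appendToken("ju", new FakeToken("@b", 24, 25)));
        System.out.println(appendValue("ju", '#'));
    }
}
```

The first call yields the token dump appended to "ju"; the second yields
"ju#", which is what the Word rule was meant to build.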
Weird and oogly. Can anyone shed some light on what I'm doing wrong,
suggest a better way to do this, or tell me whether this is a bug?
--
--------------------------------------------------------
"Any sufficiently over-complicated magic is indistinguishable from
technology." -- Llelan D.