[antlr-interest] String lexing and partial tokens

Sat Nov 25 14:10:21 PST 2006

At 06:58 26/11/2006, Terence Parr wrote:
 >
 >> On an only-slightly-related note, I was also wondering what's
 >> the right way to deal with lexical ambiguity?  Say I've got one
 >> parsing context (eg. after a #include in C) where backslashes
 >> are treated literally, not as escapes, and another context
 >> (anywhere else) where they should be used as an escape sequence.
 >> And again, ideally I want the resulting token to contain the
 >> 'real' string (ie. after escapes had been acted on).  Is this
 >> even possible?  (I imagine you could do it by treating it as an
 >> island grammar.  But that seems a little heavyweight.)
 >
 >Easy enough, just match \  with a rule called FILENAME after
 >'#include'.

So, this would mean that the lexer and parser run in parallel,
so that the parser can influence the lexer?  For some reason, I
always thought that the character stream was completely lexed
first, and only then were the resulting tokens parsed.

Anyway, I tried that and it gave me a warning:

warning(208): Message.g3:99:1: The following token definitions are unreachable: STRING

The relevant definitions are:

FILENAME: '"' content=UnquotedText '"'
          { emit($content); ltoken()->type = FILENAME; };

fragment UnquotedText:	(~'"')* ;

STRING: '"' content=EscapedText '"'
        { emit($content); ltoken()->type = STRING; };

fragment EscapedText: (EscapeSequence | ~('\\' | '"'))* ;

And yes, both FILENAME and STRING are referenced by the grammar.
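
To make the goal concrete, here is a rough hand-written sketch in Python (not ANTLR output, and not a claim about how ANTLR works internally) of the behaviour I'm after. It assumes the lexer itself remembers that the previous token was '#include' and uses that to choose between the two string forms, which both start with a double quote. The names tokenize, unescape, FILENAME_RE, STRING_RE and the small escape table are all made up for illustration.

```python
import re

# Hypothetical escape table; a real C lexer would handle many more forms.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}

# Same shapes as the two grammar rules: FILENAME takes anything up to the
# next quote, STRING takes escape pairs or any char other than \ and ".
FILENAME_RE = re.compile(r'"([^"]*)"')
STRING_RE = re.compile(r'"((?:\\.|[^"\\])*)"')
OTHER_RE = re.compile(r'[^\s"]+')


def unescape(raw):
    """Replace backslash escapes with the characters they stand for."""
    out = []
    i = 0
    while i < len(raw):
        if raw[i] == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            out.append(ESCAPES.get(nxt, nxt))  # unknown escapes keep the char
            i += 2
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)


def tokenize(text):
    """Return (type, text) pairs; quoted text after #include is a FILENAME."""
    tokens = []
    pos = 0
    after_include = False  # the only piece of context the lexer keeps
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace does not reset the context
            continue
        if text.startswith("#include", pos):
            tokens.append(("INCLUDE", "#include"))
            pos += len("#include")
            after_include = True
            continue
        if text[pos] == '"':
            if after_include:
                kind, m = "FILENAME", FILENAME_RE.match(text, pos)
            else:
                kind, m = "STRING", STRING_RE.match(text, pos)
            if m is None:
                raise ValueError(f"unterminated string at offset {pos}")
            body = m.group(1)
            # Only STRING tokens carry the 'real' (unescaped) text.
            tokens.append((kind, body if kind == "FILENAME" else unescape(body)))
            pos = m.end()
        else:
            m = OTHER_RE.match(text, pos)
            tokens.append(("OTHER", m.group()))
            pos = m.end()
        after_include = False
    return tokens
```

With this, tokenize('#include "dir\\new.h"') keeps the backslash in the FILENAME token as-is, while the same quoted text anywhere else would have its escapes replaced. The decision is made entirely inside the lexer, so it does not matter whether the parser runs afterwards or alongside it.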