[antlr-interest] String lexing and partial tokens

Sat Nov 25 14:51:40 PST 2006

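I think you need to put your keyword in front of the filename inside the
FILENAME rule itself. As written, FILENAME and STRING both start with '"',
so the lexer cannot tell them apart and STRING ends up unreachable.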

FILENAME: 'include' '"' content=UnquotedText '"' { emit($content);
ltoken()->type = FILENAME; };
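
As a rough illustration of the idea (in Python rather than ANTLR, and not a
description of what the generated lexer actually does): a hand-written scanner
can only choose between raw and escaped text if it has seen something
distinctive before the opening quote. The function name lex, the ESCAPES
table, and the optional whitespace after the keyword are all assumptions made
for this sketch.

```python
# Assumed escape set for the sketch; a real grammar would define its own.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def lex(text):
    """Return (type, value) pairs for quoted text found in `text`.

    Quoted text directly after the keyword 'include' is taken literally
    (FILENAME); any other quoted text has escapes applied (STRING).
    Unterminated quotes raise an exception; this is only a sketch.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Keyword first: this is what separates FILENAME from STRING.
        if text.startswith("include", i):
            j = i + len("include")
            while j < len(text) and text[j] in " \t":
                j += 1
            if j < len(text) and text[j] == '"':
                end = text.index('"', j + 1)  # backslashes are not special
                tokens.append(("FILENAME", text[j + 1:end]))
                i = end + 1
                continue
        if text[i] == '"':
            i += 1
            chars = []
            while text[i] != '"':
                if text[i] == "\\":
                    i += 1  # consume the backslash, translate the next char
                    chars.append(ESCAPES.get(text[i], text[i]))
                else:
                    chars.append(text[i])
                i += 1
            tokens.append(("STRING", "".join(chars)))
            i += 1  # skip the closing quote
            continue
        i += 1  # anything else is ignored in this sketch
    return tokens
```

Without the keyword check, both branches would begin at the same '"' and the
first one would always win, which is roughly the situation the warning
describes.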

/2ob

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Gavin Lambert
> Sent: 25 November 2006 22:10
> To: Terence Parr
> Cc: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] String lexing and partial tokens
> 
> At 06:58 26/11/2006, Terence Parr wrote:
>  >
>  >> On an only-slightly-related note, I was also wondering what's
>  >> the right way to deal with lexical ambiguity?  Say I've got one
>  >> parsing context (eg. after a #include in C) where backslashes
>  >> are treated literally, not as escapes, and another context
>  >> (anywhere else) where they should be used as an escape sequence.
>  >> And again, ideally I want the resulting token to contain the
>  >> 'real' string (ie. after escapes had been acted on).  Is this
>  >> even possible?  (I imagine you could do it by treating it as an
>  >> island grammar.  But that seems a little heavyweight.)
>  >
>  >Easy enough, just match \  with a rule called FILENAME after
>  >'#include'.
> 
> So, this would mean that the lexer and grammar are run in
> parallel, so that the grammar can influence the lexer?  For some
> reason, I always thought that the character stream was completely
> lexed, and then the resulting tokens were parsed.
> 
> Anyway, I tried that and it gave me a warning:
> 
> warning(208): Message.g3:99:1: The following token definitions are
> unreachable: STRING
> 
> The relevant definitions are:
> 
> FILENAME: '"' content=UnquotedText '"' { emit($content);
> ltoken()->type = FILENAME; };
> 
> fragment UnquotedText:	(~'"')* ;
> 
> STRING: '"' content=EscapedText '"'    { emit($content);
> ltoken()->type = STRING; };
> 
> fragment EscapedText: (EscapeSequence | ~('\\' | '"'))* ;
> 
> 
> And yes, both FILENAME and STRING are referenced by the grammar.