[antlr-interest] Stripping Tokens, Skipping leading text

Fri May 8 16:48:41 PDT 2009

At 11:33 9/05/2009, Christian Schladetsch wrote:
>My attempts so far have failed:
>
>     CODE_BLOCK: '[[' (options{greedy=false;}:.)* ']]' ;
>
>This correctly parses the entire token, but the token value in 
>the lexer contains the enclosing delimiters '[[' and ']]'

CODE_BLOCK: '[[' .* ']]' { setText($text.substring(2, $text.length()-2)); };

(A minor variation is needed to make it C#, where Substring takes a 
start index and a length rather than a start and an end index, but 
that should give you the general idea.)
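
To make the index arithmetic concrete, here is a minimal sketch in plain Java of what the action does to the token text. DelimiterStripper and stripDelimiters are hypothetical names used only for illustration; they are not part of ANTLR, and in the grammar the same expression would sit inside the setText(...) call.

```java
// Sketch only: DelimiterStripper and stripDelimiters are hypothetical names,
// not part of ANTLR or its runtime.
final class DelimiterStripper {

    // Removes `width` characters from each end of a matched token's text.
    static String stripDelimiters(String tokenText, int width) {
        if (tokenText.length() < 2 * width) {
            throw new IllegalArgumentException("token shorter than its delimiters");
        }
        // Java's substring takes (beginIndex, endIndex), so the end index
        // is length - width. C#'s Substring takes (startIndex, length), so
        // the equivalent there is
        // tokenText.Substring(width, tokenText.Length - 2 * width).
        return tokenText.substring(width, tokenText.length() - width);
    }

    public static void main(String[] args) {
        // The '[[' and ']]' delimiters are two characters wide each.
        System.out.println(stripDelimiters("[[x = 1;]]", 2)); // prints "x = 1;"
    }
}
```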

>While I'm here, I have a similar problem. I'd like to skip all 
>input until a starting token is found:
>
>     any text here that is not parsed lah di dah /** text here is parsed **/ no text parsing here

You might want to look into filter lexers or island grammars. 
But anyway, a rule along these lines consumes everything up to 
and including the opening '/**':

START
   : ( ~'/'
     | '/' ~'*'
     | '/*' ~'*'
     )*
     '/**'
   ;

This sort of thing is dangerous, though; there's a good chance 
that it will also interfere with lexing the contents you're 
actually trying to parse.

A better solution is to match the whole /** (anything) **/ 
sequence as a single lexer token, and then run another 
lexer/parser over the result -- i.e. an island grammar.
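
As a rough illustration of the two-pass idea, the sketch below uses a plain java.util.regex pattern as a stand-in for the outer lexer: it pulls out the text between each /** and **/ pair so that it could be handed to a second lexer/parser. IslandExtractor and extractIslands are hypothetical names, this is not ANTLR-generated code, and the second pass itself is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: IslandExtractor is a hypothetical stand-in for the outer
// lexer pass, written with java.util.regex instead of ANTLR.
final class IslandExtractor {

    // "/**", then as few characters as possible, then "**/".
    // DOTALL lets '.' match newlines so an island may span several lines.
    private static final Pattern ISLAND =
            Pattern.compile("/\\*\\*(.*?)\\*\\*/", Pattern.DOTALL);

    // Returns the contents of every island, without the delimiters.
    // Text outside the islands is simply skipped.
    static List<String> extractIslands(String input) {
        List<String> islands = new ArrayList<>();
        Matcher m = ISLAND.matcher(input);
        while (m.find()) {
            islands.add(m.group(1));
        }
        return islands;
    }

    public static void main(String[] args) {
        String input = "any text here that is not parsed lah di dah "
                + "/** text here is parsed **/ no text parsing here";
        for (String island : extractIslands(input)) {
            // Each island would be fed to the inner lexer/parser here.
            System.out.println("[" + island + "]");
        }
    }
}
```

Because the outer pass only ever looks for the delimiters, nothing in the skipped text can be mistaken for island content, which is the failure mode the START rule is prone to.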