[antlr-interest] Lexing nested comments

Wed Feb 10 11:35:16 PST 2010

Hi, I am an ANTLR newbie, so I apologize if the answer to this question
ends up being trivial.

I am trying to write an ANTLR lexer for a language in which C-style
comments may be nested and are ignored. So, something like:

  x = 3 /* /* this is ignored */ as is this */ ;

should just produce four non-hidden tokens: ID = NUMBER ;

I know there are several ways to approach this, including a recursive
definition of the comment token, something like:

  NESTED : '/*' (NESTED | .)* '*/' { $channel = HIDDEN; } ;

However, the language in question has the need to consider tokens like:

 /*:bool:*/

as a way of specifying explicit type information. Currently, what I have
handles the nested comments correctly, but then ignores /*:bool:*/ as
if it were a comment, even though I have a separate rule like:

  BOOL : '/*:bool:*/' ;

Is there an easy way around this problem?

Years ago I accomplished something very similar using lex/flex, and
then, later, in SableCC using explicit lexer states where I used a
separate token '/*' to mark the beginning of a comment and then to enter
the "comment" state (and as a side effect bumped up a nested-comment
counter). Since '/*' is shorter than '/*:bool:*/' it did not prevent the
BOOL token from being discovered; explicit states were used to indicate
that the BOOL token should only be scanned if in the "normal" (not the
"comment") state.
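
To make that concrete, here is a minimal hand-written Java sketch of the
state-plus-counter idea. It is not ANTLR, flex, or SableCC code; the
class name NestedCommentScanner, the scan method, and the simplified
token kinds are all made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-rolled scanner, for illustration only.
class NestedCommentScanner {
    static final String BOOL = "/*:bool:*/";

    /** Returns the kinds of the non-hidden tokens found in input. */
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int depth = 0; // nested-comment counter; 0 means the "normal" state
        int i = 0;
        while (i < input.length()) {
            if (depth == 0 && input.startsWith(BOOL, i)) {
                // BOOL is only recognized in the normal state
                tokens.add("BOOL");
                i += BOOL.length();
            } else if (input.startsWith("/*", i)) {
                depth++; // enter (or nest deeper into) the comment state
                i += 2;
            } else if (depth > 0 && input.startsWith("*/", i)) {
                depth--; // leave one level of comment
                i += 2;
            } else if (depth > 0) {
                i++; // comment body: skip the character
            } else {
                char c = input.charAt(i);
                if (Character.isWhitespace(c)) {
                    i++;
                } else if (Character.isLetter(c)) {
                    while (i < input.length()
                            && Character.isLetterOrDigit(input.charAt(i))) {
                        i++;
                    }
                    tokens.add("ID");
                } else if (Character.isDigit(c)) {
                    while (i < input.length()
                            && Character.isDigit(input.charAt(i))) {
                        i++;
                    }
                    tokens.add("NUMBER");
                } else {
                    tokens.add(String.valueOf(c)); // single-character token
                    i++;
                }
            }
        }
        if (depth > 0) {
            throw new IllegalStateException("unterminated comment");
        }
        return tokens;
    }
}
```

With this sketch, the example line above scans to ID = NUMBER ; and a
/*:bool:*/ that appears inside a comment is skipped along with the rest
of the comment, while one in normal code comes out as BOOL.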

It seems to me that ANTLR's semantic predicates could possibly be used
to solve this problem, but whenever I try, as in:

  BOOL : { n == 0 }? '/*:bool:*/' ;

if n > 0 (n being my nested-comment counter), the lexer just throws an
exception rather than ignoring that rule.

Any light that can be shed on this will be greatly appreciated.

Thanks in advance,

- Michael