[antlr-interest] Parsing "comment-like" sequences of arbitrary characters

Tue May 17 15:55:22 PDT 2011

Hello ANTLR-ites,

I'm trying to parse an "options" structure, like the following:

options {
   foo {
      bar {
         ww: $32.50;
         xx: Jekyll & Hyde;
      }
      yy.zz: @15% p/a;
   }
}

(Please ignore the nonsensical values for ww, xx and yy.zz -- I'm just making a point, which will become clearer below.) This options structure will be followed by a query expression whose grammar is more complicated and includes ints/floats, identifiers, operators, and so on.

The grammar I have for parsing the options structure is shown below. (The grammar for the query language is complicated and therefore omitted.)

<snip>

// ... other stuff here
tokens {
   // ... other ad hoc token values
   OPTION;
   OPTION_BLOCK;
   OPTION_VALUE;
}

// ...

query_options
  : OPTIONS^ option_block
  ;

option_block
  : L_BRACE option_def* R_BRACE ->
    ^(OPTION_BLOCK option_def*)
  ;

option_def
  : option_name option_value ->
    ^(OPTION option_name option_value)
  ;

option_name
  : ID (DOT^ ID)*
  ;

option_value
  : COLON^ (~SEMICOLON)* SEMICOLON!
  | option_block
  ;

//... other stuff here
//...

OPTIONS: 'options';
ID: (LETTER | '_') (LETTER | DIGIT | '_')*;
DOT: '.';
L_BRACE: '{';
R_BRACE: '}';
COLON: ':';
SEMICOLON: ';';

SL_COMMENT: '#' ~('\r' | '\n')* NEWLINE { skip(); };
WS: (' ' | '\f' | '\r' | '\t')+ { skip(); };

...

</snip>

As mentioned, the "options" clause is part of a larger grammar for a language that includes operators, identifiers, numbers, etc. Within the options clause, however, I want the characters between the colon and the semicolon to be treated as a single string, even when they include characters that would lex into other tokens used by the language. It feels like I should be able to use the same technique as for comment-stripping (i.e., see the line that begins with COLON^ in option_value). But this doesn't seem to work:
-  The "stray" characters that are not used elsewhere in the grammar are ignored and don't show up in the parse tree (e.g., $, @, %, and & in the example above).
-  Character sequences that form valid tokens for the rest of the language (like integers or identifiers) are lexed into those respective tokens instead of being slurped into a single string as intended.
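
To pin down what "slurped into a single string" is meant to look like, here is a minimal sketch in Python rather than ANTLR. It is only an illustration of the intended token stream, not a proposed fix for the grammar: the function name tokenize_options and the (type, text) pair layout are hypothetical, and the sketch assumes a value can never contain a ';' and that surrounding whitespace in a value is not significant.

```python
import re

# Assumption: identifiers follow the ID rule from the grammar
# (letter or '_' first, then letters, digits or '_').
_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize_options(text):
    """Split an options clause into (type, text) pairs.

    Hypothetical helper: once a ':' is seen, everything up to the next ';'
    is returned as one OPTION_VALUE token, whatever characters it contains.
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1                      # whitespace between tokens is skipped
        elif ch == ":":
            end = text.find(";", i + 1)
            if end == -1:
                raise ValueError("option value is missing its terminating ';'")
            # The raw characters between ':' and ';' form a single token.
            tokens.append(("OPTION_VALUE", text[i + 1:end].strip()))
            i = end + 1
        elif ch in "{}.":
            tokens.append((ch, ch))     # structural punctuation
            i += 1
        else:
            match = _ID.match(text, i)
            if match is None:
                raise ValueError("unexpected character %r at offset %d" % (ch, i))
            word = match.group()
            tokens.append(("OPTIONS" if word == "options" else "ID", word))
            i = match.end()
    return tokens


print(tokenize_options("xx: Jekyll & Hyde;"))
# prints [('ID', 'xx'), ('OPTION_VALUE', 'Jekyll & Hyde')]
```

The point of the sketch is that the decision to stop tokenizing normally is made at the character level, as soon as the ':' is seen, so characters such as & or $ never get a chance to be dropped or turned into other tokens.
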

For example, when I input a string like "options { foo: $ %     1 2 45 ^ $ $$$; }" and display the resulting tree.toStringTree(), I get
"(options (OPTION_BLOCK (OPTION foo (: 1 2 45))))"

Any guidance you have on the above will be greatly appreciated.

Thanks in advance.

++Rajesh