[antlr-interest] Parsing "comment-like" sequences of arbitrary characters
Rajesh Raman
rr at fb.com
Tue May 17 15:55:22 PDT 2011
Hello ANTLR-ites,
I'm trying to parse an "options" structure, like the following:
options {
    foo {
        bar {
            ww: $32.50;
            xx: Jekyll & Hyde;
        }
        yy.zz: @15% p/a;
    }
}
(Please ignore the nonsensical values for ww, xx, and yy.zz; I'm just making a point that will become clearer below.) This options structure will be followed by a query expression whose grammar is more complicated and includes ints/floats, identifiers, operators, etc.
My grammar for parsing the options structure is shown below. (The grammar for the query language is complicated and therefore omitted.)
<snip>
// ... other stuff here
tokens {
    // ... other ad hoc token values
    OPTION;
    OPTION_BLOCK;
    OPTION_VALUE;
}
// ...
query_options
    : OPTIONS^ option_block
    ;
option_block
    : L_BRACE option_def* R_BRACE ->
        ^(OPTION_BLOCK option_def*)
    ;
option_def
    : option_name option_value ->
        ^(OPTION option_name option_value)
    ;
option_name
    : ID (DOT^ ID)*
    ;
option_value
    : COLON^ (~SEMICOLON)* SEMICOLON!
    | option_block
    ;
//... other stuff here
//...
OPTIONS: 'options';
ID: (LETTER | '_') (LETTER | DIGIT | '_')*;
DOT: '.';
L_BRACE: '{';
R_BRACE: '}';
COLON: ':';
SEMICOLON: ';';
SL_COMMENT: '#' ~('\r' | '\n')* NEWLINE { skip(); };
WS: (' ' | '\f' | '\r' | '\t')+ { skip(); };
...
</snip>
As mentioned, the "options" clause is part of a larger grammar for a language that includes operators, identifiers, numbers, etc. Within the options clause, however, I want the characters between the colon and the semicolon to be treated as a single string, even if they include characters that would lex into other tokens used by the language. It feels like I should be able to use the same technique as in comment stripping (see the option_value alternative that begins with COLON^). But this doesn't seem to work:
- The "stray" characters that are not used elsewhere in the grammar are ignored and don't show up in the parse tree (e.g., $, @, %, and & in the example above).
- Character sequences that form valid tokens for the rest of the language (like integers or identifiers) are lexed into those respective tokens instead of being slurped into a single string as intended.
For example, when I feed in a string like "options { foo: $ % 1 2 45 ^ $ $$$; }" and print the resulting tree.toStringTree(), I get:
"(options (OPTION_BLOCK (OPTION foo (: 1 2 45))))"
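The output above looks consistent with the lexer tokenizing the whole input before the parser rule runs, so that (~SEMICOLON)* matches tokens rather than raw characters. The following is only a rough Python simulation of that two-phase behavior, not ANTLR itself. The token set (in particular the INT rule), the helper names tokenize and option_value_tokens, and the "drop any character no rule matches" recovery are all assumptions made for illustration.

```python
import re

# Hypothetical token set standing in for the real lexer rules (INT is assumed).
TOKEN_SPEC = [
    ("OPTIONS", r"options\b"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("INT", r"[0-9]+"),
    ("L_BRACE", r"\{"),
    ("R_BRACE", r"\}"),
    ("COLON", r":"),
    ("SEMICOLON", r";"),
    ("WS", r"[ \f\r\t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Phase 1: turn characters into (kind, text) tokens, context-free."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            # Assumed recovery: a character that no rule matches is dropped.
            pos += 1
            continue
        if match.lastgroup != "WS":  # whitespace is skipped, like skip()
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def option_value_tokens(tokens):
    """Phase 2: mimic COLON (~SEMICOLON)* SEMICOLON over the token list."""
    kinds = [kind for kind, _ in tokens]
    start = kinds.index("COLON") + 1
    end = kinds.index("SEMICOLON", start)
    return [text for _, text in tokens[start:end]]


sample = "options { foo: $ % 1 2 45 ^ $ $$$; }"
value = option_value_tokens(tokenize(sample))
print(value)
```

In this sketch the parser phase never sees $, %, or ^, because they were discarded during tokenization, and "1 2 45" arrives as three separate tokens. That mirrors both symptoms listed above.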
Any guidance you have on the above will be greatly appreciated.
Thanks in advance.
++Rajesh