[antlr-interest] Parsing "comment-like" sequences of arbitrary characters
Bart Kiers
bkiers at gmail.com
Wed May 18 00:55:34 PDT 2011
Hi Rajesh,
Inside a parser rule, `~` negates tokens, not characters. So if you have
no lexer rule that tokenizes '%', '^' or '$', then ~SEMICOLON won't match
any of those characters: the lexer never turns them into tokens, so the
parser never sees them.
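To make the token/character distinction concrete, here is a minimal sketch in plain Java. It is not ANTLR code: `lex` and `untilSemicolon` are hypothetical helpers that only imitate the two stages, assuming the lexer drops any character it has no rule for (after reporting an error) and that the parser-level `(~SEMICOLON)*` simply collects whatever tokens arrive before the first ';'.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenNegationSketch {

    // Hypothetical mini-lexer: every non-whitespace character listed in
    // `known` becomes a one-character token; any other character is dropped,
    // which imitates a lexer that has no rule for it.
    static List<String> lex(String input, String known) {
        List<String> tokens = new ArrayList<>();
        for (char c : input.toCharArray()) {
            if (Character.isWhitespace(c)) {
                continue; // whitespace is skipped, like the WS rule
            }
            if (known.indexOf(c) >= 0) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    // Imitates the parser-level (~SEMICOLON)*: it works on tokens,
    // so it can only collect what the lexer actually produced.
    static List<String> untilSemicolon(List<String> tokens) {
        List<String> matched = new ArrayList<>();
        for (String t : tokens) {
            if (t.equals(";")) {
                break;
            }
            matched.add(t);
        }
        return matched;
    }

    public static void main(String[] args) {
        String input = "$ % 1 2 ^ ;";
        // No "rules" for '$', '%', '^': they never reach the parser.
        System.out.println(untilSemicolon(lex(input, "12;")));    // [1, 2]
        // With "rules" for them, the negated set matches them as tokens.
        System.out.println(untilSemicolon(lex(input, "$%^12;"))); // [$, %, 1, 2, ^]
    }
}
```

The first call mirrors the tree you got (only the digits survive); the second mirrors what happens once DOLLAR, PERCENT and CARET exist as lexer rules.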
Your grammar (with minor modifications):
grammar Test;

options {
  output=AST;
}

tokens {
  OPTION;
  OPTION_BLOCK;
}

query_options
  : OPTIONS^ option_block
  ;

option_block
  : L_BRACE option_def* R_BRACE -> ^(OPTION_BLOCK option_def*)
  ;

option_def
  : option_name option_value -> ^(OPTION option_name option_value)
  ;

option_name
  : ID (DOT^ ID)*
  ;

option_value
  : COLON^ (~SEMICOLON)* SEMICOLON!
  | option_block
  ;

OPTIONS    : 'options';
ID         : (LETTER | '_') (LETTER | DIGIT | '_')*;
DOLLAR     : '$';
PERCENT    : '%';
CARET      : '^';
DOT        : '.';
L_BRACE    : '{';
R_BRACE    : '}';
COLON      : ':';
SEMICOLON  : ';';
DIGIT      : '0'..'9';
SL_COMMENT : '#' ~('\r' | '\n')* { skip(); };
WS         : (' ' | '\f' | '\r' | '\t')+ { skip(); };

fragment LETTER : 'a'..'z' | 'A'..'Z';

parses the input: "options { foo: $ % 1 2 45 ^ $ $$$; }" as follows:
(options (OPTION_BLOCK (OPTION foo (: $ % 1 2 4 5 ^ $ $ $ $))))
as you can see after running the test rig:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    // Lex and parse the sample input.
    ANTLRStringStream in = new ANTLRStringStream(
        "options { foo: $ % 1 2 45 ^ $ $$$; }");
    TestLexer lexer = new TestLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    TestParser parser = new TestParser(tokens);
    TestParser.query_options_return returnValue = parser.query_options();
    CommonTree tree = (CommonTree)returnValue.getTree();

    // Print the AST as a DOT graph, then as a flat string.
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
    System.out.println("-----------------------\n" + tree.toStringTree());
  }
}
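
Note that the value still arrives as separate tokens with the whitespace skipped, so "45" shows up as "4 5" and "$$$" as "$ $ $". If you want the value back as one string, one possible approach is to slice the original input between the character offsets of the surrounding tokens (as far as I know, ANTLR 3's CommonToken exposes getStartIndex() and getStopIndex() for this). The sketch below only illustrates the idea in plain Java, without ANTLR: `Tok`, `lex` and `rawValue` are made-up names, and it assumes one token per non-whitespace character.

```java
import java.util.ArrayList;
import java.util.List;

public class ValueSliceSketch {

    // Hypothetical token: a single character plus its offset in the input.
    static final class Tok {
        final char text;
        final int index;

        Tok(char text, int index) {
            this.text = text;
            this.index = index;
        }
    }

    // One token per non-whitespace character; whitespace is skipped,
    // so the tokens alone no longer record the original spacing.
    static List<Tok> lex(String input) {
        List<Tok> tokens = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!Character.isWhitespace(c)) {
                tokens.add(new Tok(c, i));
            }
        }
        return tokens;
    }

    // Returns the raw text between the first ':' and the next ';',
    // recovered by slicing the original input at the token offsets.
    // Returns null if no complete "name: value;" is present.
    static String rawValue(String input) {
        int colon = -1;
        for (Tok t : lex(input)) {
            if (colon < 0 && t.text == ':') {
                colon = t.index;
            } else if (colon >= 0 && t.text == ';') {
                return input.substring(colon + 1, t.index).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // prints: $ % 1 2 45 ^ $ $$$
        System.out.println(rawValue("options { foo: $ % 1 2 45 ^ $ $$$; }"));
    }
}
```

Slicing by offsets keeps the original spacing, which joining the token texts cannot do once WS has been skipped.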
Regards,
Bart.
On Wed, May 18, 2011 at 12:55 AM, Rajesh Raman <rr at fb.com> wrote:
> Hello ANTLR-ites,
>
> I'm trying to parse an "options" structure, like the following:
>
> options {
> foo {
> bar {
> ww: $32.50;
> xx: Jekyll & Hyde;
> }
> yy.zz: @15% p/a;
> }
> }
>
> (Please ignore the nonsensical values for ww, xx and yy.zz -- I'm just
> making a point, which will become clearer below). This options structure
> will be followed by a query expression whose grammar is more complicated,
> and includes ints/floats, identifiers, operators, etc. etc.
>
> The grammar I have for parsing the options structure looks like the below.
> (The grammar for the query language is complicated and therefore omitted.)
>
> <snip>
>
> // ... other stuff here
> tokens {
> // ... other ad hoc token values
> OPTION;
> OPTION_BLOCK;
> OPTION_VALUE;
> }
>
> // ...
>
> query_options
> : OPTIONS^ option_block
> ;
>
> option_block
> : L_BRACE option_def* R_BRACE ->
> ^(OPTION_BLOCK option_def*)
> ;
>
> option_def
> : option_name option_value ->
> ^(OPTION option_name option_value)
> ;
>
> option_name
> : ID (DOT^ ID)*
> ;
>
> option_value
> : COLON^ (~SEMICOLON)* SEMICOLON!
> | option_block
> ;
>
> //... other stuff here
> //...
>
> OPTIONS: 'options';
> ID: (LETTER | '_') (LETTER | DIGIT | '_')*;
> DOT: '.';
> L_BRACE: '{';
> R_BRACE: '}';
> COLON: ':';
> SEMICOLON: ';';
>
> SL_COMMENT: '#' ~('\r' | '\n')* NEWLINE { skip(); };
> WS: (' ' | '\f' | '\r' | '\t')+ { skip(); };
>
> ...
>
> </snip>
>
> As mentioned, the "options" clause is part of a larger grammar for a
> language that includes operators, identifiers, numbers, etc. However,
> within the options clause, I want the characters between the colon and the
> semicolon to be treated as a single string, regardless of the fact that it
> may contain characters that lex into other tokens used by the language.
> This feels like I should be able to use the same techniques as used in
> comment-stripping (i.e., see the line that has COLON^...). But this doesn't
> seem to work:
> - The "stray" characters that are not used elsewhere in the grammar are
> ignored and don't show up in the parse tree (e.g., $, @, %, &, in the
> example above)
> - Character sequences that form valid tokens for the rest of the language
> (like integers or identifiers) are lexed into those respective tokens
> instead of being slurped into a single string as intended.
>
> E.g., when I input a string like "options { foo: $ % 1 2 45 ^ $ $$$; }"
> and display the resulting tree.toStringTree(), I get
> "(options (OPTION_BLOCK (OPTION foo (: 1 2 45))))"
>
> Any guidance you have on the above will be greatly appreciated.
>
> Thanks in advance.
>
> ++Rajesh
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe:
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>