[antlr-interest] Parsing "comment-like" sequences of arbitrary characters

Wed May 18 00:55:34 PDT 2011

Hi Rajesh,

Inside a parser rule, `~` negates tokens, not characters. So if no lexer
rule produces a token for '%', '^' or '$', then ~SEMICOLON can never match
those characters: they never reach the parser as tokens in the first place.
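
To make the token-versus-character distinction concrete, here is a small plain-Java sketch. It is a hand-written illustration and not ANTLR-generated code: the class and method names (TokenNegationSketch, lex, valueTokens) are made up for this example, and the toy lexer simply drops characters it has no rule for, as a stand-in for characters that no lexer rule matches. It also keeps runs of digits together as one token, which is a simplification of its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration only; this is NOT ANTLR-generated code.
// All names here (TokenNegationSketch, lex, valueTokens) are hypothetical.
final class TokenNegationSketch {

    // Split the input into tokens. Characters listed in `known` become
    // single-character tokens, runs of letters/digits/underscores become one
    // word token, and whitespace is skipped. Any other character produces no
    // token at all (standing in for a character no lexer rule matches).
    static List<String> lex(String input, String known) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                int start = i;
                while (i < input.length()
                        && (Character.isLetterOrDigit(input.charAt(i))
                            || input.charAt(i) == '_')) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else {
                if (known.indexOf(c) >= 0) {
                    tokens.add(String.valueOf(c));
                }
                // Whitespace and unknown characters: no token is produced.
                i++;
            }
        }
        return tokens;
    }

    // Token-level analogue of `COLON (~SEMICOLON)* SEMICOLON`: collect every
    // token after the first ":" up to (not including) the next ";".
    static List<String> valueTokens(List<String> tokens) {
        List<String> value = new ArrayList<>();
        int i = tokens.indexOf(":");
        if (i < 0) {
            return value;
        }
        for (i++; i < tokens.size() && !tokens.get(i).equals(";"); i++) {
            value.add(tokens.get(i));
        }
        return value;
    }

    public static void main(String[] args) {
        String input = "foo: $ % 1 2 45 ^ $ $$$;";
        // No rules for '$', '%', '^': they never become tokens.
        System.out.println(valueTokens(lex(input, ":;")));
        // With rules for them, the "not SEMICOLON" loop can see them.
        System.out.println(valueTokens(lex(input, ":;$%^")));
    }
}
```

The "anything but SEMICOLON" loop is identical in both calls; only the set of characters the lexer turns into tokens changes, and that alone decides whether '$', '%' and '^' show up in the result.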

Your grammar (with minor modifications):

grammar Test;

options {
  output=AST;
}

tokens {
  OPTION;
  OPTION_BLOCK;
}

query_options
 : OPTIONS^ option_block
 ;

option_block
 : L_BRACE option_def* R_BRACE -> ^(OPTION_BLOCK option_def*)
 ;

option_def
 : option_name option_value -> ^(OPTION option_name option_value)
 ;

option_name
 : ID (DOT^ ID)*
 ;

option_value
 : COLON^ (~SEMICOLON)* SEMICOLON!
 | option_block
 ;

OPTIONS : 'options';
ID: (LETTER | '_') (LETTER | DIGIT | '_')*;
DOLLAR: '$';
PERCENT: '%';
CARET: '^';
DOT: '.';
L_BRACE: '{';
R_BRACE: '}';
COLON: ':';
SEMICOLON: ';';
DIGIT : '0'..'9';
SL_COMMENT: '#' ~('\r' | '\n')* { skip(); };
WS: (' ' | '\f' | '\r' | '\t')+ { skip(); };
fragment LETTER : 'a'..'z' | 'A'..'Z';

parses the input: "options { foo: $ %     1 2 45 ^ $ $$$; }" as follows:

(options (OPTION_BLOCK (OPTION foo (: $ % 1 2 4 5 ^ $ $ $ $))))

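Note that 45 comes out as two DIGIT tokens, since this cut-down grammar has
no lexer rule for multi-digit numbers. You can check the result by running
this test rig: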

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream(
                "options { foo: $ %     1 2 45 ^ $ $$$; }");
        TestLexer lexer = new TestLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        TestParser.query_options_return returnValue =
                parser.query_options();
        CommonTree tree = (CommonTree)returnValue.getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
        System.out.println("-----------------------\n" +
                tree.toStringTree());
    }
}

Regards,

Bart.

On Wed, May 18, 2011 at 12:55 AM, Rajesh Raman <rr at fb.com> wrote:

> Hello ANTLR-ites,
>
> I'm trying to parse an "options" structure, like the following:
>
> options {
>   foo {
>      bar {
>         ww: $32.50;
>         xx: Jekyll & Hyde;
>      }
>      yy.zz: @15% p/a;
>   }
> }
>
> (Please ignore the nonsensical values for ww, xx and yy.zz -- I'm just
> making a point, which will become clearer below.) This options structure
> will be followed by a query expression whose grammar is more complicated,
> and includes ints/floats, identifiers, operators, etc.
>
> The grammar I have for parsing the options structure looks like the below.
> (The grammar for the query language is complicated and therefore omitted.)
>
> <snip>
>
> // ... other stuff here
> tokens {
>   // ... other ad hoc token values
>   OPTION;
>   OPTION_BLOCK;
>   OPTION_VALUE;
> }
>
> // ...
>
> query_options
>  : OPTIONS^ option_block
>  ;
>
> option_block
>  : L_BRACE option_def* R_BRACE ->
>    ^(OPTION_BLOCK option_def*)
>  ;
>
> option_def
>  : option_name option_value ->
>    ^(OPTION option_name option_value)
>  ;
>
> option_name
>  : ID (DOT^ ID)*
>  ;
>
> option_value
>  : COLON^ (~SEMICOLON)* SEMICOLON!
>  | option_block
>  ;
>
> //... other stuff here
> //...
>
> OPTIONS: 'options';
> ID: (LETTER | '_') (LETTER | DIGIT | '_')*;
> DOT: '.';
> L_BRACE: '{';
> R_BRACE: '}';
> COLON: ':';
> SEMICOLON: ';';
>
> SL_COMMENT: '#' ~('\r' | '\n')* NEWLINE { skip(); };
> WS: (' ' | '\f' | '\r' | '\t')+ { skip(); };
>
> ...
>
> </snip>
>
> As mentioned, the "options" clause is part of a larger grammar for a
> language that includes operators, identifiers, numbers, etc. However,
> within the options clause, I want the characters between the colon and the
> semicolon to be treated as a single string, even if they include
> characters that lex into other tokens used by the language.
> It feels like I should be able to use the same technique that is used for
> comment-stripping (i.e., see the line that has COLON^...). But this doesn't
> seem to work:
> -  The "stray" characters that are not used elsewhere in the grammar are
> ignored and don't show up in the parse tree (e.g., $, @, %, &, in the
> example above)
> -  Character sequences that form valid tokens for the rest of the language
> (like integers or identifiers) are lexed into those respective tokens
> instead of being slurped into a single string as intended.
>
> E.g., when I input a string like "options { foo: $ %     1 2 45 ^ $ $$$; }"
> and display the resulting tree.toStringTree(), I get
> "(options (OPTION_BLOCK (OPTION foo (: 1 2 45))))"
>
> Any guidance you have on the above will be greatly appreciated.
>
> Thanks in advance.
>
> ++Rajesh