[antlr-interest] Trouble parsing a language where '{' has too many meanings

Fri Jul 6 12:13:00 PDT 2007

Hello everyone,

ANTLR is a great tool, and Terence and his folks have been doing a
wonderful job on it. Thank you!!

However, I am getting frustrated by a problem I have not been able to
solve. Maybe someone here can help me; any advice is greatly appreciated.

I am trying to write a parser that recognizes data record schema
descriptions like the following:

blubber {
    Type = Hash
    ShortHelp = "A short comment"
    LongHelp = {
        Some other comment ending with a dot.
    }.
    Items {
        FirstName {
            Type = String, ShortHelp = "Hallo"
            LongHelp = {
                Long Explanatory test spanning
                over multiple lines
            }.
        }
        LastName {
            Type = String
            Default = "Blah"
            ShortHelp = "(not so) interesting comment"
        }
    }
}

The grammar is pretty simple, but I am having a really hard time
handling the multi-line text in the 'LongHelp' blocks.
The problem seems to be that curly braces surround the multi-line text
blocks but also delimit sections and item declarations.

My grammar looks as follows (condensed version):

grammar TEST;

options {
    output = AST;
}

tokens {
    ROOT ;
    SECTION ;
    ID ;
    DECL ;
    COMMA = ',' ;
    DOT = '.' ;
    QUOT = '"' ;
}

@lexer::header {
package foo;
}

@header {
package foo;
}

root
    :    section+ EOF
            -> ^( ROOT section+ )
    ;

section
    :    IDENT '{' section_type_decl '}'
            -> ^(SECTION ^(ID IDENT) section_type_decl)
     ;

section_type_decl
     :    'Type' '=' hash_type_decl
    ;

hash_type_decl
    :    'Hash' ( COMMA? hash_decl_elem ( COMMA? hash_decl_elem )* )?
            -> ^( DECL ^( 'Type' 'Hash') hash_decl_elem+ )
    ;

hash_decl_elem
    :    'Items' hash_items_decl
            -> ^('Items' hash_items_decl)
    |    help
    ;

hash_items_decl
    :    '{' hash_item_decl ( (COMMA?)! hash_item_decl )* '}'
    ;

hash_item_decl
    :    IDENT '{' hash_item_decl_elem ( COMMA? hash_item_decl_elem )* '}'
            -> ^( DECL ^( ID IDENT ) hash_item_decl_elem+)
    ;

hash_item_decl_elem
    :    'Type' '=' basic_type_decl
    ;

basic_type_decl
    :    'String' ( COMMA? string_decl_elem ( COMMA? string_decl_elem)* )?
            -> ^( 'Type' 'String' ) string_decl_elem*
    ;

string_decl_elem
    :    'Default' '=' STRING
            -> ^('Default' STRING)
    |    'Mandatory' '=' ( 'true' -> ^('Mandatory' 'true' )
                         | 'false' -> ^( 'Mandatory' 'false' ) )
    |    help
    ;

help
    :    'ShortHelp' '=' STRING  -> ^( 'ShortHelp' STRING )
//    |    'LongHelp' '=' ML_TEXT   -> ^( 'LongHelp' ML_TEXT )
    |    'LongHelp' '=' text   -> ^( 'LongHelp' text )
    ;

text
    :    '{'! ( options { greedy=false; } : .* )^ '}'! DOT!
    ;

fragment DIGIT
    :    ('0'..'9') ;

fragment LETTER
    :    ('A'..'Z' | 'a'..'z') ;

fragment UC_LETTER
    :    ('A'..'Z') ;

STRING
    :    QUOT ( ~( QUOT | '\n' ) )* QUOT
         {setText(getText().substring(1, getText().length()-1));}
    ;

ML_TEXT
    :    '{'
        ( options {greedy=false;} : . )*
        '}' '.' {setText(getText().substring(1, getText().length()-2));}
    ;

IDENT
    :    UC_LETTER ( LETTER | DIGIT )*
    ;

WS  :     ( ' ' | '\r' '\n' | '\n' | '\t' ) { $channel=HIDDEN; }
    ;

I have tried several ways to capture the LongHelp content, but none
worked. The cause is that the lexer thinks it sees a multi-line text
block as soon as it sees a '{' (even if that '{' does not mark the
beginning of a multi-line text block).
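
To illustrate what I mean, here is a minimal Java sketch of the behaviour
I observe. It is not ANTLR's generated code; the class and method names
are made up for this mail. It mimics my ML_TEXT rule by treating any '{'
as the start of a text block and reading up to the first '}' followed by
'.':

```java
public class NaiveBraceLexer {
    /**
     * Mimics the ML_TEXT rule without any context: starting at a '{',
     * consume non-greedily up to the first "}." and return the text in
     * between. Returns null if there is no '{' at 'start' or no "}.".
     */
    public static String matchMlText(String input, int start) {
        if (start >= input.length() || input.charAt(start) != '{') {
            return null;
        }
        int end = input.indexOf("}.", start + 1);
        if (end < 0) {
            return null;
        }
        return input.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String schema = "blubber { Type = Hash LongHelp = { Some comment. }. }";
        // The first '{' opens the section, not a text block, yet the
        // rule swallows everything up to the first "}." it can find.
        int firstBrace = schema.indexOf('{');
        System.out.println("[" + matchMlText(schema, firstBrace) + "]");
    }
}
```

The section-opening brace is taken as the start of a text block, so
'Type', '=' and 'LongHelp' never reach the parser as separate tokens.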

What confuses me is that writing a parser for this by hand appears
straightforward: I am the parser and I have seen 'LongHelp' followed by
'=' followed by '{', so, lexer, give me all text until you see a '}'
followed by a '.'.
Apparently ANTLR works exactly the opposite way: I am the lexer, and I
see a '{'. I assume that this marks the start of a multi-line text
block, so I read everything until I see a '}' followed by a '.' and give
this to the parser.
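
To make the hand-written idea concrete, here is a minimal Java sketch of
what I have in mind. The names LongHelpScanner, readLongHelp and skipWs
are hypothetical, and it assumes the text itself never contains "}.".
The raw text is only read after 'LongHelp', '=' and '{' have been seen:

```java
public class LongHelpScanner {
    /**
     * Expects "LongHelp", '=' and '{' (separated by optional whitespace)
     * beginning at 'start' and returns the trimmed text up to the
     * closing "}.". Returns null if that prefix is missing or the
     * block is not terminated.
     */
    public static String readLongHelp(String input, int start) {
        int pos = skipWs(input, start);
        if (!input.startsWith("LongHelp", pos)) {
            return null;
        }
        pos = skipWs(input, pos + "LongHelp".length());
        if (pos >= input.length() || input.charAt(pos) != '=') {
            return null;
        }
        pos = skipWs(input, pos + 1);
        if (pos >= input.length() || input.charAt(pos) != '{') {
            return null;
        }
        int end = input.indexOf("}.", pos + 1);
        if (end < 0) {
            return null;
        }
        return input.substring(pos + 1, end).trim();
    }

    /** Advances 'pos' past any whitespace and returns the new position. */
    public static int skipWs(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String schema = "blubber { Type = Hash LongHelp = { Some comment. }. }";
        // Text mode is entered only because 'LongHelp' '=' '{' came first.
        System.out.println(readLongHelp(schema, schema.indexOf("LongHelp")));
        // The section-opening brace is not mistaken for a text block.
        System.out.println(readLongHelp(schema, 0));
    }
}
```

Here the decision to switch into "raw text" mode is driven by what was
seen before the '{', which is exactly what I cannot get the ANTLR lexer
to do.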

I have tried not matching the surrounding '{' ... '}' '.' in a parser
rule, but this doesn't help, for obvious reasons.

Can someone shed light on this, please? It seems to be such a simple
problem...

thank you,
felix