[antlr-interest] Trouble parsing a language where '{' has too many meanings
Felix Schmid
felix at belugalounge.net
Fri Jul 6 12:13:00 PDT 2007
Hello everyone,
ANTLR is a great tool and Terence and his folks have been doing a
wonderful job on it. Thank you!!
However, I am getting frustrated by a problem I have not been able to solve.
Maybe someone here can help me; any advice is greatly appreciated.
I am trying to write a parser that recognizes data record schema
descriptions like the following:
blubber {
    Type = Hash
    ShortHelp = "A short comment"
    LongHelp = {
        Some other comment ending with a dot.
    }.
    Items {
        FirstName {
            Type = String, ShortHelp = "Hallo"
            LongHelp = {
                Long Explanatory test spanning
                over multiple lines
            }.
        }
        LastName {
            Type = String
            Default = "Blah"
            ShortHelp = "(not so) interesting comment"
        }
    }
}
The grammar is pretty simple, but I am having a really hard time
handling the multi-line text in the 'LongHelp' blocks.
The problem seems to be that curly braces surround not only the
multi-line text blocks but also the sections and item lists.
My grammar looks as follows (condensed version):
grammar TEST;

options {
    output = AST;
}

tokens {
    ROOT ;
    SECTION ;
    ID ;
    DECL ;
    COMMA = ',' ;
    DOT = '.' ;
    QUOT = '"' ;
}

@lexer::header {
    package foo;
}

@header {
    package foo;
}

root
    : section+ EOF
      -> ^( ROOT section+ )
    ;

section
    : IDENT '{' section_type_decl '}'
      -> ^(SECTION ^(ID IDENT) section_type_decl)
    ;

section_type_decl
    : 'Type' '=' hash_type_decl
    ;

hash_type_decl
    : 'Hash' ( COMMA? hash_decl_elem ( COMMA? hash_decl_elem )* )?
      -> ^( DECL ^( 'Type' 'Hash') hash_decl_elem+ )
    ;

hash_decl_elem
    : 'Items' hash_items_decl
      -> ^('Items' hash_items_decl)
    | help
    ;

hash_items_decl
    : '{' hash_item_decl ( (COMMA?)! hash_item_decl )* '}'
    ;

hash_item_decl
    : IDENT '{' hash_item_decl_elem ( COMMA? hash_item_decl_elem )* '}'
      -> ^( DECL ^( ID IDENT ) hash_item_decl_elem+)
    ;

hash_item_decl_elem
    : 'Type' '=' basic_type_decl
    ;

basic_type_decl
    : 'String' ( COMMA? string_decl_elem ( COMMA? string_decl_elem)* )?
      -> ^( 'Type' 'String' ) string_decl_elem*
    ;

string_decl_elem
    : 'Default' '=' STRING
      -> ^('Default' STRING)
    | 'Mandatory' '=' ( 'true' -> ^('Mandatory' 'true' )
                      | 'false' -> ^( 'Mandatory' 'false' ) )
    | help
    ;

help
    : 'ShortHelp' '=' STRING -> ^( 'ShortHelp' STRING )
//  | 'LongHelp' '=' ML_TEXT -> ^( 'LongHelp' ML_TEXT )
    | 'LongHelp' '=' text -> ^( 'LongHelp' text )
    ;

text
    : '{'! ( options { greedy=false; } : .* )^ '}'! DOT!
    ;

fragment DIGIT
    : ('0'..'9') ;

fragment LETTER
    : ('A'..'Z' | 'a'..'z') ;

fragment UC_LETTER
    : ('A'..'Z') ;

STRING
    : QUOT ( ~( QUOT | '\n' ) )* QUOT
      {setText(getText().substring(1, getText().length()-1));}
    ;

ML_TEXT
    : '{'
      ( options {greedy=false;} : . )*
      '}' '.' {setText(getText().substring(1, getText().length()-2));}
    ;

IDENT
    : UC_LETTER ( LETTER | DIGIT )*
    ;

WS  : ( ' ' | '\r' '\n' | '\n' | '\t' ) { $channel=HIDDEN; }
    ;
I have tried several ways to capture the LongHelp content, but none
worked. The cause is that the lexer thinks it sees a multi-line text block
as soon as it sees a '{' (even if that '{' does not denote the beginning of
a multi-line text block).
What confuses me is that writing a parser for this by hand appears
straightforward: I am the parser and I have seen 'LongHelp' followed by
'=' followed by '{', so, lexer, give me all text until you see a '}'
followed by a '.'.
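To make the hand-written idea concrete, here is a minimal sketch in Java. It is only an illustration under my own assumptions, not ANTLR code: the class name HandWrittenScanner and its methods are made up for this example, and it handles just enough of the format to show the point. The scanner looks at the two tokens it has already produced, and only when they are 'LongHelp' and '=' does it treat the next '{' as the start of raw text running up to "}.".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; all names here are illustrative assumptions.
public class HandWrittenScanner {

    /** Splits src into tokens; a LongHelp body comes back as one raw-text token. */
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // skip whitespace between tokens
            } else if (c == '{' && followsLongHelpEquals(tokens)) {
                // Context says this '{' opens a text block: read up to "}."
                int end = src.indexOf("}.", i);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated LongHelp block");
                }
                tokens.add(src.substring(i + 1, end).trim());
                i = end + 2;                           // continue after "}."
            } else if (c == '"') {
                int end = src.indexOf('"', i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated string");
                }
                tokens.add(src.substring(i, end + 1)); // keep the quotes
                i = end + 1;
            } else if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) {
                    i++;
                }
                tokens.add(src.substring(start, i));   // identifier or keyword
            } else {
                tokens.add(String.valueOf(c));         // '{', '}', '=', ',' and so on
                i++;
            }
        }
        return tokens;
    }

    /** True if the last two tokens emitted were 'LongHelp' and '='. */
    private static boolean followsLongHelpEquals(List<String> tokens) {
        int n = tokens.size();
        return n >= 2
                && tokens.get(n - 1).equals("=")
                && tokens.get(n - 2).equals("LongHelp");
    }

    public static void main(String[] args) {
        String src = "Name {\n  LongHelp = {\n    Some text.\n  }.\n}";
        System.out.println(scan(src));
    }
}
```

The '{' after Name comes out as an ordinary brace token, while the '{' after 'LongHelp' '=' starts the text block. That context check is exactly what I do not know how to express in the grammar.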
Apparently ANTLR works exactly the opposite way: I am the lexer, and I
see a '{'. I assume that this denotes the start of a multi-line comment,
so I read everything until I see a '}' followed by a '.' and give this to
the parser.
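The effect can be imitated outside ANTLR with a plain regular expression. This is only a rough stand-in for the ML_TEXT rule and not how the generated lexer is actually implemented; the class name ContextFreeLexerDemo and the shortened input are assumptions made for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo; a regex stands in for the context-free ML_TEXT rule.
public class ContextFreeLexerDemo {

    // '{', then any characters (non-greedy, newlines included), then "}."
    static final Pattern ML_TEXT = Pattern.compile("\\{.*?\\}\\.", Pattern.DOTALL);

    /** Returns what the rule claims starting at the first '{', or null if nothing matches. */
    static String firstMatch(String src) {
        Matcher m = ML_TEXT.matcher(src);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String src = "Name { Type = Hash LongHelp = { Some text. }. }";
        System.out.println(firstMatch(src));
    }
}
```

Because the rule knows nothing about what came before the '{', the match starts at the section's opening brace and swallows 'Type', 'Hash', 'LongHelp' and '=' along with the help text.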
I have tried not matching the surrounding '{' ... '}' '.' in a parser
rule, but this doesn't help, for obvious reasons.
Can someone shed light on this, please? It seems to be such a simple
problem...
thank you,
felix