[antlr-interest] Matching empty string
John B. Brodie
jbb at acm.org
Mon Jun 15 07:45:53 PDT 2009
Greetings!
On Sun, 2009-06-14 at 23:09 -0400, Dukie Banderjee wrote:
> Hi,
> My grammar needs to handle the following situation: A line can have
> multiple fields, separated by a delimiter. A field can have multiple
> components, separated by another delimiter.
> If a field or component is blank, it should be counted as a blank
> field or blank component. For example with field delimiter '+' and
> component delimiter ':' :
> UNB++::+123
> is a 'UNB' line with 3 fields. The first field is blank, the second
> field has 3 blank components, and the last field has a single
> component with the value '123'.
>
> Here is my grammar so far:
>
> line : TEXT fields ;
> fields : field* terminator! ;
> field : SEP t=fieldText? -> ^(FIELD $t?) ;
> fieldText : comp (CSEP comp)* ;
> comp : t=TEXT -> ^(COMP $t) ;
>
> When a field is blank, e.g. '++', this correctly generates a ^(FIELD)
> with no children. However, I don't know how to get similar behaviour
> for the components, because whereas a field starts with a SEP and
> optional TEXT, the component may or may not have a leading CSEP.
>
> When the input is '+::+', there are three components, but the first is
> entirely blank, an empty string.
>
> What I would like is that '+::+' creates ^(FIELD ^(COMP) ^(COMP)
> ^(COMP)). How can I accomplish this?
>
Attached is a complete example accomplishing what you asked for.
In summary, I think the key is to realize that when a field has exactly
1 component, exactly 1 TEXT must be present and, obviously, no CSEP.
When there are 2 or more components, it is the CSEP(s) that must be
present, and each TEXT is optional. In light of this, the fieldText rule
needs to be split into 2 cases:
    fieldText : comp | ((comp? CSEP)+ comp?) ;
The attached grammar is slightly more complicated because I made a new
virtual token EMPTY in order to make empty fields and/or components more
explicit in the resultant AST.
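To make the two cases concrete, here is a minimal sketch in plain Java
(not ANTLR, and not part of the attached grammar) of the shape I am
aiming for. The class FieldTextSketch and the method fieldTree are
hypothetical names used only for illustration; the method takes the text
that follows one '+' and renders it the way the FIELD subtree is meant
to look, with EMPTY standing in for a blank field or component.

```java
public class FieldTextSketch {
    // Hypothetical helper, not generated by ANTLR: renders the text of one
    // field (everything after a '+') in the shape of the intended subtree.
    static String fieldTree(String fieldText) {
        if (fieldText.isEmpty()) {
            // blank field: no TEXT and no CSEP at all
            return "(FIELD EMPTY)";
        }
        // A limit of -1 keeps trailing empty strings, so n CSEPs always
        // produce n + 1 components, each of which may be blank.
        String[] parts = fieldText.split(":", -1);
        StringBuilder sb = new StringBuilder("(FIELD");
        for (String part : parts) {
            sb.append(" (COMP ")
              .append(part.isEmpty() ? "EMPTY" : part)
              .append(")");
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        // one case with no CSEP, the rest with two CSEPs each
        for (String s : new String[] {"", "123", "::", "x::z", ":y:"}) {
            System.out.println("`" + s + "` -> " + fieldTree(s));
        }
    }
}
```

With no CSEP there is either nothing (a blank field) or exactly one
TEXT; as soon as a CSEP appears, the component count is fixed by the
number of CSEPs and every TEXT becomes optional. That is the same split
the fieldText rule makes.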
Hope this helps....
-jbb
-------------- next part --------------
grammar Test;

options {
    output = AST;
    ASTLabelType = CommonTree;
}

tokens { FIELD; COMP; EMPTY; }

@members {
    // sample inputs covering blank fields and blank components
    private static final String [] x = new String[]{
        "UNB++::+123\n",
        "UNB++::z+123\n",
        "UNB++x::z+123\n",
        "UNB++:y:+123\n",
        "UNB++x:y:z+123\n"
    };

    // parse each sample and print the resulting AST
    public static void main(String [] args) {
        for( int i = 0; i < x.length; ++i ) {
            try {
                System.out.println("about to parse:`"+x[i]+"`");
                TestLexer lexer = new TestLexer(new ANTLRStringStream(x[i]));
                CommonTokenStream tokens = new CommonTokenStream(lexer);
                TestParser parser = new TestParser(tokens);
                TestParser.start_return p_result = parser.start();
                CommonTree ast = p_result.tree;
                if( ast == null ) {
                    System.out.println("resultant tree: is NULL");
                } else {
                    System.out.println("resultant tree: " + ast.toStringTree());
                }
                System.out.println();
            } catch(Exception e) {
                e.printStackTrace();
            }
        }
    }
}

start : line EOF! ;

line : TEXT fields ;

fields : field* terminator! ;

// make the fact that a field is empty explicit.
// replaced this rule--- field : SEP t=fieldText? -> ^(FIELD $t?) ;
// with---
field : SEP! opt_fieldText ;

opt_fieldText
    : (t=fieldText -> ^(FIELD $t))
    | (/*empty*/   -> ^(FIELD EMPTY))
    ;

fieldText
    : comp                           // single component, value MUST be present
    | ((opt_comp CSEP!)+ opt_comp)   // 2 or more components, each possibly empty
    ;

opt_comp
    : comp
    | (/*empty*/ -> ^(COMP EMPTY))
    ;

comp : t=TEXT -> ^(COMP $t) ;

terminator : NL ;

SEP  : '+' ;
CSEP : ':' ;
TEXT : ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
NL   : ('\r'|'\n')+ ;