[antlr-interest] Matching empty string

Mon Jun 15 07:45:53 PDT 2009

Greetings!

On Sun, 2009-06-14 at 23:09 -0400, Dukie Banderjee wrote:
> Hi,
> My grammar needs to handle the following situation: A line can have
> multiple fields, separated by a delimiter. A field can have multiple
> components, separated by another delimiter.
> If a field or component is blank, it should be counted as a blank
> field or blank component. For example with field delimiter '+' and
> component delimiter ':' :
> UNB++::+123
> is a 'UNB' line with 3 fields. The first field is blank, the second
> field has 3 blank components, and the last field has a single
> component with the value '123'.
> 
> Here is my grammar so far:
> 
> line         : TEXT fields ;
> fields        : field* terminator! ;
> field        : SEP t=fieldText? -> ^(FIELD $t?) ;
> fieldText    : comp (CSEP comp)* ;
> comp        : t=TEXT -> ^(COMP $t) ;
> 
> When a field is blank, e.g. '++', this correctly generates a ^(FIELD)
> with no children. However, I don't know how to get similar behaviour
> for the components, because whereas a field starts with a SEP and
> optional TEXT, the component may or may not have a leading CSEP. 
> 
> When the input is '+::+', there are three components, but the first is
> entirely blank, an empty string. 
> 
> What I would like is that '+::+' creates ^(FIELD ^(COMP) ^(COMP)
> ^(COMP)). How can I accomplish this?
> 

Attached is a complete example accomplishing what you asked for.

In summary, I think the key is to distinguish two cases. A field with
exactly one component must contain exactly one TEXT and no CSEP. A field
with two or more components must contain the CSEP(s), and the TEXT of
each component is optional. So the fieldText rule needs to be split into
two alternatives:

fieldText : comp | ((comp? CSEP)+ comp?) ;

The attached grammar is slightly more complicated because I added an
imaginary token, EMPTY, to make empty fields and empty components
explicit in the resulting AST.
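
To make the intended result concrete, here is a small plain-Java sketch
(no ANTLR involved) that renders a line in the same parenthesized
notation that toStringTree() uses. The class ExpectedShape and its
methods describe and field are hypothetical names made up for this
illustration, and the string it builds is the shape the grammar is meant
to produce, not captured output from the generated parser.

```java
public class ExpectedShape {
    // Hypothetical helper: renders one line as "NAME (FIELD ...) (FIELD ...)".
    static String describe(String line) {
        // A limit of -1 keeps empty strings, so "UNB++x" gives ["UNB", "", "x"].
        String[] parts = line.trim().split("\\+", -1);
        StringBuilder sb = new StringBuilder(parts[0]); // leading TEXT, e.g. UNB
        for (int i = 1; i < parts.length; i++) {
            sb.append(' ').append(field(parts[i]));
        }
        return sb.toString();
    }

    // Hypothetical helper: renders the text between two '+' separators.
    static String field(String text) {
        if (text.isEmpty()) {
            return "(FIELD EMPTY)"; // blank field, no components at all
        }
        StringBuilder sb = new StringBuilder("(FIELD");
        for (String c : text.split(":", -1)) {
            // A blank component still counts, it just carries EMPTY.
            sb.append(c.isEmpty() ? " (COMP EMPTY)" : " (COMP " + c + ")");
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("UNB++::+123\n"));
    }
}
```

Splitting by hand like this only works for such a simple format, but it
is handy for checking the tree the parser prints against what you
expect for a given line.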

Hope this helps....
   -jbb

-------------- next part --------------
grammar Test;

options {
    output = AST;
    ASTLabelType = CommonTree;
}

tokens { FIELD; COMP; EMPTY; }

@members {
    // Sample lines covering empty fields and empty components in
    // various positions.
    private static final String [] samples = new String[]{
       "UNB++::+123\n",
       "UNB++::z+123\n",
       "UNB++x::z+123\n",
       "UNB++:y:+123\n",
       "UNB++x:y:z+123\n"
    };

    // Parse each sample line and print the resulting AST.
    public static void main(String [] args) {
        for( int i = 0; i < samples.length; ++i ) {
            try {
                System.out.println("about to parse:`"+samples[i]+"`");
                TestLexer lexer = new TestLexer(new ANTLRStringStream(samples[i]));
                CommonTokenStream tokens = new CommonTokenStream(lexer);

                TestParser parser = new TestParser(tokens);
                TestParser.start_return p_result = parser.start();

                CommonTree ast = p_result.tree;
                if( ast == null ) {
                   System.out.println("resultant tree: is NULL");
                } else {
                   System.out.println("resultant tree: " + ast.toStringTree());
                }
                System.out.println();
            } catch(Exception e) {
                e.printStackTrace();
            }
        }
    }
}

start : line EOF!;

line          : TEXT fields ;

fields        : field* terminator! ;

// make the fact that a field is empty explicit.
// replaced this rule--- field : SEP t=fieldText? -> ^(FIELD $t?) ;
// with---
field         : SEP! opt_fieldText ;

opt_fieldText : (t=fieldText -> ^(FIELD $t))
              | (/*empty*/ -> ^(FIELD EMPTY)) ; 

fieldText     : comp // single component, value MUST be present

              | ((opt_comp CSEP!)+ opt_comp) ; // 2 or more components,
                                               // each possibly empty

opt_comp      : comp
              | (/*empty*/ -> ^(COMP EMPTY)) ;

comp          : t=TEXT -> ^(COMP $t) ;

terminator : NL;
SEP : '+' ;
CSEP : ':' ;
TEXT : ('a'..'z'|'A'..'Z'|'0'..'9')+ ;

NL : ('\r'|'\n')+;