[antlr-interest] Parse 1 - N repeats

Mon Feb 8 06:06:25 PST 2010

Hi Adam,

You could handle it with plain programming logic (embedded actions) inside
your grammar. Here's a little demo:

grammar Test;

@parser::members {
  public static void main(String[] args) throws Exception {
    String text =
        "FIELD1\n"+
        "REPEATING_GROUP <fields=2> <min=0, max=20>\n"+
        "FIELD2\n"+
        "FIELD3\n"+
        "FIELD4";
    ANTLRStringStream in = new ANTLRStringStream(text);
    TestLexer lexer = new TestLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    new TestParser(tokens).parse();
  }

  class Repeat {

    final List<String> fieldList;
    final int fields;
    final int min;
    final int max;

    Repeat(int fields, int min, int max) {
        this.fieldList = new ArrayList<String>(fields);
        this.fields = fields;
        this.min = min;
        this.max = max;
    }

    boolean done() {
        return fieldList.size() == fields;
    }

    public String toString() {
      return String.format("fields=\%s, min=\%d, max=\%d", fieldList, min, max);
    }
  }
}

parse
  :  (  rp=repeat     {System.out.println("repeat :: "+$rp.r);}
     |  id=Identifier {System.out.println("field  :: "+$id.text);}
     )*
     EOF
  ;

repeat returns [Repeat r]
  :  Identifier '<' 'fields' '=' fields=Identifier '>'
     '<' 'min' '=' min=Identifier ',' 'max' '=' max=Identifier '>'
     {
       $r = new Repeat(Integer.valueOf($fields.text),
                       Integer.valueOf($min.text),
                       Integer.valueOf($max.text));
     }
     (id=Identifier {$r.fieldList.add($id.text); if($r.done()) return $r;})*
  ;

Identifier
  :  ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' )+
  ;

WhiteSpace
  :  ( ' ' | '\t' | '\r' | '\n' ) {skip();}
  ;

As you can see, as soon as the size of fieldList reaches the declared number
of fields, the rule returns $r (and no more id=Identifier tokens will be
"eaten"). When you compile and execute the TestParser class, the following is
printed:

field  :: FIELD1
repeat :: fields=[FIELD2, FIELD3], min=0, max=20
field  :: FIELD4
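
In case it helps to see the counting logic on its own, here is a minimal
plain-Java sketch of what the repeat rule does once the header has been
matched. It is not ANTLR-generated code: the class name RepeatSketch, the
fill() helper and the pre-split token list are assumptions made up for
illustration. One deliberate difference: this sketch checks done() before
consuming a token, so a group declared with fields=0 takes nothing, whereas
the grammar above checks only after adding a field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RepeatSketch {

    /** Hypothetical stand-in for the Repeat class from the grammar. */
    static final class Repeat {
        final List<String> fieldList = new ArrayList<String>();
        final int fields;

        Repeat(int fields) {
            this.fields = fields;
        }

        /** True once the declared number of fields has been collected. */
        boolean done() {
            return fieldList.size() >= fields;
        }
    }

    /**
     * Adds identifiers to r, starting at index 'from', until the group is
     * full or the input runs out. Returns the index of the first token that
     * was NOT consumed, i.e. where the enclosing rule would carry on.
     */
    static int fill(Repeat r, List<String> tokens, int from) {
        int i = from;
        while (i < tokens.size() && !r.done()) {
            r.fieldList.add(tokens.get(i));
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // The identifiers that follow the REPEATING_GROUP header in the demo input.
        List<String> tokens = Arrays.asList("FIELD2", "FIELD3", "FIELD4");
        Repeat r = new Repeat(2);
        int next = fill(r, tokens, 0);
        System.out.println("repeat :: " + r.fieldList);    // repeat :: [FIELD2, FIELD3]
        System.out.println("field  :: " + tokens.get(next)); // field  :: FIELD4
    }
}
```

The returned index plays the role of the token stream position: everything
from there on is left for the parse rule to match as ordinary fields.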

Regards,

Bart.

On Mon, Feb 8, 2010 at 1:56 PM, Adam Connelly
<adam.rpconnelly at googlemail.com> wrote:

> Hi,
>
> Sorry if this is answered elsewhere, but I'm not really sure what to search
> for.
>
> I'm trying to parse a language that includes repeating groups. The problem
> is that they don't include terminators, so you can't tell the difference
> between the last item in the group, and the next section. Here's an
> example:
>
> FIELD1
> REPEATING_GROUP   <fields=2> <min=0, max=20>
>    FIELD2
>    FIELD3
> FIELD4
> ...
>
> "fields" specifies the number of fields contained in the group. At the
> moment I've got the following rules, but the problem is that it means that
> the repeating group rule doesn't get its fields associated with it:
>
> recordDefinition
>    :    RECORD (IDENTIFIER | repeatingGroup)+
>    ;
>
> repeatingGroup
>    :    IDENTIFIER
>        '<' NUMBER_OF_FIELDS '=' fieldCount=NUMBER '>'
>        '<' NUMBER_OF_REPEATS '=' min=NUMBER ',' max=NUMBER '>'
>    ;
>
> Ideally I could do something like:
>
> repeatingGroup
>    :    IDENTIFIER
>        '<' NUMBER_OF_FIELDS '=' fieldCount=NUMBER '>'
>        '<' NUMBER_OF_REPEATS '=' min=NUMBER ',' max=NUMBER '>'
>        IDENTIFIER{1, $fieldCount}
>    ;
>
> But I know you can't do that. What would be the best way to go about
> parsing this? Could I build an AST and then modify it to put the
> identifiers for the repeating group in the right place?
>
> Cheers,
> Adam