[antlr-interest] A very basic grammar--and I'm confused!

Mon Aug 18 13:33:53 PDT 2008

At 04:00 19/08/2008, Richard Steele wrote:
 >Generally speaking, this file format has a syntax that looks
 >something like this (represented as pseudo-antlr):
 >
 >r: INT {two digits}? (INT {three digits}? VALUE END_OF_FIELD)+
 >NEWLINE;
 >END_OF_FIELD: ';';
 >INT: '0'..'9';
 >VALUE: 'A'..'Z' | 'a'..'z' | '0'..'9' | ' ';
 >
 >How do I express the length requirements to the lexer/parser?  As
 >you pointed out, since the rule for VALUE is a superset of the
 >rule for INT, it's sucking up the largest text fragment.

Well, actually it won't for the rules above, since you left out 
the +s, so INT and VALUE each match only a single character :)

There isn't really any way in ANTLR to specify cardinalities other 
than optional (?), zero-or-more (*) and one-or-more (+), short of 
simply repeating a fragment.  So you *could* do something like this:

fragment DIGIT: '0'..'9';
fragment THREEDIGITS: DIGIT+;   // this is just to define the token name
TWODIGITS
   : DIGIT DIGIT
     (DIGIT { $type = THREEDIGITS; })?
   ;
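
As a rough illustration of how that rule behaves, here is a small 
Python analogue (not ANTLR itself; the name lex_number is made up for 
this sketch).  It matches two digits, greedily takes a third if one 
is present, and picks the token type from the length:

```python
import re

# Two digits, optionally followed by a third (greedy, like the optional subrule).
_NUMBER = re.compile(r"[0-9][0-9][0-9]?")

def lex_number(text, pos=0):
    """Return (token_type, lexeme) for the digit run at pos, or None."""
    m = _NUMBER.match(text, pos)
    if m is None:
        return None
    lexeme = m.group(0)
    # Mirrors the { $type = THREEDIGITS; } action: the third digit flips the type.
    token_type = "THREEDIGITS" if len(lexeme) == 3 else "TWODIGITS"
    return token_type, lexeme

print(lex_number("12;"))    # ('TWODIGITS', '12')
print(lex_number("123;"))   # ('THREEDIGITS', '123')
print(lex_number("12345"))  # ('THREEDIGITS', '123')
```

Note the last call: when a two-digit number is immediately followed 
by more digits, the greedy match grabs three of them, which is 
exactly the situation in a format with no separators between numbers.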

But that sort of thing gets messy quickly.  If you don't have any 
kind of distinctive end-of-number markers (such as whitespace), 
then you might be better off not using ANTLR at all; it's not 
really designed for simplistic parsing problems like this.
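
For comparison, a hand-written parser for this kind of fixed-width 
record can be quite short.  The following is only a sketch in Python, 
assuming the layout from the question (two digits, then one or more 
fields of three digits plus a value, each ended by ';'); the names 
parse_record and _FIELD are invented for the example, and the numbers 
are kept as strings so leading zeros survive:

```python
import re

# One field: exactly three digits, then one or more value characters.
_FIELD = re.compile(r"([0-9]{3})([A-Za-z0-9 ]+)")

def parse_record(line):
    """Parse one record into (two_digits, [(three_digits, value), ...])."""
    line = line.rstrip("\n")
    head = line[:2]
    if not re.fullmatch(r"[0-9]{2}", head):
        raise ValueError("record must start with two digits")
    body = line[2:]
    if not body.endswith(";"):
        raise ValueError("every field must end with ';'")
    fields = []
    # Drop the final ';' so split() does not produce a trailing empty field.
    for raw in body[:-1].split(";"):
        m = _FIELD.fullmatch(raw)
        if m is None:
            raise ValueError(f"malformed field: {raw!r}")
        fields.append((m.group(1), m.group(2)))
    return head, fields

print(parse_record("42001Hello;002World 7;\n"))
```

Because the ';' terminator is unambiguous, splitting on it first and 
then checking each field's fixed-width prefix avoids the question of 
where a number ends and a value begins.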

However one way that you could do it in ANTLR would be to create 
one token per character at lexing time and leave everything else 
up to the parser:

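// NEWLINE ends the record, as in your original rule.
r: twoDigitInt (threeDigitInt value END_OF_FIELD)+ NEWLINE;

twoDigitInt: DIGIT DIGIT;
threeDigitInt: DIGIT DIGIT DIGIT;
value: (ALPHANUM | DIGIT)+;

// Every lexer rule matches exactly one character (apart from a CRLF newline).
END_OF_FIELD: ';';
DIGIT: '0'..'9';
ALPHANUM: 'A'..'Z' | 'a'..'z' | ' ';   // letters and space; digits are DIGIT
NEWLINE: '\r'? '\n';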
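
To make the division of labour concrete, here is a sketch of the same 
idea in Python rather than ANTLR: a lexer that emits one token per 
character and a small recursive-descent parser that does the 
counting.  The class and method names are invented for illustration, 
and it assumes the trailing newline has already been stripped from 
the record:

```python
def tokenize(text):
    """Yield one (type, char) token per input character."""
    for ch in text:
        if ch == ";":
            yield ("END_OF_FIELD", ch)
        elif ch in "0123456789":
            yield ("DIGIT", ch)
        elif ch == " " or "A" <= ch <= "Z" or "a" <= ch <= "z":
            yield ("ALPHANUM", ch)
        else:
            raise ValueError(f"unexpected character: {ch!r}")

class Parser:
    def __init__(self, text):
        self.tokens = list(tokenize(text))
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def match(self, kind):
        if self.peek() != kind:
            raise ValueError(f"expected {kind} at token {self.pos}, found {self.peek()}")
        ch = self.tokens[self.pos][1]
        self.pos += 1
        return ch

    def digits(self, count):
        # twoDigitInt / threeDigitInt: exactly `count` DIGIT tokens.
        return "".join(self.match("DIGIT") for _ in range(count))

    def value(self):
        # value: (ALPHANUM | DIGIT)+
        chars = []
        while self.peek() in ("ALPHANUM", "DIGIT"):
            chars.append(self.match(self.peek()))
        if not chars:
            raise ValueError(f"expected a value at token {self.pos}")
        return "".join(chars)

    def r(self):
        # r: twoDigitInt (threeDigitInt value END_OF_FIELD)+
        head = self.digits(2)
        fields = []
        while True:
            number = self.digits(3)
            text = self.value()
            self.match("END_OF_FIELD")
            fields.append((number, text))
            if self.peek() != "DIGIT":
                break
        if self.peek() != "EOF":
            raise ValueError(f"unexpected {self.peek()} at token {self.pos}")
        return head, fields

print(Parser("42001Hello;002World 7;").r())
```

The lexer never has to decide how long a number is; the parser simply 
consumes a fixed count of DIGIT tokens and treats everything up to 
the next ';' as the value.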