[antlr-interest] Re: Regular expression "repetition"

Mon May 17 14:05:23 PDT 2004

> This will probably do when the number of repetitions are low - but I
> am facing a problem with r{0,63} and I hope there is another way :-)

Well, just to be geeky, there is this approach:

r1: (options{greedy=true;}: r)? ;
r2: r1 r1 ;
r4: r2 r2 ;
r8: r4 r4 ;
r16: r8 r8 ;
r32: r16 r16;
r63: r32 r16 r8 r4 r2 r1;

Yes - this runs through ANTLR without a warning!  And it generalizes to 
any range of numbers for the repeat: for r{x,y}, write x mandatory 
copies of r followed by y-x optional ones built the same way.
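
For anyone who would rather not write the doubling rules by hand, here 
is a rough sketch of a generator for the r{0,n} case.  The function 
name optional_repeat_rules and the exact spacing of its output are my 
own choices for illustration, and only the n=63 rule set above is known 
to go through ANTLR cleanly; treat other values as untested.

```python
def optional_repeat_rules(n, base="r"):
    """Return ANTLR 2 style rule lines matching `base` between 0 and n times.

    Hypothetical helper: it only builds text, it does not invoke ANTLR.
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # <base>1 matches zero or one occurrence of the base rule.
    rules = [f"{base}1: (options{{greedy=true;}}: {base})? ;"]
    powers = [1]

    # Each doubling rule matches up to twice as many occurrences as the last.
    while powers[-1] * 2 <= n:
        half = powers[-1]
        powers.append(half * 2)
        rules.append(f"{base}{half * 2}: {base}{half} {base}{half} ;")

    # Compose n from its binary digits, largest power first.
    parts = [p for p in reversed(powers) if n & p]
    if len(parts) > 1:
        body = " ".join(f"{base}{p}" for p in parts)
        rules.append(f"{base}{n}: {body} ;")
    return rules


if __name__ == "__main__":
    print("\n".join(optional_repeat_rules(63)))
```

When n is itself a power of two, the last doubling rule already is the 
answer, so no extra composing rule is emitted.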

BUT - When people ask for r{x,y}, I always wonder if that is really 
what their grammar wants.  Consider this fragment of a grammar for 
reading byte values, assuming we had the r{x,y} syntax:

bytes: (BYTE)+ ;

BYTE: DIGIT{1,3} ;
protected DIGIT: '0'..'9' ;
WS: (' ' | '\t')+ { $setType(SKIP); } ;
NL: '\n' { newLine(); $setType(SKIP); } ;

Someone too cleverly spec'd the values to be between one and three 
decimal digits because that is what fits in a byte.  This doesn't work 
well in practice:
	"1"    --> [ 1 ]      parses as one byte
	"1 2"  --> [ 1, 2 ]   parses as two
	"12"   --> [ 12 ]     of course this parses as one
	"123"  --> [ 123 ]    ditto
	"1234" --> [ 123, 4 ] is this what any user would expect?
Really, any user expects to see a parse error: "1234, value too big for 
a byte".
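
The splitting behavior in the table above can be reproduced outside 
ANTLR.  This is only an emulation using Python's re module, on the 
assumption that a greedy DIGIT{1,3} token behaves like the regular 
expression [0-9]{1,3}; the name lex_bytes is made up for the example.

```python
import re


def lex_bytes(text):
    """Emulate a greedy BYTE: DIGIT{1,3} lexer rule, skipping other characters."""
    # findall scans left to right and takes up to three digits per match,
    # so a fourth digit silently starts a new token.
    return [int(tok) for tok in re.findall(r"[0-9]{1,3}", text)]


if __name__ == "__main__":
    for sample in ["1", "1 2", "12", "123", "1234"]:
        print(repr(sample), "-->", lex_bytes(sample))
```

Note that nothing here complains about "1234"; the extra digit just 
becomes a second value.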

In this case, the {1,3} is really expressing a semantic constraint 
(values must fit in bytes), not a syntactic one.  Trying to write 
semantic constraints as syntactic ones rarely works.  In the case of 
the byte example you can see easily how it fails: "456" parses, but 
doesn't fit in a byte, and changing the grammar so it parses as [45, 6] 
is just plain perverse and sure to vex your users.

I have found that it is often much more useful, both for the grammar 
and for the user, to express size limits (on characters in identifiers, 
on digits in numbers, or on repeats of some rule) as semantic 
constraints: write the grammar to accept any number at all, and then 
generate an error for the user if the limits are exceeded or not met.  
Consider:

ID: LETTER{1,8} ;
protected LETTER: 'a'..'z' ;

Does anyone expect "subtotals" to parse as two IDs?
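
To make the "accept anything, then check" idea concrete for the byte 
example, here is a minimal sketch in Python rather than in ANTLR 
actions.  The function name read_bytes and the wording of the first 
error message are assumptions for illustration; the point is only that 
the digit count is unrestricted and the range check happens afterwards.

```python
import re


def read_bytes(text):
    """Read whitespace-separated decimal values, rejecting any above 255."""
    values = []
    for match in re.finditer(r"\S+", text):
        token = match.group()
        # Syntactic check: any number of digits is accepted.
        if not re.fullmatch(r"[0-9]+", token):
            raise ValueError(f"{token!r}: not a decimal number")
        # Semantic check: the value, not the digit count, must fit in a byte.
        value = int(token)
        if value > 255:
            raise ValueError(f"{token}, value too big for a byte")
        values.append(value)
    return values


if __name__ == "__main__":
    print(read_bytes("1 2 123"))
    try:
        read_bytes("1234")
    except ValueError as err:
        print("error:", err)
```

With this arrangement "456" and "1234" are both reported as errors 
instead of being accepted or split, and an identifier length limit 
could be checked the same way after matching one or more letters.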

In your case, have you considered what a run of 64 r structures should 
be?  Is it just an error, or is it really a structure of 63 r's followed 
by 1 r?

	- Mark

Mark Lentczner
markl at wheatfarm.org
http://www.wheatfarm.org/
