[antlr-interest] Newbie conundrums

Tue Aug 21 14:20:38 PDT 2007

Hi all,

I'm fairly new to ANTLR and parser generators in general, so maybe my
problems are obvious ones. Any feedback is much appreciated though :-)

Consider a text consisting of three blocks. The first is a code of
exactly 30 digits. This is followed by a time in the hh:mm:ss format.
Finally there is a number of unspecified length. The goal is to get
all three values.

(Note: I know that ANTLR is overkill for this problem. This is pure exercise.)

The grammar below describes this text:

grammar Silly;

line : CODE TIME NUMBER ;

CODE : D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D ;
TIME : D D C D D C D D ;
NUMBER : D+ ;

fragment D : '0'..'9' ;
fragment C : ':' ;

Now, there are two things about this grammar that I find awkward and
one thing that really surprises me.

The first awkward thing is that I have to specify each individual
digit for the CODE rule. This would get really cumbersome if I needed
a really large number of digits. Is there maybe a regex-like notation
to specify the number of tokens? (In most regex dialects you can say
d{30} to specify 30 occurrances of 'd'.)

The second awkward thing is that this grammar doesn't work :-)

If I test it (in ANTLRWorks) on
"01234567890123456789012345678912:59:300123456789012345678901234567",
then everything before the ":" is matched by the NUMBER rule. Of
course this is completely logical for the lexer to do, but it still
sucks. The filter option (and giving precedence to CODE over NUMBER)
doesn't really solve the problem either, since it's perfectly valid to
have the number be of length 30 too, and in that scenario the number
would result in a CODE token.

What's the way to go about this? The context makes the choice
unambiguous, but to take that into account I have to pull CODE and
NUMBER into the parser and have the lexer produce single-digit tokens.
That sounds rather ugly. And not particularly efficient either :-(

Is there a better way? (Predicates?)

Finally, the surprising thing is that if I try this parser (again in
ANTLRWorks) on the following input below (notice the "fff" between the
code and the time), then it parses code, time, and number, without
complain, ignoring the "ffff" in the process. Is this automatic error
correction? If so, can I turn on warnings somewhere?

The input is: "012345678901234567890123456789fff12:59:300123456789012345678901234567"

Sorry if these issues are obvious or if I've done something stupid.
Much obliged for any feedback :-)

Cheers!
JW