[antlr-interest] Trouble with nondeterminism

Ric Klaren ric.klaren at gmail.com
Tue Aug 29 04:21:08 PDT 2006


On 8/29/06, John Gruenenfelder <johng at as.arizona.edu> wrote:
> Okay, that certainly makes sense.  I guess my question is what is the best way
> to resolve these problems with common left prefixes, as one gets with
> different numerical types.

Left factoring is the solution. (or using predicates but that's a only
if nothing else helps solution)

One thing to keep in mind with lexers is that the lexer is made up of
a big automatically generated nextToken rule that gives an alternative
for all the rules in the lexer. (Hence the ambiguity warnings between
int/double)

> After re-reading the section in the reference manual about predicates, I have
> to following which generates no warnings from ANTLR:
>
> Constant
>     :   ( ('0'..'9')+ '.') => ('0'..'9')+ '.' ('0'..'9')* (Exponent)?
>             { $setType(DOUBLE); }
>     |   '.' ('0'..'9')+ (Exponent)?
>             { $setType(DOUBLE); }
>     |   ( ('0'..'9')+ ('e' | 'E')) => ('0'..'9')+ Exponent
>             { $setType(DOUBLE); }
>     |   ('0' | ( '1'..'9' ('0'..'9')* ))
>             { $setType(INT); }
>     ;
>
> Is that a "good" method for dealing with this problem?

It works... but for every invocation of the constant rule it will for
the first alternative read input untill ( ('0'..'9')+ '.') => .. is
satisfied or failed. In the succes case it will rollback the input and
enter the first alternative. In the failed case it will attempt to
match the '.' for the 2nd alternative if that does not work it will
try the predicate ( ('0'..'9')+ ('e' | 'E')) => .. read input untill
the predicate fails or succeeds. If it succeeds it will roll back the
input and enter the alternative. If it fails it will rollback the
input and enter the last alternative.

=> introduces a trial and error run of the input (or backtracking).
E.g. it's very expensive in terms of efficiency. In terms of
readability or speed of implementing it's pretty efficient ;)

The other solution to this problem is to left factor the rule:

Constant:
   { $setType(INT); /* int untill proven otherwise */ }
   ('0' | ( '1'..'9' ('0'..'9')* ))
   (
         (  '.' ('0'..'9')* (Exponent)? )
         { $setType(DOUBLE); }
      |  Exponent
         { $setType(DOUBLE); }
   )?
   |   '.' ('0'..'9')+ (Exponent)?
      { $setType(DOUBLE); }
   ;

This should do the trick I think (unless I made a type/thinko)

Left factoring usually leads to less readable code. But it's a lot
faster. In a lexer you'd want to try to have no => constructs. But
during prototyping it's quite handy.

> I must also say that
> even after reading that section and hacking together the above rules, I still
> don't really understand how these predicate rules help ANTLR do its job.  It
> still must grab the characters one at a time.  How do the predicates make the
> task easier or, at the very least, unambiguous?

One thing to understanding antlr is reading the generated code (no
worries it's quite readable). For kicks safe the code generated for
the backtracking version then generate the code for the left factored
version and compare. (or generate code with -traceLexer and then
compile and run) )

Once you get a feel for it it's really intuitive.

Cheers,

Ric


More information about the antlr-interest mailing list