[antlr-interest] Troubles lexing a decimal, (from an antlr beginner)

Wed Jul 25 02:40:20 PDT 2007

El 25/7/2007, a las 1:42, Igor Murashkin escribió:

> Thanks for all the help. I used a syntactic predicate like Jim  
> suggested and
> it seems to lex everything properly now. I wish I understood  more
> academically why my original lexing syntax didn't work, does ANTLR  
> not put
> the tokens back and backtrack when it fails to match a rule?

You hit the nail on the head right there.

>> DOT        : '.'   ;
>> INTEGER    :    Digit+;
>> DECIMAL    :    Digit+ '.' Digit+;

Reading that as a human you probably disambiguate those rules  
automatically and just think, "If it looks like a string of digits  
followed by a dot and some more digits then it's a decimal, otherwise  
if it's just a string of digits then it's an integer, and if it's  
just a dot then it's just a dot". So basically you are adding  
syntactic predicates automatically without thinking about it, and  
syntactic predicates are a form of backtracking.

But ANTLR does not backtrack unless you explicitly tell it to. I  
think the main reason people don't expect this behaviour is that many  
(myself included) have lived for years with Perl-compatible regular  
expressions which *do* automatically backtrack if necessary in order  
to produce a match.

So the first thing you have to do is jettison your ideas and try to  
start with a completely blank slate. Forget everything you thought  
you knew and figure out how ANTLR "thinks". In general, ANTLR's  
strategy consists of two phases: firstly trying to cheaply and  
quickly predict what will match and secondly actually matching it.  
Backtracking is avoided at all costs, and ANTLR always tries to use  
the minimal "lookahead" necessary to make a prediction.

Things are further complicated by the fact that grammar mistakes can  
manifest themselves at two points in time: firstly, ANTLR has to  
analyse your grammar at compile time in order to figure out what  
prediction machinery to use, so that means you have "compile"-time  
analysis errors (ambiguities, unreachable alternatives, non-LL  
decisions etc); but you could also have runtime errors (where even  
for correctly-formed input, the recognizer doesn't do what you think  
it should do). So it's a complicated and subtle problem.

Get and read the ANTLR book if you can, and experiment with simple  
test cases in ANTLRWorks and study the generated code until you feel  
that you fully understand what ANTLR is doing under the covers.

Cheers,
Wincent