[antlr-interest] Troubles lexing a decimal, (from an antlr beginner)
Wincent Colaiuta
win at wincent.com
Wed Jul 25 02:40:20 PDT 2007
El 25/7/2007, a las 1:42, Igor Murashkin escribió:
> Thanks for all the help. I used a syntactic predicate like Jim
> suggested and
> it seems to lex everything properly now. I wish I understood more
> academically why my original lexing syntax didn't work, does ANTLR
> not put
> the tokens back and backtrack when it fails to match a rule?
You hit the nail on the head right there.
>> DOT : '.' ;
>> INTEGER : Digit+;
>> DECIMAL : Digit+ '.' Digit+;
Reading that as a human you probably disambiguate those rules
automatically and just think, "If it looks like a string of digits
followed by a dot and some more digits then it's a decimal, otherwise
if it's just a string of digits then it's an integer, and if it's
just a dot then it's just a dot". So basically you are adding
syntactic predicates automatically without thinking about it, and
syntactic predicates are a form of backtracking.
But ANTLR does not backtrack unless you explicitly tell it to. I
think the main reason people don't expect this behaviour is that many
(myself included) have lived for years with Perl-compatible regular
expressions which *do* automatically backtrack if necessary in order
to produce a match.
So the first thing you have to do is jettison your ideas and try to
start with a completely blank slate. Forget everything you thought
you knew and figure out how ANTLR "thinks". In general, ANTLR's
strategy consists of two phases: firstly trying to cheaply and
quickly predict what will match and secondly actually matching it.
Backtracking is avoided at all costs, and ANTLR always tries to use
the minimal "lookahead" necessary to make a prediction.
Things are further complicated by the fact that grammar mistakes can
manifest themselves at two points in time: firstly, ANTLR has to
analyse your grammar at compile time in order to figure out what
prediction machinery to use, so that means you have "compile"-time
analysis errors (ambiguities, unreachable alternatives, non-LL
decisions etc); but you could also have runtime errors (where even
for correctly-formed input, the recognizer doesn't do what you think
it should do). So it's a complicated and subtle problem.
Get and read the ANTLR book if you can, and experiment with simple
test cases in ANTLRWorks and study the generated code until you feel
that you fully understand what ANTLR is doing under the covers.
Cheers,
Wincent
More information about the antlr-interest
mailing list