[antlr-interest] matching the longest token in the lexer
Terence Parr
parrt at jguru.com
Wed Dec 26 12:40:02 PST 2001
Howdy,
I have resolved the biggest remaining issue in my 'to do' list for
2.7.2: how to handle an ambiguous lexer construct that people expect to
be matched properly. Recall the discussion concerning k=3 problem where
input "**" (two stars) did not work as expected; i.e., it did not lex as
two STAR tokens, but rather ANTLR claimed a failed TRIPLE_STAR:
STAR : '*' ;
TRIPLE_STAR : "***" ;
Here's a piece of our conversation:
> > Terence grunted:
> > Heh, I just realized that it's ambiguous not nondeterministic. I mean
> > that *** is both *** token and 3 * tokens. The language itself is
> > ambiguous.
>
> "Bogdan Mitu" <bogdan at nobugconsulting.ro> hath spoken:
> Ter, the "stars" language is not more ambiguous then Java or many other
> languages. In Java we have ">>>", which is both >>> and 3 > tokens, to
> quote
> you ;-). The salvation is that ">>" is also a token, if we remove it,
> the
> rules '>' and ">>>" would have to be rewritten.
So, the STAR vs TRIPLE_STAR language is strictly speaking ambiguous, but
so is if-then-else and > vs >>> etc... in Java. In those cases, the
parser/lexer just happen to work properly by "matching input as soon as
possible."
In the STAR case, you need to resolve the ambiguity by having the
language manual and lexer say "match the longest possible char sequence
(token)", which is not what ANTLR does by default in all cases. The
"match as soon as possible" rule often coincides with this rule and
therefore works most of the time.
You expect that k=3 should see *** and say TRIPLE_STAR not 3 STAR
tokens. DFA based lexers such as lex, flex, etc... all implement a form
of backtracking to do this "match longest" thing. As the characters
direct the automaton, at least the most recent "accept state" is tracked
and lexing continues if the input would allow another transition. If
that continuation fails, the lexer simply rolls back to the last known
good accept state. So, DFA lexers backtrack to solve this problem.
In principle, finite static lookahead of k=3 can solve this problem as
in the patch sent to me by Bogdan himself. Unfortunately, it only fixes
the degenerate case of "match longest" where an alternative is a proper
substring of the another. I feel strongly that it would surprise people
that it worked there and not in other nearly cases. I also worry about
changes to the analysis as I have no good set of unit tests for it. :(
My philosophy with ANTLR has always been to "let people say what they
mean" and not try to do too much automatic stuff that makes ANTLR either
inefficient in many cases or simply mysterious. [The other day, I
flicked an inappropriate sequence of buttons on my Volvo remote entry
thing and it entered an unanticipated state in the car's automaton. It
took me 15 minutes to figure out how to get the stupid car into a known
state and to get the interior lights to shut off etc... One would go
off but not the other and then vice versa. I hate things that try to be
smart with all of this automatic light stuff. If I want the lights on,
by gawd, I'll turn the bastards on!].
Ahem...where was I? ;) Oh right. Fortunately, you can simply tell
ANTLR to do what other lexers do in this situation: backtrack.
STAR
: ("***") => "***" {$setType(TRIPLE_STAR);}
| '*'
;
Now, you are really explicitly stating that you intend 3 stars to match
TRIPLE_STAR, a formal means to resolve the language ambiguity. The best
part is that ANTLR will eventually handle these simple, fixed syn preds
by using finite lookahead (i.e., w/o backtracking). Regardless, ANTLR
generates something that is pretty efficient already. In the case of
single star, it immediately jumps to alt 2. In case of two stars, it
enters a try block to match ***, which fails costing us an exception
throw; it falls to alt 2, matching one of the stars. For 3 stars, it
costs us two match("***") calls but no exception. It works and is not
that expensive. :)
Here is what ANTLR (partially) generates:
boolean synPredMatched3 = false;
if (((LA(1)=='*') && (LA(2)=='*'))) {
int _m3 = mark();
synPredMatched3 = true;
inputState.guessing++;
try {
match("***");
}
catch (RecognitionException pe) {
synPredMatched3 = false;
}
rewind(_m3);
inputState.guessing--;
}
if ( synPredMatched3 ) {
match("***");
if ( inputState.guessing==0 ) {
_ttype = TRIPLE_STAR;
}
}
else if ((LA(1)=='*') && (true)) {
match('*');
}
In summary, after much thought, I have decided not to alter the analysis
for this special case.
Best regards,
Terence
--
Chief Scientist & Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
More information about the antlr-interest
mailing list