[antlr-interest] matching the longest token in the lexer

Wed Dec 26 12:40:02 PST 2001

Howdy,

I have resolved the biggest remaining issue in my 'to do' list for 
2.7.2: how to handle an ambiguous lexer construct that people expect to 
be matched properly.  Recall the discussion concerning k=3 problem where 
input "**" (two stars) did not work as expected; i.e., it did not lex as 
two STAR tokens, but rather ANTLR claimed a failed TRIPLE_STAR:

STAR : '*' ;
TRIPLE_STAR : "***" ;

Here's a piece of our conversation:

> > Terence grunted:
> > Heh, I just realized that it's ambiguous not nondeterministic.  I mean
> > that *** is both *** token and 3 * tokens.  The language itself is
> > ambiguous.
>
> "Bogdan Mitu" <bogdan at nobugconsulting.ro> hath spoken:
> Ter, the "stars" language is not more ambiguous then Java or many other
> languages. In Java we have ">>>", which is both >>> and 3 > tokens, to 
> quote
> you ;-). The salvation is that ">>" is also a token, if we remove it, 
> the
> rules '>' and ">>>" would have to be rewritten.

So, the STAR vs TRIPLE_STAR language is strictly speaking ambiguous, but 
so is if-then-else and > vs >>> etc... in Java.  In those cases, the 
parser/lexer just happen to work properly by "matching input as soon as 
possible."

In the STAR case, you need to resolve the ambiguity by having the 
language manual and lexer say "match the longest possible char sequence 
(token)", which is not what ANTLR does by default in all cases.  The 
"match as soon as possible" rule often coincides with this rule and 
therefore works most of the time.

You expect that k=3 should see *** and say TRIPLE_STAR not 3 STAR 
tokens.  DFA based lexers such as lex, flex, etc... all implement a form 
of backtracking to do this "match longest" thing.  As the characters 
direct the automaton, at least the most recent "accept state" is tracked 
and lexing continues if the input would allow another transition.  If 
that continuation fails, the lexer simply rolls back to the last known 
good accept state.  So, DFA lexers backtrack to solve this problem.

In principle, finite static lookahead of k=3 can solve this problem as 
in the patch sent to me by Bogdan himself.  Unfortunately, it only fixes 
the degenerate case of "match longest" where an alternative is a proper 
substring of the another.  I feel strongly that it would surprise people 
that it worked there and not in other nearly cases.  I also worry about 
changes to the analysis as I have no good set of unit tests for it. :(

My philosophy with ANTLR has always been to "let people say what they 
mean" and not try to do too much automatic stuff that makes ANTLR either 
inefficient in many cases or simply mysterious.  [The other day, I 
flicked an inappropriate sequence of buttons on my Volvo remote entry 
thing and it entered an unanticipated state in the car's automaton.  It 
took me 15 minutes to figure out how to get the stupid car into a known 
state and to get the interior lights to shut off etc...  One would go 
off but not the other and then vice versa.  I hate things that try to be 
smart with all of this automatic light stuff.  If I want the lights on, 
by gawd, I'll turn the bastards on!].

Ahem...where was I? ;)  Oh right.  Fortunately, you can simply tell 
ANTLR to do what other lexers do in this situation: backtrack.

STAR
     :    ("***") => "***" {$setType(TRIPLE_STAR);}
     |    '*'
     ;

Now, you are really explicitly stating that you intend 3 stars to match 
TRIPLE_STAR, a formal means to resolve the language ambiguity.  The best 
part is that ANTLR will eventually handle these simple, fixed syn preds 
by using finite lookahead (i.e., w/o backtracking).  Regardless, ANTLR 
generates something that is pretty efficient already.  In the case of 
single star, it immediately jumps to alt 2.  In case of two stars, it 
enters a try block to match ***, which fails costing us an exception 
throw; it falls to alt 2, matching one of the stars.  For 3 stars, it 
costs us two match("***") calls but no exception.  It works and is not 
that expensive. :)

Here is what ANTLR (partially) generates:

     boolean synPredMatched3 = false;
     if (((LA(1)=='*') && (LA(2)=='*'))) {
       int _m3 = mark();
       synPredMatched3 = true;
       inputState.guessing++;
       try {
         match("***");
       }
       catch (RecognitionException pe) {
         synPredMatched3 = false;
       }
       rewind(_m3);
       inputState.guessing--;
     }
     if ( synPredMatched3 ) {
       match("***");
       if ( inputState.guessing==0 ) {
         _ttype = TRIPLE_STAR;
       }
     }
     else if ((LA(1)=='*') && (true)) {
       match('*');
     }

In summary, after much thought, I have decided not to alter the analysis 
for this special case.

Best regards,
Terence
--
Chief Scientist & Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/