[antlr-interest] Lexer bug? (with test cases!)

Wed Oct 24 04:09:46 PDT 2007

At 23:05 24/10/2007, Terence Parr wrote:
 >The issue is that there is an implied (ANYTOKEN)* and really
 >wildcard following any rule.  I can look at it again sometime,
 >but i am unsure how to make antlr see farther into the fog of
 >"there are infinite unspecified char next".

But it doesn't need to care about that, surely?  Once it's entered 
a rule, it's happy that it will be able to produce a valid token 
(assuming the lookahead was sufficient to guarantee this), so the 
only bits it needs to worry about are embedded optional bits, 
which simply have a "match exactly or exit the rule" 
behaviour.  No need to look "past" the end of the token -- as soon 
as you see something that can't possibly match any of the 
alternatives currently "in play", then you exit the rule, having 
consumed only those characters of input that validly matched, 
leaving the stuff you couldn't match to try matching against a 
different rule later on.

 >Can you tell me how to compute the static LL lookahead?  If not, 

 >then trust that I know it's not possible or at least too much 
for my
 >brain. ;)  It's the fog that causes the problem for LL(*).

This is just off the cuff, so there are quite likely some holes in 
it, but:

First, expand and inline all subrule references.  Then classify 
the rule like so:
1. No optional components:
   'foo'
   Required lookahead = length of string, ie. 3
2. Optional components:
   'bar'?
   ('a' | 'b')*
   Required lookahead = 0 (this is illegal at the top level, but 
is valid as a subclause)
3. At-least-one loops (+ loop, required loop)
   ('a' | 'b')+
   Required lookahead = required lookahead of subclause, ie. 1
4. Alternates:
   'foo' | 'tu'
   Required lookahead = up to max(lookahead of each subclause)
   (This can be done incrementally; ie. in this example you first 
look at the first character and see if it's 'f', 't', or exit 
rule.  Once you know that, you can then match the rest of the 
string with 2 lookahead if it was an 'f' or 1 lookahead if it was 
a 't'.)
5. Basic sequentials:
   'foo' 'bar'
   Required lookahead = required lookahead of first + required 
lookahead of second
6. Any required element (nonzero lookahead) following a required 
loop:
   'a'+ 'b'
   Required lookahead = infinite

This process is recursive, with each subclause being similarly 
classifiable.

So for example, given the rule "DIGIT+ ('.' DIGIT+)?", this is 
broken down like so:
   - first expand to "('0'..'9')+ ('.' ('0'..'9')+)?"
   - top level: + loop, optional component
     - lookahead = (lookahead of subclause "'0'..'9'") + 0
   - subclause "'0'..'9'": single character range
     - lookahead = 1

Therefore the top level lookahead is 1 -- the Tokens rule only has 
to look at one character (to see if it's in '0'..'9') before it 
can unambiguously declare the input to be a NUMBER.

Next up, once inside NUMBER, we match as many digits as we 
can.  Once we encounter a non-digit, the + loop is complete and 
it's time to move on to the next subclause.  Since this is an 
optional clause, it is permitted to be entirely absent.
   - subclause "'.' ('0'..'9')+": single character, + loop
     - lookahead = 1 + (lookahead of subclause "'0'..'9'")
                 = 1 + 1
                 = 2

So we must look ahead two characters and verify that they are 
indeed a dot and a digit before we can unambiguously declare this 
subclause to be present.  (If they're not a dot and a digit, then 
the subclause is not present and we shouldn't consume anything 
more.)

And that's it -- we've figured out this whole rule now.  For a 
more problematic example:

   - rule "('0'..'9')+ '.' ('0'..'9')+"
     - minor variation on the last one, but it makes this a much 
more complicated case
     - + loop, single character, + loop
     - lookahead = infinite (by rule 6), then 1 + 1 = 2

Here there are no subclauses (other than "'0'..'9'", which is 
trivial) to break down, and we can't do any better than infinite 
lookahead.  Before we can decide that we've even got a NUMBER 
token we must be prepared to scan past a potentially infinite 
series of digits; as soon as we encounter any non-digit, though 
(terminating the + loop) the "infinite" portion of the lookahead 
is complete and we're left with a simple two-character lookahead 
decision again -- if it's a dot and another digit then it's a 
NUMBER, otherwise it isn't (and we need to avoid consuming any of 
those initial digits and try a different lexer rule instead).

 >Have you built a lex like creature before?  I have back in 1988.
 >I believe I'm remembering how it operates correctly.

No; I've written a few hand-rolled parsers/lexers before, but not 
a parser/lexer generator.  I bow to your experience there, but 
that doesn't mean I can't say when I think there might be a 
"better" way of doing something :)