[antlr-interest] trouble with lexer prediction DFA

Tue Dec 22 18:32:49 PST 2009

Hello,

I am having some trouble with my lexer. I am trying to build a lexer 
that handles source text fed to it one line at a time (this is how 
Visual Studio works with language services, and also what Sam Harwell 
has been doing). Several kinds of tokens can be split across lines, 
among them the C-style comment: /*foo*/. My main lexer finds the start 
of a /* comment, and then I switch to another lexer to identify the 
continuation or end of it. I tried using gated semantic predicates to 
turn on parts of my grammar when inside a multiline comment; that did 
not work too well either, but that's another story. I am using ANTLR 
version 3.2 from Sep 23rd.

The following grammar produces an mTokens() prediction DFA that loops 
forever when given the test input '*/'. I assume this is a bug and 
unintended behavior. Or is my understanding of ANTLR lacking (in which 
case an explanation would be appreciated)?

I tested with three different targets (Java, CSharp2, and Sam's 
CSharp3), and they all loop forever. Turning greedy and backtracking on 
or off doesn't seem to help; I still get a bad mTokens() rule. If I call 
the rules individually, through mENDMULTILINECOMMENT() or 
mCONTINUEMULTILINECOMMENT(), they seem to work as expected.
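
For context, the line-at-a-time setup described above works roughly like 
the sketch below. This is plain Java, not my actual lexer and not 
generated ANTLR code; the class name, the method name and the token 
labels are all made up for illustration. The only point is that a single 
piece of state (are we inside a /* comment?) has to survive from one 
line to the next, and that state decides which set of rules applies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: classify pieces of each line while remembering
// across calls whether we are inside a /* ... */ comment.
class LineAtATimeSketch {
    // State carried from one line to the next.
    private boolean inComment = false;

    // Returns made-up labels for the pieces found in one line.
    List<String> scanLine(String line) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < line.length()) {
            if (inComment) {
                // Comment mode: look only for the end of the comment.
                int end = line.indexOf("*/", pos);
                if (end < 0) {
                    out.add("CONTINUE_COMMENT");
                    pos = line.length();
                } else {
                    out.add("END_COMMENT");
                    pos = end + 2;
                    inComment = false;
                }
            } else {
                // Main mode: everything up to the next "/*" is ordinary code.
                int start = line.indexOf("/*", pos);
                if (start < 0) {
                    out.add("CODE");
                    pos = line.length();
                } else {
                    if (start > pos) {
                        out.add("CODE");
                    }
                    out.add("START_COMMENT");
                    pos = start + 2;
                    inComment = true; // this is where I switch lexers
                }
            }
        }
        return out;
    }
}
```

Feeding it "int x; /* begin", then "still inside", then "done */ y = 1;" 
shows the comment state carrying over between the three calls.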

In English, what I want the grammar to do, and what I think it should be 
doing:
ENDMULTILINECOMMENT: match zero or more of ('*' not followed by '/', or 
any character other than '*', end of line, or end of file), followed by 
'*/'
CONTINUEMULTILINECOMMENT: match zero or more of ('*' not followed by 
'/', or any character other than '*', end of line, or end of file), 
followed by end of line
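
To pin down the intent described above, here is a rough equivalent 
written with java.util.regex instead of ANTLR. It is only a sketch of 
what I expect the two rules to accept, not a replacement for the grammar; 
the class and method names are invented for this example. One known 
difference: the lookahead here also accepts a '*' that is the very last 
character of the input, whereas the predicate ('*' ~'/') needs a 
character after it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the intended matching, using regex rather than ANTLR.
class CommentIntent {
    // End-of-line and end-of-file characters, as in the two fragment rules.
    // The backslashes are doubled so that the regex engine, not the Java
    // compiler, interprets the \u escapes.
    private static final String EOL = "\\n\\u2028\\u2029";
    private static final String EOF = "\\u0000\\u001A";

    // Zero or more of: '*' not followed by '/', or any char except '*', EOL, EOF.
    private static final String BODY = "(?:\\*(?!/)|[^*" + EOL + EOF + "])*";

    private static final Pattern END = Pattern.compile(BODY + "\\*/");
    private static final Pattern CONTINUE = Pattern.compile(BODY + "[" + EOL + "]");

    // Number of characters matched from the start of the input, or -1 if none.
    static int matchEnd(String input) {
        Matcher m = END.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    static int matchContinue(String input) {
        Matcher m = CONTINUE.matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }
}
```

Under this reading, matchEnd("*/") returns 2 and matchContinue("*/") 
returns -1, which is the behavior I was hoping to get from mTokens() for 
the same input.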

Regardless, ANTLR is really cool and the rest of my lexer works well. 
Thanks to Terence and everyone else who created it.

Thanks in advance,
Daniel

lexer grammar CommentLexer;
options {
language=Java;
}

ENDMULTILINECOMMENT
     :    (options{greedy=false;}:
             ('*' ~'/')=> '*'
             | ~('*' | ENDOFLINEFRAGMENT | ENDOFFILEFRAGMENT))*
         '*/'
     ;

CONTINUEMULTILINECOMMENT
     :    (options{greedy=false;}:
             ('*' ~'/')=> '*'
             | ~('*' | ENDOFLINEFRAGMENT | ENDOFFILEFRAGMENT))*
         ENDOFLINEFRAGMENT
     ;

fragment
ENDOFLINEFRAGMENT
     :    '\n' | '\u2029' | '\u2028'
     ;

fragment
ENDOFFILEFRAGMENT
     :    ('\u0000' | '\u001A')
     ;