[antlr-interest] 2 or 3 (major?) bugs in ANTLR (C#)

Thu Jul 5 22:10:08 PDT 2007

No matter what I do I can't seem to make a lexer that works correctly.
Weird things have been happening constantly, starting with the
multi-megabyte lexer that I always end up with if I use LL(*). So I've
been using LL(4) but am still running into a number of problems.

I've been testing with the demo lexer below. Before I show it, I want
to point out what's going on with the backslash:

- in my language, backslash is an escape for producing identifiers out
of punctuation, so "\&" is an identifier, as is "\a\b\c" as well as
ordinary things like "abc".
- backslash is punctuation (PUNC) if followed by a space. The space is
NOT included in the token.

By the way, should these rules be equivalent (behaviorally)? I think
they should:

BUG: ('_' | {false}? . | {false}? .)+; // Version 1
BUG: ('_' | {false}? .)+; // Version 2
BUG: ('_')+; // Version 3

But as you'll see, three different behaviors may result. Another thing
I've noticed is that ANTLR seems to use more lookahead than necessary
to do prediction, and because of this is somehow more likely to make
incorrect predictions.

I've isolated a small set of rules to demonstrate some of the problems
I've been having. Note that I usually use predicates like

	  {input.LA(2) == ' ' || input.LA(2) == '\t'}? '\\'

instead of

	  ('\\' (' ' | '\t')) => '\\'

because as discussed recently, the latter does not work. And it seems
that often the former doesn't work either. I've included examples of
both apparently failing. Here's the grammar:

/////////////////////////////////////////////////////////////////////////
lexer grammar Bug1;

options
{
	language = 'CSharp';
	k = 4;
	TokenLabelType = CommonToken;
}

@lexer::members {
	public bool IsWS(int LA)        { int c = input.LA(LA); return c == '
' || c == '\t'; }
	public bool IsNewline(int LA)   { int c = input.LA(LA); return c ==
'\r' || c == '\n'; }
	public bool IsCtrlChar(int LA)  { int c = input.LA(LA); return c < 32; }
	public bool IsBackslash(int LA) { int c = input.LA(LA); return c == '\\'; }
}

WS: WS_CHAR+;
fragment WS_CHAR: ' ' | '\t';

BUG: ('_' | {false}? . | {false}? .)+; // Version 1
//BUG: ('_' | {false}? .)+; // Version 2
//BUG: ('_')+; // Version 3

ID: ID_LETTER+;
fragment ID_LETTER
	: ('a'..'z' | 'A'..'Z')
	| {IsBackslash(1) && !IsWS(2) && !IsCtrlChar(2)}?=> '\\' .
	;

PUNC: PUNC_CHAR+;
fragment PUNC_CHAR:
	  {IsWS(2)}? '\\'                     // Version 1
	//{IsBackslash(1) && IsWS(2)}?=> '\\' // Version 2
	//('\\' WS_CHAR)=>'\\'                // Version 3
	| (':'|'.'|'~'|'!'|'@'|'$'|'%'|'^'|'&'
	  |'*'|'-'|'+'|'='|'|'|','|'<'|'>'|'?');
/////////////////////////////////////////////////////////////////////////

Next, here's the testing code:

/////////////////////////////////////////////////////////////////////////
public static void Main(string[] args)
{
	ParseBug(@"ab");
	ParseBug(@"ab& ");
	ParseBug(@"ab&\");
	ParseBug(@"ab&\cd");
	ParseBug(@"ab\ cd");
}
static void ParseBug(string s)
{
	System.Console.WriteLine(s);
	ANTLRStringStream input = new ANTLRStringStream(s);
	Lexer lexerBug = new Bug1Lexer(input);
	IToken t;
	while ((t = lexerBug.NextToken()).Type != Bug1Lexer.EOF) {
		System.Console.WriteLine("{0} <{1}>",
			FooParser.tokenNames[t.Type], t.Text);
	}
	System.Console.WriteLine("");
}
/////////////////////////////////////////////////////////////////////////

N.B. All of these test strings should work, but they don't. ANTLR
gives no warnings. The output is:

ab
ID <ab>

ab&
line 1:0 required (...)+ loop did not match anything at character 'a'
line 1:1 required (...)+ loop did not match anything at character 'b'
PUNC <&>

ab&\
line 1:0 required (...)+ loop did not match anything at character 'a'
line 1:1 required (...)+ loop did not match anything at character 'b'
line 1:3 rule PUNC_CHAR failed predicate: {input.LA(2) == ' ' ||
input.LA(2) == '\t'}?

ab&\cd
line 1:0 required (...)+ loop did not match anything at character 'a'
line 1:1 required (...)+ loop did not match anything at character 'b'
line 1:2 required (...)+ loop did not match anything at character '&'
ID <\cd>

ab\ cd
ID <ab>
line 1:2 no viable alternative at character '\'
line 1:3 required (...)+ loop did not match anything at character ' '
ID <cd>
/////////////////////////////////////////////////////////////////////////

Now switch to Version 2 of BUG. The output is:

ab
ID <ab>

ab&
line 1:0 rule BUG failed predicate: {false}?
line 1:1 rule BUG failed predicate: {false}?
PUNC <&>

ab&\
line 1:0 rule BUG failed predicate: {false}?
line 1:1 rule BUG failed predicate: {false}?
line 1:2 rule BUG failed predicate: {false}?
line 1:3 no viable alternative at character '\'
WS < >

ab&\cd
line 1:0 rule BUG failed predicate: {false}?
line 1:1 rule BUG failed predicate: {false}?
line 1:2 rule BUG failed predicate: {false}?
ID <\cd>

ab\ cd
ID <ab>
line 1:2 no viable alternative at character '\'
line 1:3 rule BUG failed predicate: {false}?
ID <cd>
/////////////////////////////////////////////////////////////////////////

It is worth noting that this grammar is LL(1) because all other
lookahead is done with semantic predicates. However, the above
symptoms only occur if k>1. If k=2 then two errors remain; if k=1 then
the result is the same as when using Version 3 of BUG.

Without further ado, here's Version 3 of BUG:

ab
ID <ab>

ab&
ID <ab>
PUNC <&>

ab&\
ID <ab>
PUNC <&\>
WS < >

ab&\cd
ID <ab>
line 1:3 rule PUNC_CHAR failed predicate: {IsWS(2)}?
ID <cd>

ab\ cd
ID <ab>
PUNC <\>
WS < >
ID <cd>
/////////////////////////////////////////////////////////////////////////

That's much better! But there's a problem with the predicate because
it is not getting hoisted for some reason. mPUNC() sees a backslash
and decides that's all it needs to run mPUNC_CHAR. In fact,
mPUNC_CHAR() itself doesn't run the predicate until after it's
predicted the result. It escapes me why it would decide that the
alternative succeeds, then runs the predicate afterward - which fails,
of course.

Now let's try Version 3 of PUNC_CHAR - something slightly different happens:

ab
ID <ab>

ab&
ID <ab>
PUNC <&>

ab&\
ID <ab>
PUNC <&\>
WS < >

ab&\cd
ID <ab>
line 1:3 no viable alternative at character '\'
ID <cd>

ab\ cd
ID <ab>
PUNC <\>
WS < >
ID <cd>
/////////////////////////////////////////////////////////////////////////

This time, mPUNC decides that a backslash alone is enough to predict a
backslash, so it calls mPUNC_CHAR. mPUNC_CHAR actually bothers to run
the predicate and, of course, it fails, causing a no viable alt error.
Only Version 2 works as it should:

ab
ID <ab>

ab&
ID <ab>
PUNC <&>

ab&\
ID <ab>
PUNC <&\>
WS < >

ab&\cd
ID <ab>
PUNC <&>
ID <\cd>

ab\ cd
ID <ab>
PUNC <\>
WS < >
ID <cd>
/////////////////////////////////////////////////////////////////////////

This concludes my little bug report.
-- 
- David
http://qism.blogspot.com