[antlr-interest] Lexing C# strings; generated lexer is incorrect?

Wed Jun 28 02:58:10 PDT 2006

Hi,

While trying to modify a C# lexer, I ran into trouble, so I reduced the
lexer to the smallest size at which the problem still occurs (see below).
It seems to me that either I'm doing something very wrong, or the
ANTLR-generated lexer source is incorrect. If someone can tell me which
of those is the case, and what I can do about it, I'd be very
thankful ;)

The offending lexer code:

---- csharp.g -----------
class CSharpLexer extends Lexer;
options { k=2; } // Or 3 or 4, doesn't matter

STRING_LITERAL : QUOTE 'a' QUOTE
                 | '@' QUOTE 'a' QUOTE;
QUOTE          : "\"";
------ end csharp.g ----

This construct is supposed to match C# strings; I simplified it to
match only the exact string "a" and the C# 'verbatim' string notation
@"a". One would expect that this lexer rule should certainly *not* match
the sequence @a (an at-sign followed by the character a, with no quotes
involved).

However, this is the generated CSharpLexer.java:

----- CSharpLexer.java ----
public Token nextToken() {
[..]
if ((LA(1)=='"'||LA(1)=='@') && (LA(2)=='"'||LA(2)=='a')) {
    mSTRING_LITERAL(true);
    theRetToken=_returnToken;
}
else if ((LA(1)=='"') && (true)) {
    mQUOTE(true);
    theRetToken=_returnToken;
}
---- end CSharpLexer.java -

This test sends the input '@a' (LA(1)=='@', LA(2)=='a') into
mSTRING_LITERAL, although that input certainly doesn't match the
pattern! Setting testLiterals (to true or false) doesn't make a
difference, by the way.
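
To check that reading of the generated test, the condition from
nextToken() can be copied into a small standalone Java class and
evaluated over every two-character combination of '"', '@' and 'a'.
This is only a sketch: the class and method names (PredictionTestCheck,
predictsStringLiteral, acceptedPairs) are made up for illustration, and
only the boolean expression itself comes from the generated
CSharpLexer.java.

```java
import java.util.ArrayList;
import java.util.List;

class PredictionTestCheck {
    // Same condition as the first branch of the generated nextToken(),
    // with LA(1) and LA(2) passed in as plain chars (hypothetical helper).
    static boolean predictsStringLiteral(char la1, char la2) {
        return (la1 == '"' || la1 == '@') && (la2 == '"' || la2 == 'a');
    }

    // Lists every two-character input over {", @, a} that the test accepts.
    static List<String> acceptedPairs() {
        char[] alphabet = {'"', '@', 'a'};
        List<String> accepted = new ArrayList<>();
        for (char first : alphabet) {
            for (char second : alphabet) {
                if (predictsStringLiteral(first, second)) {
                    accepted.add("" + first + second);
                }
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(acceptedPairs());
    }
}
```

Four pairs pass the test. Only "a and @" are real two-character prefixes
of the two STRING_LITERAL alternatives; "" and @a pass as well.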

This becomes really problematic when another rule is introduced, e.g.:

IDENTIFIER : '@' 'a'; // real version: STARTCHAR (ANYCHAR)+ etc.

(because identifiers in C# can also start with an '@' sign). This results
in the following warning:

ANTLR Parser Generator   Version 2.7.6 (2005-12-22)   1989-2005
csharp.g: warning:lexical nondeterminism between rules STRING_LITERAL
and IDENTIFIER upon
csharp.g:     k==1:'@'
csharp.g:     k==2:'a'

So apparently, ANTLR has some internal model in which the string '@a'
really matches the STRING_LITERAL token. Something doesn't seem right
here...
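
One way such a warning could arise is if lookahead is compared one depth
at a time, as a set of characters per position, rather than as whole
character sequences. That is an assumption about the generator, not
something the output above confirms. The Java sketch below contrasts the
two kinds of test for the k=2 prefixes of STRING_LITERAL ("a and @");
all names in it are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class PerDepthLookaheadSketch {
    // Collects the characters that any prefix has at the given depth (0-based).
    static Set<Character> charsAtDepth(String[] prefixes, int depth) {
        Set<Character> chars = new LinkedHashSet<>();
        for (String prefix : prefixes) {
            if (depth < prefix.length()) {
                chars.add(prefix.charAt(depth));
            }
        }
        return chars;
    }

    // Per-depth test: each input character only has to appear in the
    // set collected for its own depth, independent of the other depths.
    static boolean matchesPerDepth(String[] prefixes, String input) {
        for (int depth = 0; depth < input.length(); depth++) {
            if (!charsAtDepth(prefixes, depth).contains(input.charAt(depth))) {
                return false;
            }
        }
        return true;
    }

    // Exact test: the input must equal one prefix as a whole sequence.
    static boolean matchesExactly(String[] prefixes, String input) {
        for (String prefix : prefixes) {
            if (prefix.equals(input)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] stringLiteralPrefixes = {"\"a", "@\""};
        System.out.println(matchesPerDepth(stringLiteralPrefixes, "@a"));
        System.out.println(matchesExactly(stringLiteralPrefixes, "@a"));
    }
}
```

Under the per-depth test, '@a' is accepted for STRING_LITERAL because '@'
occurs at depth 1 of one alternative and 'a' at depth 2 of the other;
under the exact test it is rejected. A per-depth comparison would also
overlap with IDENTIFIER on k==1:'@' and k==2:'a', which is the pair
reported in the warning.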

I suspect this may be related to the lexical lookahead/end-of-token
behaviour described here:
http://sds.sourceforge.net/src/antlr/doc/lexer.html#lexicallookahead
Because the QUOTE token is a single character that can be followed by
any other token (i.e. "anything"), maybe it is impossible to distinguish
QUOTE from STRING_LITERAL? If that is the case, though, ANTLR should
already have complained about the first lexer (the one without the
IDENTIFIER rule) instead of generating code that accepts input that
clearly does not match the specified pattern.

To summarize: I want to lex C# strings (such as "A" or @"A") and C#
identifiers (such as A or @A), and also recognize the QUOTE (") as a
standalone token.

Question 1: Is that even possible in a lexer?
Question 2: If not (because the end-of-token ambiguity effectively
limits lookahead), shouldn't ANTLR have flagged my first lexer
definition as ambiguous, rather than generating an incorrect lexer?
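
For reference, this is the behaviour I am after, written as a
hand-rolled Java sketch that has nothing to do with ANTLR. It covers
only the simplified token set from this post ("a", @"a", @a and a lone
quote) and compares whole patterns, longest first. The class name,
method name and token format are invented for the example; it does not
show whether ANTLR 2.7.6 can be made to do the same.

```java
import java.util.ArrayList;
import java.util.List;

class ExactMatchLexerSketch {
    // Token name and literal text, ordered so that longer patterns are
    // tried before the shorter patterns they start with.
    private static final String[][] RULES = {
        {"STRING_LITERAL", "@\"a\""},
        {"STRING_LITERAL", "\"a\""},
        {"IDENTIFIER", "@a"},
        {"QUOTE", "\""},
    };

    // Splits the input into tokens by comparing whole patterns at the
    // current position; throws if no pattern matches there.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (String[] rule : RULES) {
                if (input.startsWith(rule[1], pos)) {
                    tokens.add(rule[0] + "(" + rule[1] + ")");
                    pos += rule[1].length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException(
                    "no token matches at position " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input: @a followed by "a" followed by a lone quote.
        System.out.println(tokenize("@a\"a\"\""));
    }
}
```

With whole-pattern comparison, @a comes out as an IDENTIFIER, "a" as a
STRING_LITERAL, and the trailing quote as a QUOTE, so for this reduced
token set the three can at least be told apart in principle.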

Attached: the offending lexer source + generated lexer (source).

Any help would be appreciated,

Wilke Havinga

-------------- next part --------------
class CSharpLexer extends Lexer;

options 
{
	k=2; // Was originally 4; doesn't matter
}	

STRING_LITERAL
	:	QUOTE 'a' QUOTE
	|	'@' QUOTE 'a' QUOTE
	;

QUOTE		:	"\""    ;

// Uncomment to get the ambiguity warning
//IDENTIFIER
//	:	'@' 'a';

-------------- next part --------------
A non-text attachment was scrubbed...
Name: CSharpLexer.java
Type: text/java
Size: 4252 bytes
Desc: not available
Url : http://www.antlr.org/pipermail/antlr-interest/attachments/20060628/5ada5394/CSharpLexer.bin