[antlr-interest] Lexer problem with wildcard matching

Fri Feb 22 06:32:21 PST 2008

Dear ANTLR users and professionals,

I'm currently working on a source-to-source compiler that should handle
preprocessing directives. My goal is to keep the actual preprocessing
independent of the language it is embedded in, i.e. I only want to parse
the preprocessor lines; the lines in between I need to track, but I
don't want to parse them. I'm using ANTLR v3 and ANTLRWorks v1.1.7.

For example, I'd like to parse a file like this:
==
bla bla
#specialprefix
bla bla
==

I'd like the lexer to produce a token list like SOMELINE PREFIX
SOMELINE, so I wrote the following grammar:

==
grammar Test;
options { output=AST; }

att	: attline* ;

attline	: PREFIX TERMINATOR | SOMELINE ;

PREFIX	: '#specialprefix' { System.out.println("PREFIX"); } ;

WS	: ( ' ' | '\t' | '\f' ) { $channel=HIDDEN; } ;

TERMINATOR : ( '\n' | '\r' | '\r\n' | '\n\r' ) ;

SOMELINE	: .* TERMINATOR ;
==

(This is just the tricky little part of a bigger grammar with more
useful stuff in the #specialprefix line.)

Unfortunately, the SOMELINE rule eats my #specialprefix line as well ...
my System.out.println("PREFIX") is never called.
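To make the intended token stream concrete, here is a minimal sketch in plain Java (not ANTLR) of the classification I'm after. The class name LineClassifierSketch and the method tokenTypes are hypothetical and only serve as illustration; the sketch assumes a directive line starts with the prefix in column one and reports one token type per line.

```java
import java.util.ArrayList;
import java.util.List;

public class LineClassifierSketch {
    // Prefix taken from the example input above.
    static final String PREFIX = "#specialprefix";

    // Returns one token type name per input line (hypothetical helper).
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        if (input.isEmpty()) {
            return types; // no lines, no tokens
        }
        // Same line terminators as the TERMINATOR rule: \r\n, \n\r, \n, \r.
        for (String line : input.split("\r\n|\n\r|\n|\r")) {
            // Lines starting with the prefix are directives; all others are opaque.
            types.add(line.startsWith(PREFIX) ? "PREFIX" : "SOMELINE");
        }
        return types;
    }

    public static void main(String[] args) {
        String input = "bla bla\n#specialprefix\nbla bla\n";
        System.out.println(tokenTypes(input)); // prints [SOMELINE, PREFIX, SOMELINE]
    }
}
```

This is only a statement of the expected result, not a replacement for the lexer; the real grammar has to parse more than the prefix in the directive line.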

Does anyone have an idea what I'm doing wrong?

I investigated the generated lexer code further and found something that
looks odd to me: the mTokens() method contains generated code that looks
ahead in the character stream for '#specialprefix', character by
character. The last character that should match before calling
mPREFIX() is 'x', right? Well, the lexer looks one character further:
if ( (LA3_18=='t') ) {
  int LA3_19 = input.LA(12);
  if ( ((LA3_19>='\u0000' && LA3_19<='\uFFFE')) ) {
    alt3=4;
  } else {
    alt3=1;
  }
}
(The code is from the real example, not from the grammar above, but the
underlying problem is the same.) When alt3 is 1, the correct call to
mPREFIX() follows. Instead, alt3 is set to 4, which yields mSOMELINE(),
because the condition of the inner if evaluates to true. Why does the
lexer apparently require \uFFFF as the next character before it chooses
mPREFIX()?
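The decision in that snippet can be isolated in a small standalone sketch. The class LookaheadSketch and the method predict are hypothetical names, not part of the generated lexer; the sketch only copies the range check from the code above and assumes that end of input is reported as a value outside \u0000..\uFFFE (for example -1 or \uFFFF).

```java
public class LookaheadSketch {
    // Alternative numbers as they appear in the generated snippet.
    static final int ALT_PREFIX = 1;   // leads to mPREFIX()
    static final int ALT_SOMELINE = 4; // leads to mSOMELINE()

    // Mirrors the inner if/else: 'la' is the character after the matched prefix.
    static int predict(int la) {
        if (la >= '\u0000' && la <= '\uFFFE') {
            return ALT_SOMELINE; // any ordinary character follows
        }
        return ALT_PREFIX;       // nothing in the range follows (assumed: end of input)
    }

    public static void main(String[] args) {
        System.out.println(predict('\n'));   // prints 4
        System.out.println(predict(' '));    // prints 4
        System.out.println(predict(-1));     // prints 1
        System.out.println(predict(0xFFFF)); // prints 1
    }
}
```

With this check, a prefix followed by a line terminator or any other character always ends up in alternative 4, which matches what I observe.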

Thanks for your help.

Best regards,
Thomas
