[antlr-interest] [BUG] 3.0b4 no complaint on parser reference to lexical fragment

Sun Nov 12 19:38:33 PST 2006

On 13. Nov 2006, at 4:02, John B. Brodie wrote:

>
> and is this a feature or a bug?

this is a feature.

> i am trying to assert that this is a bug.

i realize ;)

> from the fact that "fragment X <...snip...> rule X cannot and does not
> return a token by itself."

differentiate between 'rule X' and...

> we must conclude that "<...snip...>, thus passing X tokens to the parser."
> shall *not* be permitted.

...token type X.

> i understand that the current antlr v3 implementation (3.b04) does not
> consider references to lexical fragments by the parser as an error.  i am
> just trying to assert that this current implementation is problematic.

This has not been changed in the upcoming b5.

The key point to see here is that the parser does not "call" a lexer rule!
It merely reads from a token stream, which in turn calls nextToken() on
the lexer. How the lexer comes up with the token it returns is unspecified,
and this is a GoodThing(tm). It means that you could use a lexer with a
different internal structure (say, different rules) or even a lexer not
generated by ANTLR. You could write a simple wrapper around flex or
hand-code a lexer if you have special needs, such as performance.
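
To make the decoupling concrete, here is a minimal Java sketch. It does not
use the ANTLR runtime: Tok, TokSource, HandLexer and MiniParser are made-up
names for illustration, and the token type numbers are assumptions. The
"parser" side only ever sees token types coming out of nextToken(), so it
cannot tell whether they were produced by generated rules or by a
hand-written loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not the ANTLR runtime classes.
final class Tok {
    static final int EOF = -1, ID = 4, INT = 5;
    final int type;
    final String text;
    Tok(int type, String text) { this.type = type; this.text = text; }
}

// The only thing the parser side depends on.
interface TokSource {
    Tok nextToken();
}

// A hand-coded lexer: no rules at all, just a loop over characters.
final class HandLexer implements TokSource {
    private final String in;
    private int p = 0;
    HandLexer(String in) { this.in = in; }

    public Tok nextToken() {
        while (p < in.length() && in.charAt(p) == ' ') p++;   // skip blanks
        if (p >= in.length()) return new Tok(Tok.EOF, "<EOF>");
        int start = p;
        if (Character.isDigit(in.charAt(p))) {
            while (p < in.length() && Character.isDigit(in.charAt(p))) p++;
            return new Tok(Tok.INT, in.substring(start, p));
        }
        if (Character.isLetter(in.charAt(p))) {
            while (p < in.length() && Character.isLetterOrDigit(in.charAt(p))) p++;
            return new Tok(Tok.ID, in.substring(start, p));
        }
        throw new IllegalStateException("unexpected character at " + p);
    }
}

// The "parser": it looks at token types only and never calls a lexer rule.
final class MiniParser {
    static List<Integer> typesOf(TokSource source) {
        List<Integer> types = new ArrayList<>();
        for (Tok t = source.nextToken(); t.type != Tok.EOF; t = source.nextToken()) {
            types.add(t.type);
        }
        return types;
    }
}
```

Any other TokSource implementation, for example a wrapper around another
lexer generator, could be passed to MiniParser.typesOf without changing it.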

So, even if it might look as if the parser is calling rule X in the lexer
class, it's not! The parser isn't concerned with the lexer rules at all;
it's just interested in the type of a particular token (which is also
called X). Maybe this overlap in terminology is the source of the
confusion.

A rule X implies that the token it returns has type X, but that is not
enforced at all. Returning a token with a different type is the exception
rather than the rule, but sometimes it's the easiest way out (as in lexing
number literals). I think it would be unnatural to forbid the use of token
types induced by fragment rules, and there's no need to do that either.
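
The number-literal case can be sketched in plain Java. This is not
ANTLR-generated code: NumberLexer, lexNumber and the type constants are
hypothetical. One routine (think of it as a single lexer rule) decides only
after scanning whether the token it hands back is an INT or a FLOAT, so the
token type is not tied to the name of the rule that produced it.

```java
// Hypothetical sketch, not ANTLR-generated code.
final class NumberLexer {
    static final int INT = 5, FLOAT = 6;

    // Result of one "rule" invocation: the chosen type plus the matched text.
    static final class Result {
        final int type;
        final String text;
        Result(int type, String text) { this.type = type; this.text = text; }
    }

    // Scans digits, optionally followed by '.' and more digits,
    // from the start of the input.
    static Result lexNumber(String in) {
        int p = 0;
        while (p < in.length() && Character.isDigit(in.charAt(p))) p++;
        if (p == 0) throw new IllegalArgumentException("not a number: " + in);
        int type = INT;  // default type for this routine
        if (p + 1 < in.length() && in.charAt(p) == '.'
                && Character.isDigit(in.charAt(p + 1))) {
            p++;  // consume '.'
            while (p < in.length() && Character.isDigit(in.charAt(p))) p++;
            type = FLOAT;  // same routine, different token type
        }
        return new Result(type, in.substring(0, p));
    }
}
```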

When I stretch my mind a bit, I can even imagine that I'd actually want to
emit tokens for fragment rules. Although I realize that I might totally
confuse the issue at hand right now, I cannot refrain from writing this
down ;)
Ok, what am I thinking?
Consider the following rules in the lexer:

FOO
   : start=ID c=C end=ID
     // this is pseudocode, but i think you get what i mean
     { emit(FOO); emit(start); emit(c); emit(end); }
   ;
fragment C : '0x01223'; // some magic thing that should not be a normal token.
fragment ID : 'a'..'z' ('a'..'z'|'0'..'9')*;

Suppose you have built a lexer subclass that can emit multiple tokens for
one lexer rule (ANTLR by default emits a maximum of one token per lexer
rule). In the parser you'd like to receive multiple tokens when you
reference FOO. You could write:

somerule : FOO ID C ID ; // FOO generates ID C ID even though the rules are fragments!

What have you gained? You might avoid fiddling with the tokens' text in
the parser, and you could possibly set up finer control of token channels,
etc. This might be a bad idea, but it is interesting nonetheless. ;)
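
The "lexer that emits multiple tokens per rule" part could look roughly like
the following Java sketch. It is self-contained and does not use the ANTLR
runtime: MTok, MultiEmitLexer, ruleFoo and the type numbers are hypothetical,
and the matching of ID C ID is faked by locating the magic text, an
assumption made only to keep the sketch short. The point is the queue:
emit() may be called several times per rule, and nextToken() drains the
queue before running another rule.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch, not the ANTLR runtime.
final class MTok {
    static final int EOF = -1, FOO = 4, ID = 5, C = 6;
    final int type;
    final String text;
    MTok(int type, String text) { this.type = type; this.text = text; }
}

final class MultiEmitLexer {
    private static final String MAGIC = "0x01223";
    private final Queue<MTok> pending = new ArrayDeque<>();
    private final String in;
    private boolean done = false;

    MultiEmitLexer(String in) { this.in = in; }

    // Unlike a one-token-per-rule lexer, emit() may be called many times per rule.
    private void emit(int type, String text) { pending.add(new MTok(type, text)); }

    // Stand-in for rule FOO : start=ID c=C end=ID. The matching is faked
    // by locating the magic text; only the emit calls matter here.
    private void ruleFoo() {
        int at = in.indexOf(MAGIC);
        if (at <= 0 || at + MAGIC.length() >= in.length()) {
            throw new IllegalStateException("input does not match FOO: " + in);
        }
        emit(MTok.FOO, in);
        emit(MTok.ID, in.substring(0, at));
        emit(MTok.C, MAGIC);
        emit(MTok.ID, in.substring(at + MAGIC.length()));
    }

    // Drain queued tokens first; run the rule only when the queue is empty.
    MTok nextToken() {
        if (pending.isEmpty() && !done) {
            ruleFoo();
            done = true;
        }
        MTok t = pending.poll();
        return t != null ? t : new MTok(MTok.EOF, "<EOF>");
    }

    // Convenience driver: collects the token types a parser would see.
    static List<Integer> allTypes(String input) {
        MultiEmitLexer lexer = new MultiEmitLexer(input);
        List<Integer> types = new ArrayList<>();
        for (MTok t = lexer.nextToken(); t.type != MTok.EOF; t = lexer.nextToken()) {
            types.add(t.type);
        }
        return types;
    }
}
```

With this in place, a parser rule like somerule above would see the type
sequence FOO ID C ID for a single invocation of the FOO stand-in.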

I need coffee. Quick.

cheers,
-k

-- 
Kay Röpke
http://classdump.org/