[antlr-interest] Lexical error recovery by manual symbol (character) insertion/deletion?

Fri Feb 15 15:19:00 PST 2008

At 10:17 16/02/2008, Darach Ennis wrote:
>However the dummy-fragment approach is counter-intuitive at best
>because the lexer rules FLOAT and NUMBER are what I'm using in my
>parser rules section. But these are fragment rules... so this is
>bad practice!
>
>The fact that 'top-level' lexer rules can emit types and fragment
>rules cannot isn't mentioned in that section of the book. The
>'fragment' keyword merely infers the intended (ie: not mandated)
>visibility of the rule. So the 'dummy-fragment approach' is more
>of a lucky side-effect adopted as a feature? Thankfully, ANTLR
>doesn't complain when fragment rules are referred to directly in
>parser rules...

Well, not quite.  What actually happens is that every named lexer 
rule (including those listed in the "tokens" block) gets assigned 
a token type id -- you can see this in the .tokens file that it 
generates.

At the parser level, you are not referring to rules as such, you 
are just saying "I am expecting a token with this type id".

Every lexer has a single automatic rule ("Tokens", with the 
generated code in the mTokens method) which is what really gets 
called to generate the token stream.  This rule automatically 
refers to all lexer rules (and non-empty "tokens" block entries), 
except for those marked as being "fragment"s.  The rules called by 
the Tokens rule are usually called "top level rules".
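
To make the split between "has a type id" and "is reachable from 
Tokens" concrete, here is a rough Java model.  It is not code that 
ANTLR generates: the class, the method names, the example rule 
list (including DOT being a top-level rule) and the numbering of 
the type ids from 4 upwards are all assumptions made for 
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; this is not code that ANTLR generates.
final class TokensRuleSketch {
    // A lexer rule as declared in the grammar: its name and whether it is a fragment.
    static final class Rule {
        final String name;
        final boolean fragment;

        Rule(String name, boolean fragment) {
            this.name = name;
            this.fragment = fragment;
        }
    }

    // Every named rule gets a type id, fragment or not.
    // Assumption: ids are handed out in declaration order, starting at 4.
    static Map<String, Integer> assignTypeIds(List<Rule> rules) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int next = 4;
        for (Rule r : rules) {
            ids.put(r.name, next++);
        }
        return ids;
    }

    // Alternatives of the synthetic Tokens rule: every rule except the fragments.
    static List<String> tokensAlternatives(List<Rule> rules) {
        List<String> alts = new ArrayList<>();
        for (Rule r : rules) {
            if (!r.fragment) {
                alts.add(r.name);
            }
        }
        return alts;
    }

    // Hypothetical grammar: DIGIT and NUMBER are fragments, DOT and INT are top-level.
    static List<Rule> exampleRules() {
        return Arrays.asList(
            new Rule("DIGIT", true),
            new Rule("NUMBER", true),
            new Rule("DOT", false),
            new Rule("INT", false));
    }

    public static void main(String[] args) {
        List<Rule> rules = exampleRules();
        System.out.println("type ids: " + assignTypeIds(rules));
        System.out.println("Tokens  : " + String.join(" | ", tokensAlternatives(rules)));
    }
}
```

In this model all four rules end up with a type id, but only DOT 
and INT appear as alternatives of Tokens.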

Each call to the mTokens method is expected to generate exactly 
one token.  By default the token type will match that of the 
top-level rule being called, but that rule can assign a different 
type if it wants to.  That's what is in play here.  Since all 
lexer rules (including fragments) have type ids, any of them is a 
possible value.
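
As a rough illustration of that override, here is a hand-written 
Java stand-in for a lexer with a single top-level rule.  It is 
only a sketch and not ANTLR's generated code: the class name, the 
classify helper and the type id values are made up, and the method 
names merely imitate the mXXX naming of generated lexers.

```java
// Hand-written stand-in for a lexer with one top-level rule; not ANTLR output.
final class TypeOverrideSketch {
    // Made-up type ids, in the style of a .tokens file.
    static final int INT = 4;
    static final int FLOAT = 5;

    private final String input;
    private int pos = 0;
    private int type; // plays the role of $type

    TypeOverrideSketch(String input) {
        this.input = input;
    }

    private boolean digitAt(int i) {
        return i < input.length() && Character.isDigit(input.charAt(i));
    }

    // Models "fragment NUMBER: DIGIT+;": consumes characters, emits nothing itself.
    private void mNUMBER() {
        if (!digitAt(pos)) {
            throw new IllegalArgumentException("digit expected at offset " + pos);
        }
        while (digitAt(pos)) {
            pos++;
        }
    }

    // Models the top-level rule
    //   INT : NUMBER ( (DOT DIGIT) => DOT NUMBER { $type = FLOAT; } )? ;
    private void mINT() {
        type = INT;                 // default: the type of the rule being called
        mNUMBER();
        boolean dotThenDigit = pos < input.length()
                && input.charAt(pos) == '.'
                && digitAt(pos + 1); // stands in for the (DOT DIGIT) => predicate
        if (dotThenDigit) {
            pos++;                  // DOT
            mNUMBER();
            type = FLOAT;           // the rule reassigns the token type
        }
    }

    // One call produces exactly one token, reported as "TYPE:text".
    String nextToken() {
        int start = pos;
        mINT();
        String name = (type == FLOAT) ? "FLOAT" : "INT";
        return name + ":" + input.substring(start, pos);
    }

    // Convenience helper: lex the first token of the given text.
    static String classify(String text) {
        return new TypeOverrideSketch(text).nextToken();
    }

    public static void main(String[] args) {
        System.out.println(classify("42"));   // INT:42
        System.out.println(classify("3.14")); // FLOAT:3.14
        System.out.println(classify("7."));   // INT:7 (the dot is left unconsumed)
    }
}
```

The point of the sketch is that the rule being called is always 
the same one; only the type recorded for the resulting token 
differs.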

>fragment DIGIT: '0'..'9'; // this is a genuine fragment rule
>imaginary NUMBER: DIGIT+; // this is an imaginary / special token
>imaginary FLOAT: NUMBER DOT NUMBER; // this is an imaginary / special token
>generator INT_OR_FLOAT
>   :  NUMBER // => NUMBER is implicit here?
>      ( (DOT DIGIT) => DOT NUMBER { $type = FLOAT; } )?
>   ;

I agree.  I find it a bit irritating that I can't say "I'm 
creating this rule just for convenience; it doesn't need a token 
type id".  Although I'd probably be happier with something like 
this:

tokens { FLOAT; }          // imaginary: type id generated, no warning
fragment DIGIT: '0'..'9';  // fragment: no type id generated
fragment NUMBER: DIGIT+;   // again, no type id
INT                        // type id generated
   : NUMBER
     ( (DOT DIGIT) => DOT NUMBER { $type = FLOAT; } )?
   ;

This actually *is* valid syntax right now.  However, purely 
imaginary tokens (those in the tokens block without bodies) 
started generating warnings a while ago, for no reason that I can 
see, and fragments have always generated type ids (which I don't 
think they should).

>       Because ANTLR assumes that all lexer rules are valid
>       tokens, you must prefix factored "helper rules" with the
>       fragment keyword. This keyword tells ANTLR that you intend
>       for this rule to be called only by other rules and that it
>       should not yield a token to the parser.
[...]
>       The token type (nonzero positive integer) of the token
>       such as INT; translates to a call to getType().
>
>No mention of a restriction for fragment rules there, so one 
>should be able to manipulate the type information in this case.

No, I think you're misinterpreting.  The first paragraph basically 
says "rules marked with 'fragment' don't generate tokens", while 
the second says "$type holds the type id for the token that will 
be generated".  From the combination it should be evident that 
$type is not valid in fragment rules.

>So when I see the keyword 'fragment' it is explicitly 
>communicating 'do not refer to me in a parser rule'.

No, it's saying "I don't automatically generate a token", which is 
not quite the same thing.