[antlr-interest] Lexical error recovery by manual symbol (character) insertion/deletion?
Gavin Lambert
antlr at mirality.co.nz
Fri Feb 15 15:19:00 PST 2008
At 10:17 16/02/2008, Darach Ennis wrote:
>However the dummy-fragment approach is counter-intuitive at best
>because the lexer rules FLOAT and NUMBER are what I'm using in my
>parser rules section. But these are fragment rules... so this is
>bad practice!
>
>The fact that 'top-level' lexer rules can emit types and fragment
>rules cannot isn't mentioned in that section of the book. The
>'fragment' keyword merely infers the intended (ie: not mandated)
>visibility of the rule. So the 'dummy-fragment approach' is more
>of a lucky side-effect adopted as a feature? Thankfully, ANTLR
>doesn't complain when fragment rules are referred to directly in
>parser rules...
Well, not quite. What actually happens is that every named lexer
rule (including those listed in the "tokens" block) gets assigned
a token type id -- you can see this in the .tokens file that it
generates.
At the parser level, you are not referring to rules as such, you
are just saying "I am expecting a token with this type id".
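To make the type-id point concrete, here is a small Python sketch. It assumes the .tokens file uses one NAME=id entry per line; the sample contents, the ids, and the helper names load_token_types and expect are all made up for illustration and are not part of ANTLR.

```python
# Toy illustration only: sample .tokens contents with made-up ids.
SAMPLE_TOKENS = """\
DIGIT=4
NUMBER=5
FLOAT=6
INT=7
"""


def load_token_types(text):
    """Parse NAME=id lines into a dict (hypothetical helper)."""
    types = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '=' so a name that itself contains '=' survives.
        name, _, value = line.rpartition("=")
        types[name] = int(value)
    return types


def expect(token_type, name, types):
    """What a parser reference to `name` boils down to: compare type ids."""
    return token_type == types[name]


TYPES = load_token_types(SAMPLE_TOKENS)
```

Note that DIGIT and NUMBER appear in the table even though they are fragments, which is why a parser rule can mention them without complaint.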
Every lexer has a single automatic rule ("Tokens", with the
generated code in the mTokens method) which is what really gets
called to generate the token stream. This rule automatically
refers to all lexer rules (and non-empty "tokens" block entries),
except for those marked as being "fragment"s. The rules called by
the Tokens rule are usually called "top level rules".
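A rough Python model of that distinction, under stated assumptions: the LexerRule class, the starting id of 4, and both helper functions are invented for this sketch and do not mirror ANTLR's generated code. It only shows that every name receives a type id while the automatic Tokens rule dispatches to the non-fragment rules alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexerRule:
    name: str
    fragment: bool = False


# Rules from the INT/FLOAT example in this thread.
RULES = [
    LexerRule("DIGIT", fragment=True),
    LexerRule("NUMBER", fragment=True),
    LexerRule("INT"),
]

FIRST_USER_TYPE = 4  # assumption: arbitrary starting id for the sketch


def assign_type_ids(rules, imaginary=()):
    """Every named rule and every tokens-block entry gets a type id."""
    names = list(imaginary) + [r.name for r in rules]
    return {name: FIRST_USER_TYPE + i for i, name in enumerate(names)}


def top_level_rules(rules):
    """Only non-fragment rules are alternatives of the Tokens rule."""
    return [r.name for r in rules if not r.fragment]
```

With FLOAT passed as an imaginary token, four names get ids but only INT is reachable from the Tokens rule.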
Each call to the mTokens method is expected to generate exactly
one token. By default the token type matches that of the
top-level rule being called, but that rule can assign a different
type if it wants to. That's what is in play here. Since all
lexer rules (including fragments) have type ids, any of them are
possible values.
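As a hand-written Python sketch of that behaviour, using the INT/FLOAT example from this thread: each call returns exactly one token, starts with the top-level rule's own type, and switches to FLOAT only when a dot followed by a digit is seen. The type ids and the next_token function are assumptions for illustration, not what ANTLR generates.

```python
import re

INT, FLOAT = 4, 5  # made-up type ids for the sketch

_DIGITS = re.compile(r"[0-9]+")
_FRACTION = re.compile(r"\.[0-9]+")


def next_token(text, pos):
    """Model one call into the top-level INT rule.

    Returns (token_type, lexeme, new_pos) for exactly one token.
    """
    m = _DIGITS.match(text, pos)
    if m is None:
        raise ValueError("no digit at position %d" % pos)
    token_type = INT  # default: the type of the top-level rule itself
    end = m.end()
    # Plays the role of the (DOT DIGIT) => DOT NUMBER alternative:
    # only consume the dot when at least one digit follows it.
    frac = _FRACTION.match(text, end)
    if frac is not None:
        end = frac.end()
        token_type = FLOAT  # the rule reassigns its own token type
    return token_type, text[pos:end], end
```

For input "7.x" the lookahead fails, so the call yields an INT token "7" and leaves the dot for whatever rule comes next.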
>fragment DIGIT: '0'..'9'; // this is a genuine fragment rule
>imaginary NUMBER: DIGIT+; // this is an imaginary / special token
>imaginary FLOAT: NUMBER DOT NUMBER; // this is an imaginary / special token
>generator INT_OR_FLOAT
> : NUMBER // => NUMBER is implicit here?
> ( (DOT DIGIT) => DOT NUMBER { $type = FLOAT; } )?
> ;
I agree. I find it a bit irritating that I can't say "I'm
creating this rule just for convenience; it doesn't need a token
type id". Although I'd probably be happier with something like
this:
   tokens { FLOAT; }          // imaginary: type id generated but no warning
   fragment DIGIT: '0'..'9';  // fragment: no type id generated
   fragment NUMBER: DIGIT+;   // again, no type id
   INT                        // type id generated
     : NUMBER
       ( (DOT DIGIT) => DOT NUMBER { $type = FLOAT; } )?
     ;
This actually *is* valid syntax right now, but a while ago the
use of purely imaginary tokens (those in the tokens block without
bodies) started generating warnings for no apparent reason, and
fragments have always generated type ids (which I don't think
they should).
> Because ANTLR assumes that all lexer rules are valid tokens,
> you must prefix factored "helper rules" with the fragment
> keyword. This keyword tells ANTLR that you intend for this
> rule to be called only by other rules and that it should not
> yield a token to the parser.
[...]
> The token type (nonzero positive integer) of the token such
> as INT; translates to a call to getType().
>
>No mention of a restriction for fragment rules there, so one
>should be able to manipulate the type information in this case.
No, I think you're misinterpreting. The first paragraph basically
says "rules marked with 'fragment' don't generate tokens", while
the second says "$type holds the type id for the token that will
be generated". From the combination it should be evident that
$type is not valid in fragment rules.
>So when I see the keyword 'fragment' it is explicitly
>communicating 'do not refer to me in a parser rule'.
No, it's saying "I don't automatically generate a token", which is
not quite the same thing.