[antlr-interest] Lexical error recovery by manual symbol (character) insertion/deletion?

Fri Feb 15 13:17:10 PST 2008

Hi Gavin,

Thanks for the clarifications and historical information.

However, the dummy-fragment approach is counter-intuitive at best, because
FLOAT and NUMBER are the lexer rules I refer to in my parser rules section,
yet they are declared as fragment rules... so this looks like bad practice!

The fact that 'top-level' lexer rules can emit types and fragment rules
cannot isn't mentioned in that section of the book. The 'fragment' keyword
merely implies the intended (i.e. not mandated) visibility of the rule. So
the 'dummy-fragment approach' is more of a lucky side-effect adopted as a
feature? Thankfully, ANTLR doesn't complain when fragment rules are referred
to directly in parser rules...

The following makes more sense, at least to me, but is of course completely
illegal syntax:

fragment DIGIT: '0'..'9'; // this is a genuine fragment rule
imaginary NUMBER: DIGIT+; // this is an imaginary / special token
imaginary FLOAT: NUMBER DOT NUMBER; // this is an imaginary / special token
generator INT_OR_FLOAT
  :  NUMBER // => NUMBER is implicit here?
     ( (DOT DIGIT) => DOT NUMBER { $type = FLOAT; } )?
  ;

I appreciate there may be technical reasons for restricting fragment rules
so that they cannot emit tokens, but being a fragment (lexer-visible, not
parser-visible) and being a type-generating rule (one that may emit tokens)
should not necessarily be mutually exclusive.

To quote from the definitive ANTLR bible, pages 92-93, in the 'Fragment
Lexer Rules' section:

      Because ANTLR assumes that all lexer rules are valid tokens, you must
      prefix factored "helper rules" with the fragment keyword. This keyword
      tells ANTLR that you intend for this rule to be called only by other
      rules and that it should not yield a token to the parser.

Similarly, on page 127, in the token/lexer rule attribute table the
attribute $type is described as:

      The token type (nonzero positive integer) of the token such as INT;
      translates to a call to getType().

No mention of a restriction for fragment rules there, so one should be able
to manipulate the type information in this case.

So when I see the keyword 'fragment' it is explicitly communicating 'do not
refer to me in a parser rule'. The dummy-fragment approach is breaking this
rule and so I (innocently) overlooked any solution using them. The bible
deems this a sin, but as this is an accepted means to an end, call me a
sinner!

I'll take the approach, because it works, but I'll document every single
usage in my lexer rules, because it is relatively non-standard and
undocumented unless you use ANTLR every day.
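
As a note for that documentation, the decision the INT rule makes can be
sketched by hand in plain Java: match NUMBER, peek ahead for DOT DIGIT, and
only then consume the fraction and switch the type to FLOAT. This is an
illustration only and not the code ANTLR generates; the class name, the
scan() method and the INT/FLOAT constant values are all hypothetical.

```java
// Hand-written illustration only: this is NOT the code ANTLR generates.
// The class name, method names and type constants are hypothetical.
final class NumberTypeSketch {
    static final int INT = 1;
    static final int FLOAT = 2;

    /**
     * Scans a number at the start of the input.
     * Returns {type, matchedLength}, or null if the input has no leading digit.
     */
    static int[] scan(String input) {
        int i = 0;
        // NUMBER: DIGIT+
        while (i < input.length() && isDigit(input.charAt(i))) {
            i++;
        }
        if (i == 0) {
            return null;
        }
        int type = INT;
        // (DOT DIGIT) => DOT NUMBER { $type = FLOAT; }
        // Two characters of lookahead: commit to the fraction only if a digit
        // follows the dot.
        if (i + 1 < input.length()
                && input.charAt(i) == '.'
                && isDigit(input.charAt(i + 1))) {
            i++; // DOT
            while (i < input.length() && isDigit(input.charAt(i))) {
                i++;
            }
            type = FLOAT;
        }
        return new int[] { type, i };
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    public static void main(String[] args) {
        for (String s : new String[] { "42", "3.14", "7.", "1..2" }) {
            int[] r = scan(s);
            String name = (r[0] == FLOAT) ? "FLOAT" : "INT";
            System.out.println(s + " -> " + name + " \"" + s.substring(0, r[1]) + "\"");
        }
    }
}
```

With this sketch, "7." and "1..2" both stop before the dot and stay INT,
which is the point of the (DOT DIGIT) predicate: the dot is left in the
input for whatever rule comes next.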

Regards,

Darach.

On Fri, Feb 15, 2008 at 7:42 PM, Gavin Lambert <antlr at mirality.co.nz> wrote:

> At 04:34 16/02/2008, Darach Ennis wrote:
> >After some trial and error and a little brain-stretching the
> >following seems to work:
> >
> >F:   ('0' | '1'..'9' '0'..'9'*)
> >     (
> >         { input.LA(1) == '.' && Character.isDigit(input.LA(2)) }?=>
> >             ('.' '0'..'9'+) { _type = F; }
> >         |   { _type = I; }
> >     )
> >     ;
>
> First: don't use _type (that's an implementation detail).  Use
> $type instead.
>
> Second: solutions to this issue have been posted several times
> before; a common alternative solution is:
>
> fragment DIGIT: '0'..'9';
> fragment NUMBER: DIGIT+;
> fragment FLOAT: NUMBER DOT NUMBER;
> INT
>   :  NUMBER
>      ( (DOT DIGIT) => DOT NUMBER { $type = FLOAT; } )?
>   ;
>
> (Or you could replace that first NUMBER in the INT rule with ('0'
> | '1'..'9' DIGIT*) if you wanted to ensure leading zeros were
> invalid.)
>
> The actual contents of the FLOAT rule don't matter, though it's
> usually preferable to make it look similar to what it's going to
> represent.
>
> FLOAT can actually be put into the tokens section instead, but
> only if it has no content (since if it has content it becomes a
> top-level rule, which isn't the goal); unfortunately doing this
> causes ANTLR to emit a warning at present, which is why the dummy
> fragment approach is usually preferred.
>
> >The _type field should be defined in lexer fragment rules so that
> >ambiguity such as the above can be resolved without making a rule
> >public.
>
> Lexer fragment rules never emit tokens, so $type is completely
> meaningless for them.  Any type-juggling must be done in the
> top-level rule.
>
>