[antlr-interest] Fragment and error tokens are not generated by emit() -- breaks setting TokenLabelType

Thu Jul 9 13:17:12 PDT 2009

Jim Idle wrote:
> David-Sarah Hopwood wrote:
>> In order to change the token type (so that I can associate another
>> field with each token), I have overridden emit() in my lexer as shown
>> below:
[snipped]
>> The problem is that the generated code has compilation errors because
>> not all creation of token objects goes through emit(); there are some
>> direct uses of 'new CommonToken(...)':
>>
>> org\jacaranda\verifier\JacarandaLexer.java:4006: incompatible types
>> found   : org.antlr.runtime.CommonToken
>> required: org.jacaranda.verifier.JacarandaToken
>>   d = new CommonToken(input, Token.INVALID_TOKEN_TYPE,
>>           Token.DEFAULT_CHANNEL, dStart1982, getCharIndex()-1);
>>       ^
>>
>> The errors seem to occur in lexer rules where a child rule that is
>> a fragment is given a name (whether or not that name is used in an
>> action), for example: [snipped]
>
> Yes, I think you are correct.
> 
> The solution is either that return types of fragments in lexerRuleRef() 
> template, should be known to be CommonToken or that the hard coded new 
> CommonToken should use the template's labelType.

The latter assumes that the labelType has a constructor with
signature (CharStream input, int type, int channel, int start, int stop),
and makes it inconvenient for token creation to make use of any other
lexer state (since the lexer object is not passed into the token
constructor).

> Probably the latter, 
> but fragment labels are really for using the token to determine the 
> start and end of fragments spans and so on, rather than emit()'ing them, 
> so perhaps not.
> 
> However, if you separate the lexer and parser, and use TokenLabelType 
> only in the parser grammar, then I think it would work as you require, 
> even with labeled fragment rules.

Unfortunately not. I've since found other cases in which a CommonToken
gets placed on the token stream (e.g. due to a parse error), and then
the generated code for the parser (not lexer) fails at runtime when it
attempts to cast it to a JacarandaToken. This is much harder to work
around, because the cast is in generated parser code. For the time
being, I've stopped using the TokenLabelType option and decided to live
with the tokens being a mixture of JacarandaToken and CommonToken, but
that's not an ideal solution.

// parser rule
unreservedIdentifier
  : id=Identifier^ { ... }
  ;

// generated parser code
JacarandaToken id=null;
...
// line 2003:
id=(JacarandaToken)match(input,Identifier,FOLLOW_Identifier_in_unreservedIdentifier3071);

// when used to parse input where Identifier is missing:
line 1:0 [unreservedIdentifier] missing Identifier at [@0,0:4='throw',<27>,1:0]
java.lang.ClassCastException: org.antlr.runtime.CommonToken cannot be cast to org.jacaranda.verifier.JacarandaToken
        at org.jacaranda.verifier.JacarandaParser.unreservedIdentifier(JacarandaParser.java:2003)

So it seems that the only workable fix that would maintain compatibility
with the emit() recipe given in
<http://www.antlr.org/wiki/pages/viewpage.action?pageId=1844> is to
eliminate all instances of 'new CommonToken(...)' from generated code
and have all token objects created via emit(). Alternatively, the recipe
could be changed to create all tokens via a factory method:

@lexer::members {
  public Token createToken(CharStream input, int type, int channel,
                           int start, int stop) {
    return new MyToken(input, type, channel, start, stop);
  }
}

(with the default emit() using this method, of course).
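
To make the factory-method idea concrete, here is a minimal sketch in plain
Java. It is not ANTLR's runtime code: every type ending in "Sketch" is a
hypothetical stand-in that only mimics the shape of Token, CommonToken and
Lexer, so that the example compiles without the ANTLR jar. labelledFragment()
is likewise an invented name for the spot where generated code currently does
'new CommonToken(...)'. Because both that spot and the default emit() go
through createToken(), a single override changes the type of every token, and
the override can read lexer state since it is an instance method.

```java
// Hypothetical stand-ins so the sketch compiles without the ANTLR runtime.
interface TokenSketch {
    int getType();
    int getStart();
    int getStop();
}

// Plays the role of the default token class.
class CommonTokenSketch implements TokenSketch {
    private final CharSequence input;
    private final int type;
    private final int channel;
    private final int start;
    private final int stop;

    CommonTokenSketch(CharSequence input, int type, int channel, int start, int stop) {
        this.input = input;
        this.type = type;
        this.channel = channel;
        this.start = start;
        this.stop = stop;
    }

    public int getType() { return type; }
    public int getChannel() { return channel; }
    public int getStart() { return start; }
    public int getStop() { return stop; }

    // Text covered by the token; stop is inclusive.
    public String getText() { return input.subSequence(start, stop + 1).toString(); }
}

// Custom token carrying one extra field, as in the original use case.
class MyTokenSketch extends CommonTokenSketch {
    final String extra;

    MyTokenSketch(CharSequence input, int type, int channel, int start, int stop, String extra) {
        super(input, type, channel, start, stop);
        this.extra = extra;
    }
}

// Plays the role of the lexer base class.
class LexerSketch {
    static final int INVALID_TOKEN_TYPE = 0; // stand-in constant
    static final int DEFAULT_CHANNEL = 0;    // stand-in constant

    protected final CharSequence input;
    protected int tokenStart; // start index of the token being matched
    protected int charIndex;  // index of the next unread character

    LexerSketch(CharSequence input) { this.input = input; }

    // The single place where token objects are created.
    protected TokenSketch createToken(CharSequence input, int type, int channel,
                                      int start, int stop) {
        return new CommonTokenSketch(input, type, channel, start, stop);
    }

    // Default emit() delegates to the factory method.
    public TokenSketch emit(int type) {
        return createToken(input, type, DEFAULT_CHANNEL, tokenStart, charIndex - 1);
    }

    // What generated code for a labelled fragment could call instead of
    // constructing the default token class directly.
    protected TokenSketch labelledFragment(int fragmentStart) {
        return createToken(input, INVALID_TOKEN_TYPE, DEFAULT_CHANNEL,
                           fragmentStart, charIndex - 1);
    }
}

// A user lexer overrides only the factory method.
class MyLexerSketch extends LexerSketch {
    String mode = "strict"; // example of lexer state the token wants to record

    MyLexerSketch(CharSequence input) { super(input); }

    @Override
    protected TokenSketch createToken(CharSequence input, int type, int channel,
                                      int start, int stop) {
        return new MyTokenSketch(input, type, channel, start, stop, mode);
    }
}
```

With this shape, a cast to the custom token type is safe for emitted tokens
and labelled fragments alike. Tokens created by the parser during error
recovery would need to go through a similar hook for the parser-side cast
to be safe as well.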

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com