[antlr-interest] MismatchedTokenException

Wed Dec 16 16:37:48 PST 2009

Marcin Rzeźnicki wrote:
> 2009/12/14 Marcin Rzeźnicki <marcin.rzeznicki at gmail.com>:
>> 2009/12/13 Jim Idle <jimi at temporal-wave.com>:
>>> This usually means that your lexer token numbers are out of sync with your
>>> parser tokens. Regen in correct order and make sure all tokens have been
>>> declared.
>>>
>> Umm, what if I work with combined grammar? And some of literals are 'inlined'?
> 
> I think I know what has been causing this problem but I am scratching
> my head. It seems that ANTLR lexer is, well, a strange beast.
> I have a rule, say
> CLASS
>   :
>   'class'
>   ;
> 
> and below
> 
> IDENTIFIER
>   :
>   {Character.isJavaIdentifierStart(input.LA(1))}?=> . (
> {Character.isJavaIdentifierPart(input.LA(1))}?=> . )*
>   ;
> 
> (the latter rule has been questioned here, but bear with me a while, I
> need it to explain my case)
> 
> Now, upon seeing input 'class' ANTLR matches IDENTIFIER because of
> this gating predicate. Well, 'class' would have been a valid
> identifier, of course but shouldn't it try to match 'class' based on
> rules precedence?

This seems to be an idiosyncrasy of how ANTLR lexers treat gated semantic
predicates. Although . can match the 'c' in 'class', ANTLR apparently fails
to take that into account when the alternative is guarded by a predicate.
That is the reason for the additional complexity in the rules that I posted
earlier:

fragment IdentifierStartASCII
  : 'a'..'z'
  | 'A'..'Z'
  | '$'
  | '_'
  ;

fragment IdentifierPart
  : IdentifierStartASCII
  | '0'..'9'
  | { Character.isJavaIdentifierPart(input.LA(1)) }?
      { matchAny(); }
  ;

// This generates mIdentifierRest() used below.
fragment IdentifierRest
  : IdentifierPart*
  ;

IDENTIFIER
  : IdentifierStartASCII IdentifierRest
  | { if (!Character.isJavaIdentifierStart(input.LA(1))) {
        throw new NoViableAltException("identifier start", 0, 0, input);
      }
      matchAny(); mIdentifierRest(); }
  ;

Because the IdentifierStartASCII production declaratively lists the
possible start characters that are ASCII (and therefore might also
begin another rule), there is sufficient information for ANTLR to
generate a correct *and efficient* DFA. Also note that because the
second alternative of IDENTIFIER is just a code block and does not
explicitly match anything, ANTLR appears not to make some of the
incorrect inferences it otherwise would. (At least, the generated code
is correct as far as I can see, whereas minor variants give incorrect
code.)
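
To make the intent of the two IDENTIFIER alternatives concrete, here is a
rough hand-written Java equivalent. It is only a sketch of the matching
logic, not what ANTLR generates; the class and method names
(IdentifierSketch, isAsciiStart, isPart, matchIdentifier) are made up for
illustration, and it works on UTF-16 code units in the same way that
input.LA(1) does on a plain string stream.

```java
// Hand-written illustration only; this is NOT code generated by ANTLR.
final class IdentifierSketch {
    // Mirrors the IdentifierStartASCII fragment: 'a'..'z' | 'A'..'Z' | '$' | '_'
    static boolean isAsciiStart(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '$' || c == '_';
    }

    // Mirrors IdentifierPart: ASCII cases listed explicitly, the rest via the library check.
    static boolean isPart(int c) {
        return isAsciiStart(c) || (c >= '0' && c <= '9') || Character.isJavaIdentifierPart(c);
    }

    // Returns the index just past the identifier starting at pos, or -1 if none starts there.
    static int matchIdentifier(String input, int pos) {
        if (pos >= input.length()) {
            return -1;
        }
        int c = input.charAt(pos);
        // Alternative 1: declarative ASCII start. Alternative 2: fall back on the library check.
        if (!isAsciiStart(c) && !Character.isJavaIdentifierStart(c)) {
            return -1;
        }
        int end = pos + 1;
        while (end < input.length() && isPart(input.charAt(end))) {
            end++;
        }
        return end;
    }

    public static void main(String[] args) {
        System.out.println(matchIdentifier("class Foo", 0));              // ASCII start
        System.out.println(matchIdentifier("\u017C\u00F3\u0142w = 1", 0)); // non-ASCII start
        System.out.println(matchIdentifier("9lives", 0));                 // not an identifier
    }
}
```

Note that 'class' is accepted here as well; in the grammar it is the
declaratively listed ASCII alternative that lets ANTLR see the overlap
with the CLASS rule.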

It could be argued that there are some ANTLR bugs here: what should
happen is that the predicate is hoisted into the DFA and so your original
code above should work.

On the question of whether to do it this way, or instead match a more
inclusive grammar and then check validity in the parser: I don't think
this is as clear-cut as Jim Idle makes it out to be. The above approach
has the advantage that you don't need to list the Unicode characters
that might be valid in identifiers, which helps to avoid a lexer size
explosion. How errors are reported is a separate issue; the code above
could easily do something else instead of throwing NoViableAltException,
and it could catch errors from mIdentifierRest(). Personally, I prefer
to leave the lexer grammar so that it matches only the desired language,
but overload the error handling methods so that the response to a
lexer error is to generate a token with the error information attached
to it.
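
As a rough illustration of that last idea, the following stand-alone Java
sketch emits an ERROR token carrying a message instead of throwing when
it meets a character it cannot match. It deliberately does not use the
ANTLR runtime; ErrorTokenSketch, Kind, Token and tokenize are
hypothetical names, and a real ANTLR lexer would do the equivalent in its
own error-handling hooks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone lexer; the names here are illustrative, not ANTLR API.
final class ErrorTokenSketch {
    enum Kind { IDENTIFIER, ERROR }

    static final class Token {
        final Kind kind;
        final String text;
        final String error; // null for well-formed tokens

        Token(Kind kind, String text, String error) {
            this.kind = kind;
            this.text = text;
            this.error = error;
        }
    }

    // The lexer still matches only valid identifiers; anything else
    // (apart from whitespace) becomes an ERROR token rather than an exception.
    static List<Token> tokenize(String input) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i++;
                while (i < input.length() && Character.isJavaIdentifierPart(input.charAt(i))) {
                    i++;
                }
                out.add(new Token(Kind.IDENTIFIER, input.substring(start, i), null));
            } else {
                // Attach the error information to the token and keep lexing.
                out.add(new Token(Kind.ERROR, String.valueOf(c),
                        "unexpected character at offset " + i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("foo # bar")) {
            System.out.println(t.kind + " '" + t.text + "'"
                    + (t.error == null ? "" : " (" + t.error + ")"));
        }
    }
}
```

The parser (or a later pass) can then decide how to report ERROR tokens,
which keeps the reporting policy out of the lexer rules themselves.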

-- 
David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com
