[antlr-interest] MismatchedTokenException

Thu Dec 17 08:11:12 PST 2009

On Wed, Dec 16, 2009 at 8:23 PM, Jim Idle <jimi at temporal-wave.com> wrote:
> I think that the problem is you are trying to use the gated predicate to continue consuming. Instead just use action code and then the gated predicate will just select the rule. Here is a working example:
>
> grammar T;
>
> aaa : rule+ EOF
>   ;
>
> rule
>  : classtok
>  | ident
>  ;
>
> classtok : CLASS;
> ident : IDENTIFIER;
>
> CLASS
>  :
>  'class'
>  ;
>
>
> IDENTIFIER
>  :
>  {Character.isJavaIdentifierStart(input.LA(1))}?=> . { while (Character.isJavaIdentifierPart(input.LA(1))) { input.consume(); } }
>  ;
>
>  WS : (' '|'\t'|'\n'|'\r')+ { skip(); } ;
>
> As previously stated, your rule here will cause the lexer to just barf on a character that is invalid. So if you construct the set of characters that cannot be anything else in your token set and use that in your while loop then you will be able to check the IDENTIFIER you pick up and validate it, resulting in a much nicer error message. If you can rely on the input being good, then you perhaps don't need to worry about that.
>

Unfortunately this does not work. When you try to match, say,
'classification', it is split into a CLASS token and an IDENTIFIER
'ification'.

The problem with the original example I posted is that, judging from
the tokens DFA, after successfully matching a keyword the lexer looks
one character ahead and branches on whether the
isIdentifierStart(LA(1)) predicate holds or does not hold. In both
cases it assumes that an IDENTIFIER may start anywhere (at least that
is my reading), completely ignoring the isJavaIdentifierPart guard.
It should test isJavaIdentifierPart(LA(1)) instead, so I treat this
as another bug (sigh).

It partially works if I change the identifier rule to:

  {Character.isJavaIdentifierStart(input.LA(1))}?=>
      ( {Character.isJavaIdentifierPart(input.LA(1))}?=> . )*

which is mostly fine, because every identifier start character is
also a valid identifier part. But then the lexer explodes into a
myriad of states and generation usually ends abruptly with an
OutOfMemoryError, not to mention that the result would probably not
be very efficient. That is mostly because every transition is
accompanied by two additional predicate checks (another sigh).

I am resigned. I expected problems with large grammars, but I never
suspected that I would be fighting mostly with identifier matching.
I am not sure I remember correctly, but this kind of problem was
easily solved by the 'keywords' concept in ANTLR v2. It seems that
better is the enemy of good once more. Thank you very much for your
help, Jim.
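
For illustration only, the keyword-table idea can be sketched in plain
Java, outside ANTLR. This assumes that the 'keywords' concept means
"consume the longest identifier first, then look the whole lexeme up
in a table"; it is not ANTLR-generated code and not how either ANTLR
version is implemented. The class name KeywordScanSketch, the
tokenize method and the "TYPE:text" output format are made up for
this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class KeywordScanSketch {
    // Hypothetical keyword table; 'class' is the only keyword in the grammar above.
    private static final Map<String, String> KEYWORDS = Map.of("class", "CLASS");

    /** Splits the input into "TYPE:text" strings, skipping whitespace. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                i++; // same character set as the WS rule
                continue;
            }
            if (!Character.isJavaIdentifierStart(c)) {
                throw new IllegalArgumentException(
                        "unexpected character '" + c + "' at offset " + i);
            }
            int start = i++;
            // Consume the longest run of identifier-part characters first...
            while (i < input.length()
                    && Character.isJavaIdentifierPart(input.charAt(i))) {
                i++;
            }
            String text = input.substring(start, i);
            // ...and only then decide whether the whole lexeme is a keyword.
            tokens.add(KEYWORDS.getOrDefault(text, "IDENTIFIER") + ":" + text);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [CLASS:class, IDENTIFIER:classification]
        System.out.println(tokenize("class classification"));
    }
}
```

Because the keyword decision is made after the whole identifier has
been consumed, 'classification' stays a single IDENTIFIER and no
per-character predicate is needed to tell it apart from 'class'.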

-- 
Greetings
Marcin Rzeźnicki