[antlr-interest] Lexer and Java keywords

Thu Dec 10 08:27:19 PST 2009

On Thu, Dec 10, 2009 at 8:59 AM, Jim Idle <jimi at temporal-wave.com> wrote:
> No - this is the wrong technique: the lexer is simpler, but it still rejects malformed identifiers in the wrong way. You have to look for a valid start character, then consume until you reach something that MUST be something other than an identifier character. What you are looking to do is interpolate an identifier that has invalid characters, then issue "Identifiers cannot contain character 'xxxx'" etc. The trick is not to match only the characters that are valid in identifiers, but to stop on characters that definitely cannot be part of one. There is a subset that reduces the error margins considerably. Otherwise you throw lexical errors and bunches of unrelated errors.
>

I approached the problem as you suggested, using semantic predicates.
I have yet to test how it behaves when malformed input is read, but
I think this change made the parser more efficient. I changed the
IDENTIFIER rule to:

IDENTIFIER
  : {Character.isJavaIdentifierStart(input.LA(1))}?=> .
    ( {Character.isJavaIdentifierPart(input.LA(1))}?=> . )*
  ;
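
For comparison, here is a rough plain-Java sketch of the two scanning
strategies discussed in this thread. It is not ANTLR-generated code: the
class IdentifierScan, its methods, and the set of "definite stop"
characters are all assumptions made up for illustration. strictEnd applies
the same Character tests as the predicates in the rule above, while
lenientEnd keeps consuming until a character that definitely cannot belong
to an identifier and records a message for each invalid character it
passes over.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of ANTLR.
class IdentifierScan {

    // Strict scan: uses the same tests as the gated predicates in IDENTIFIER.
    // Returns the end index (exclusive) of the identifier that begins at
    // 'start', or 'start' itself if no identifier begins there.
    static int strictEnd(String s, int start) {
        if (start >= s.length()) return start;
        int cp = s.codePointAt(start);
        if (!Character.isJavaIdentifierStart(cp)) return start;
        int i = start + Character.charCount(cp);
        while (i < s.length()) {
            cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) break;
            i += Character.charCount(cp);
        }
        return i;
    }

    // Assumed stop set: whitespace plus common Java punctuation and operators.
    // A real lexer would derive this set from its own token definitions.
    static boolean isDefiniteStop(int cp) {
        return Character.isWhitespace(cp)
            || "(){}[];,.=+-*/%<>!&|^~?:\"'".indexOf(cp) >= 0;
    }

    // Lenient scan: after a valid start character, consume up to the first
    // definite stop character, adding one message to 'errors' for every
    // character that is not a valid identifier part.
    static int lenientEnd(String s, int start, List<String> errors) {
        if (start >= s.length()) return start;
        int cp = s.codePointAt(start);
        if (!Character.isJavaIdentifierStart(cp)) return start;
        int i = start + Character.charCount(cp);
        while (i < s.length()) {
            cp = s.codePointAt(i);
            if (isDefiniteStop(cp)) break;
            if (!Character.isJavaIdentifierPart(cp)) {
                errors.add("Identifiers cannot contain character '"
                    + new String(Character.toChars(cp)) + "'");
            }
            i += Character.charCount(cp);
        }
        return i;
    }

    public static void main(String[] args) {
        String input = "foo#bar = 1";
        List<String> errors = new ArrayList<>();
        System.out.println("strict end:  " + strictEnd(input, 0));
        System.out.println("lenient end: " + lenientEnd(input, 0, errors));
        System.out.println("errors:      " + errors);
    }
}
```

With the strict scan, an input such as "foo#bar" ends the identifier at
'#', leaving the '#' to surface as a separate lexical error. The lenient
scan treats the whole run as one malformed identifier and produces a
single, targeted message.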

-- 
Greetings
Marcin Rzeźnicki