[antlr-interest] Lexer and Java keywords

Thu Dec 10 09:04:55 PST 2009

You're making this too complicated. Parse the identifier as loosely as possible. Many improper identifiers don't actually cause any problems in parsing, so you can treat them as valid in the lexer and report them as compiler errors, like other semantic problems, in post-AST analysis; at that stage the identifiers are just string keys that reference code constructs. After you perform semantic analysis, check each identifier (variable names, method names, etc.) by calling the Character class methods. Log the errors, but you don't have to stop the analysis just because of them.
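
As a rough illustration of that post-AST check, here is a minimal Java sketch. The class and method names (IdentifierCheck, check) are made up for this example; only the Character methods are real JDK calls. It collects messages instead of throwing, so the analysis can keep going, and it does not check for reserved words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of ANTLR.
final class IdentifierCheck {
    /** Returns one message per offending character; an empty list means the name is acceptable. */
    static List<String> check(String name) {
        List<String> errors = new ArrayList<>();
        if (name.isEmpty()) {
            errors.add("Identifier is empty");
            return errors;
        }
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            errors.add("Identifier '" + name + "' cannot start with character '"
                    + new String(Character.toChars(first)) + "'");
        }
        // Walk the remaining code points so supplementary characters are handled correctly.
        for (int i = Character.charCount(first); i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                errors.add("Identifier '" + name + "' cannot contain character '"
                        + new String(Character.toChars(cp)) + "'");
            }
            i += Character.charCount(cp);
        }
        return errors;
    }
}
```

Calling IdentifierCheck.check on the text of each identifier node and appending the results to the error log keeps the lexer and parser free of these checks.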

The general rule is: don't engineer your parser to fail until you can no longer provide useful error messages. You can always stop early manually. For example, I sometimes throw an OperationCancelledException in an error listener to stop a background parse for IDE IntelliSense after a user-specified number of errors have been logged.
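
The stop-early idea could look roughly like this in Java. Everything here is a sketch under assumptions: ErrorListener, LimitingErrorListener and BackgroundParse are hypothetical names, not ANTLR types, and the JDK's java.util.concurrent.CancellationException stands in for the exception mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CancellationException;

/** Hypothetical callback interface; not an ANTLR type. */
interface ErrorListener {
    void error(String message);
}

/** Logs every error and cancels the parse once maxErrors have been recorded. */
final class LimitingErrorListener implements ErrorListener {
    private final int maxErrors;
    private final List<String> errors = new ArrayList<>();

    LimitingErrorListener(int maxErrors) {
        this.maxErrors = maxErrors;
    }

    @Override
    public void error(String message) {
        errors.add(message);
        if (errors.size() >= maxErrors) {
            // Unchecked, so it unwinds through the parser without changes to rule signatures.
            throw new CancellationException("stopped after " + maxErrors + " errors");
        }
    }

    List<String> errors() {
        return errors;
    }
}

final class BackgroundParse {
    /** Stand-in for a parser run: reports each problem in turn; returns false if cancelled. */
    static boolean run(List<String> problems, ErrorListener listener) {
        try {
            for (String problem : problems) {
                listener.error(problem);
            }
            return true;
        } catch (CancellationException e) {
            return false; // cancelled early; errors logged so far remain available
        }
    }
}
```

The errors logged before the cancellation are still in the listener, so the IDE can show them even though the parse did not finish.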

I may have missed a couple chars that are used by other language constructs (Jim?), but this should be close:

IDENTIFIER
    :   IDENTIFIER_START
        IDENTIFIER_CHAR* 
    ;

fragment
IDENTIFIER_START
    : ~(OPERATOR_CHAR | LITERAL_CHAR | DIGIT | WS_CHAR)
    ;

fragment
IDENTIFIER_CHAR
    : ~(OPERATOR_CHAR | LITERAL_CHAR | WS_CHAR)
    ;

fragment
OPERATOR_CHAR
    : '+' | '-' | '~' | '!' | '*' | '/' | '%'
    | '<' | '>' | '=' | '&' | '^' | '|' | '?' | ':'
    | ';' | '\\' | '.'
    ;

fragment
LITERAL_CHAR
    : '"' | '\''
    ;
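
To see what these fragments accept, here is a hand-written Java sketch of the same loose match. DIGIT and WS_CHAR are not defined above, so this assumes Character.isDigit and Character.isWhitespace for them; the class name LooseIdentifierScanner is made up for the example.

```java
// Hypothetical hand-rolled equivalent of the IDENTIFIER rule above, for illustration only.
final class LooseIdentifierScanner {
    // Same characters as OPERATOR_CHAR and LITERAL_CHAR in the grammar.
    private static final String OPERATOR_CHARS = "+-~!*/%<>=&^|?:;\\.";
    private static final String LITERAL_CHARS = "\"'";

    /** IDENTIFIER_CHAR: anything that is not an operator, literal or whitespace character. */
    static boolean isIdentifierChar(char c) {
        return OPERATOR_CHARS.indexOf(c) < 0
                && LITERAL_CHARS.indexOf(c) < 0
                && !Character.isWhitespace(c); // assumption: WS_CHAR means Java whitespace
    }

    /** IDENTIFIER_START: an identifier character that is not a digit. */
    static boolean isIdentifierStart(char c) {
        return isIdentifierChar(c) && !Character.isDigit(c); // assumption: DIGIT means isDigit
    }

    /** Returns the exclusive end index of the loose identifier at start, or start if there is none. */
    static int scan(String input, int start) {
        if (start >= input.length() || !isIdentifierStart(input.charAt(start))) {
            return start;
        }
        int end = start + 1;
        while (end < input.length() && isIdentifierChar(input.charAt(end))) {
            end++;
        }
        return end;
    }
}
```

With the character sets as listed, an input such as foo#bar+1 yields the single loose identifier foo#bar, which the later semantic check can reject with a precise message. Parentheses, brackets and commas would also be swallowed, which is the kind of gap the caveat above refers to.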

-----Original Message-----
From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-bounces at antlr.org] On Behalf Of Marcin Rzeznicki
Sent: Thursday, December 10, 2009 10:27 AM
To: Jim Idle
Cc: antlr-interest at antlr.org
Subject: Re: [antlr-interest] Lexer and Java keywords

On Thu, Dec 10, 2009 at 8:59 AM, Jim Idle <jimi at temporal-wave.com> wrote:
> No - this is the wrong technique as what happens is that the lexer is simpler but still rejects malformed identifiers in the wrong way. You have to look for a valid start character, then consume until something MUST be something other than an identifier character. What you are looking to do is interpolate an identifier that has invalid characters, then issue "Identifiers cannot contain character 'xxxx'" etc. The trick is to not match characters that are identifiers but stop on characters that definitely cannot be. There is a subset that reduces the error margins considerably. Otherwise you throw lexical errors and bunches of unrelated errors.
>

I approached the problem as you suggested - using semantic predicates.
I have yet to test how it behaves when malformed input is read, but
I think this change made the parser more efficient. I transformed
the IDENTIFIER rule to:

IDENTIFIER
  :
  {Character.isJavaIdentifierStart(input.LA(1))}?=> .
  ( {Character.isJavaIdentifierPart(input.LA(1))}?=> . )*
  ;

-- 
Greetings
Marcin Rzeźnicki
