[antlr-interest] Lexer and Java keywords

Wed Dec 9 23:59:39 PST 2009

No - this is the wrong technique: the lexer is simpler, but it still rejects malformed identifiers in the wrong way. You have to look for a valid start character, then consume until you reach a character that MUST be something other than part of an identifier. What you are looking to do is recognize an identifier that contains invalid characters, then issue "Identifiers cannot contain character 'xxxx'" etc. The trick is not to match only the characters that are valid in identifiers, but to stop on the characters that definitely cannot be part of one. There is a subset of such characters that reduces the margin for error considerably. Otherwise you throw lexical errors and bunches of unrelated errors.
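
To make the idea concrete, here is a minimal sketch in plain Java rather than ANTLR lexer code. It assumes the caller has already seen a valid start character. The class name IdentifierScan, the isDefiniteStop() helper and its particular set of stop characters are all made up for illustration; the right set depends on the language being lexed.

```java
import java.util.ArrayList;
import java.util.List;

final class IdentifierScan {
    /** Text consumed as one identifier-like run, plus any complaints about it. */
    static final class Result {
        final String text;
        final List<String> errors = new ArrayList<>();
        Result(String text) { this.text = text; }
    }

    // Assumption: whitespace and these punctuation characters can never be part
    // of an identifier, so they are safe places to stop. Adjust per language.
    static boolean isDefiniteStop(int c) {
        return Character.isWhitespace(c) || "(){}[];,.+-*/%=<>!&|^~?:\"'".indexOf(c) >= 0;
    }

    // Precondition (hypothetical contract): input.codePointAt(start) is a valid
    // identifier start character, already checked by the caller.
    static Result scan(String input, int start) {
        int end = start;
        // Consume everything up to the first character that definitely ends the run.
        while (end < input.length() && !isDefiniteStop(input.codePointAt(end))) {
            end += Character.charCount(input.codePointAt(end));
        }
        Result result = new Result(input.substring(start, end));
        // Report invalid characters inside the run instead of failing the lex.
        int i = start + Character.charCount(input.codePointAt(start));
        while (i < end) {
            int c = input.codePointAt(i);
            if (!Character.isJavaIdentifierPart(c)) {
                result.errors.add("Identifiers cannot contain character '"
                        + new String(Character.toChars(c)) + "'");
            }
            i += Character.charCount(c);
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = scan("foo#bar = 1", 0);
        System.out.println(r.text);
        System.out.println(r.errors);
    }
}
```

The whole run "foo#bar" comes back as a single token with one targeted message attached, so the parser still sees an identifier and does not cascade into unrelated errors.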

Jim

> -----Original Message-----
> From: antlr-interest-bounces at antlr.org [mailto:antlr-interest-
> bounces at antlr.org] On Behalf Of David-Sarah Hopwood
> Sent: Wednesday, December 09, 2009 10:09 PM
> To: antlr-interest at antlr.org
> Subject: Re: [antlr-interest] Lexer and Java keywords
> 
> Jim Idle wrote:
> > The issue is that your lexer is too complicated for the standard
> timeout on analysis values. Use:
> >
> > -Xconversiontimeout=32000
> >
> > And it will generate just fine.
> [...]
> 
> This is probably due to listing the character ranges for JavaLetter and
> JavaLetterOrDigit explicitly. Using the technique below (based on code
> from the ECMAScript 3 grammar by Patrick Hulsmeijer) will probably
> allow the lexer to be small enough to generate with the default
> timeout. Note that you'll have to adjust this for any differences
> between the identifier syntax of the language you're trying to parse, and that
> of Java -- I notice that you had '\u0000'..'\u0008' |
> '\u000e'..'\u001b' in JavaLetterOrDigit, for example.
> 
> 
> fragment IdentifierStartASCII
>   : 'a'..'z'
>   | 'A'..'Z'
>   | '$'
>   | '_'
>   ;
> 
> fragment IdentifierPart
>   : IdentifierStartASCII
>   | '0'..'9'
>   | { Character.isJavaIdentifierPart(input.LA(1)) }?
>       { matchAny(); }
>   ;
> 
> // This generates mIdentifierRest() used below.
> fragment IdentifierRest
>   : IdentifierPart*
>   ;
> 
> IDENTIFIER
>   : IdentifierStartASCII IdentifierRest
>   | { if (!Character.isJavaIdentifierStart(input.LA(1))) {
>         throw new NoViableAltException("identifier start", 0, 0, input);
>       }
>       matchAny(); mIdentifierRest(); }
>   ;
> 
> --
> David-Sarah Hopwood  ⚥  http://davidsarah.livejournal.com
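
For anyone adapting the quoted grammar, a small standalone Java check of the two predicates it relies on may help when deciding what to adjust. The class name IdentifierPredicates and the classify() helper are made up for illustration; only Character.isJavaIdentifierStart(int) and Character.isJavaIdentifierPart(int) come from the JDK. Note that the control characters in '\u0000'..'\u0008' and '\u000e'..'\u001b' count as identifier parts in Java because they are "ignorable" characters, which is presumably why they appeared in the original JavaLetterOrDigit rule.

```java
final class IdentifierPredicates {
    // Classify a code point the way the quoted IDENTIFIER rule would see it.
    static String classify(int codePoint) {
        if (Character.isJavaIdentifierStart(codePoint)) {
            return "start";   // may begin an identifier (and may also continue one)
        }
        if (Character.isJavaIdentifierPart(codePoint)) {
            return "part";    // may only appear after the first character
        }
        return "neither";
    }

    public static void main(String[] args) {
        int[] samples = { 'a', '$', '_', '7', 0x00E9, 0x0000, 0x001B, '#', ' ' };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, classify(cp));
        }
    }
}
```

If the language being parsed does not allow such ignorable characters in identifiers, the predicate in IdentifierPart would need an extra condition to exclude them.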